
Lemmatization for variation-rich languages using deep learning

Mike Kestemont, Guy de Pauw, Renske van Nie, and

Walter Daelemans

University of Antwerp, Belgium

Abstract

In this article, we describe a novel approach to sequence tagging for languages that are rich in (e.g. orthographic) surface variation. We focus on lemmatization, a basic step in many processing pipelines in the Digital Humanities. While this task has long been considered solved for modern languages such as English, there exist many (e.g. historic) languages for which the problem is harder to solve, due to a lack of resources and unstable orthography. Our approach is based on recent advances in the field of 'deep' representation learning, where neural networks have led to a dramatic increase in performance across several domains. The proposed system combines two approaches: on the one hand, we apply temporal convolutions to model the orthography of input words at the character level; secondly, we use distributional word embeddings to represent the lexical context surrounding the input words. We demonstrate how this system reaches state-of-the-art performance on a number of representative Middle Dutch data sets, even without corpus-specific parameter tuning.

1 Introduction

Sequence tagging is a central problem in Natural Language Processing (NLP) (Jurafsky and Martin, 2009). It includes well-known applications such as part-of-speech (PoS) tagging or lemmatization (Chrupala et al., 2008; Toutanova et al., 2003), which are basic, yet foundational stages in many text processing pipelines in the Digital Humanities. Lemmatization offers a form of normalization which helps to more efficiently process historic text corpora such as newspaper databases, ranging from plain searching (Manning et al., 2008) to advanced text categorization tasks in e.g. stylometry (Van Dalen-Oskam and Van Zundert, 2007). For present-day, resource-rich languages, such as English, the task of lemmatization is often considered solved. However, there exist many languages for which these tasks are much more difficult.

First, there are the historic precursors of modern languages. Most historic stages of present-day languages are characterized by the absence of a generally accepted standard language and orthography. In pre-modern times, language users typically enjoyed a relatively larger freedom with respect to spelling and spacing, since there only existed local (or at most regional) orthographical conventions (Piotrowski, 2012). Therefore, the same word could appear in multiple, roughly equivalent spellings (even within the same text), which obviously presents major challenges to computational applications when it comes to word identification. The recent, dramatic growth in the availability of historic text corpora in electronic form has only increased users' needs to be able to efficiently deal with this kind of data in computer applications (Ernst-Gerlach and Fuhr, 2006). Most systems developed for the lemmatization of present-day languages are not equipped to deal with large amounts

Correspondence: Mike Kestemont, University of Antwerp, Prinstraat 13 (S.D. 118), 2000 Antwerp, Belgium. E-mail: mike.kestemont@uantwerp.be

Digital Scholarship in the Humanities © The Author 2016. Published by Oxford University Press on behalf of EADH. All rights reserved. For Permissions, please email: journals.permissions@oup.com

doi:10.1093/llc/fqw034

Digital Scholarship in the Humanities Advance Access published August 26, 2016

of orthographic variation in the input, and struggle to reliably identify unknown words in unseen texts. Naturally, this results in a considerable percolation of errors to higher-end layers in a more complex processing pipeline (e.g. full syntactic parsers).

Secondly, there exist many present-day, resource-scarce languages for which far fewer resources are available, e.g. in the form of annotated corpora. Both historic and resource-scarce languages make up the focus of many Digital Humanities projects targeting the better exploitation or preservation of (specific cultural artefacts in) these languages. For these too, however, many of the available lemmatizers are not sufficiently equipped to deal with the advanced sub-word level variation, nor to optimally exploit the few resources that are available.

Needless to say, the availability of efficient lemmatizers for historic and/or low-resource languages is therefore a desideratum across various fields in humanities computing. In this article, we describe a novel, language-independent approach to sequence tagging for variation-rich languages. To this end, we apply a set of techniques from the field of 'Deep' Representation learning, a family of algorithms from Machine Learning based on neural networks, which has recently led to a dramatic increase in performance across various domains. In this article, we will first survey some of the most relevant related research, before we go on to introduce the field of deep learning. We then describe the architecture of our lemmatizer and discuss the Middle Dutch data sets on which we will evaluate our system. After presenting the results, we offer an interpretative exploration of our trained models. We conclude with a set of pointers for future research, stressing the considerable potential of representation learning for the Humanities at large.

2 Related Research

Lemmatization is closely related to morphological analysis and PoS tagging, which is a popular research domain in computational linguistics, with recent studies especially focusing on unsupervised learning and on (low-resource) languages with a rich morphology or inflectional system (e.g. Mitankin et al., 2014). An important boost in NLP research for such languages has been inspired by the increase in computer-mediated communication (CMC; Crystal, 2001), including research into the normalization of Instant Messaging or posts on online blogging platforms, such as the popular micro-blogging service Twitter. In CMC too, we find a wild proliferation of language variation in written communication, especially affecting surface phenomena such as spelling. A common solution to the problem of spelling variants in CMC is to apply some form of spelling normalization before subsequent processing. Similar research has been reported in the field of post-Optical Character Recognition (OCR) error correction (Reynaert et al., 2012). Spelling normalization essentially involves replacing a non-canonical surface token by a standard form (e.g. Beaufort et al., 2010; Chrupala, 2014), as would be done with spelling correctors in modern word processing applications. In social media, a variety of approaches have been suggested (Schulz et al., 2016), including e.g. transliteration approaches borrowed from Machine Translation (e.g. Ljubesic et al., 2014).

The field of NLP for Historical Languages has recently been surveyed in a dedicated monograph (Piotrowski, 2012). The problem of orthographical variation naturally plays a major role in this overview. Recent representative contributions in this field include Hendrickx and Marquilhas (2011) and Reynaert et al. (2012) for historical Portuguese, Scherrer and Erjavec (2013) for historical Slovene, Bollmann (2012) for Early New High German, or Bouma and Hermans (2012) on the automated syllabification of Middle Dutch. One characteristic feature that sets this kind of research into present-day languages apart from historical language research is that for modern languages, researchers typically are able to normalize noisy language data into an existing canonical standard form. Nevertheless, the majority of historical languages lack such a canonical orthographic variant, and therefore this solution can rarely be applied (Souvay and Pierrel, 2009): in the absence of a standard language, words simply do not have a single canonical spelling by which other, non-standard spellings can be replaced. It is therefore also common to annotate words with other sorts of labels, instead of attempting an explicit respelling. Lemmas are often used in this respect (Van der Voort van der Kleij, 2005): a lemma is a normalized label which unambiguously links words to the same entry (headword) in a lexical resource, such as a dictionary, if they only differ in inflection or spelling (Knowles and Mohd Don, 2004).

In the present article, we conceive of lemmatization as a high-dimensional classification task, where lemmas are assigned to tokens as class labels (i.e. dictionary headwords are the system's output). Therefore, our lemmatizer does not perform any explicit PoS disambiguation (although the system could easily be trained on PoS labels too). Because of this classification setting, the proposed lemmatizer will not be able to predict new, unseen lemmas at test time, because it has not seen these labels during training (cf. Chrupala et al., 2008). However, this also prevents the lemmatizer from outputting non-existent 'gibberish' labels, which is desirable given the limited size of our data sets. Lemmatization too has been an active area of research in computational linguistics, especially for languages with a rich inflectional system (Daelemans et al., 2009; Scherrer and Erjavec, 2013; De Pauw and De Schryver, 2008; Lyras et al., 2008). Many studies apply a combination of existing taggers (for contextual disambiguation) with an optimized spelling normalization component to better deal with unknown words.

One rough distinction which could be made is between 'online' and 'offline' approaches. In 'online' approaches, when confronted with an unknown word at the testing stage, systems will explicitly attempt to identify the unknown word as a spelling variant of a known token (i.e. a token which was seen during training), before applying any disambiguation routines. To this end, researchers typically apply a combination of (weighted) string distance measures (e.g. the Levenshtein distance; Levenshtein, 1966). While the online strategy might result in a high recall, it can also be expensive to apply, because the unknown token has to be matched with all known tokens, and some string distance measures can be costly to apply. The offline strategy follows a different road: at training time, it will attempt to generate new, plausible spelling variants of known tokens, through aligning words and applying common edit patterns to generate new forms. At the testing stage, new words can simply be matched against the newly generated forms. Interestingly, this approach will be fast to apply during testing (and might result in a high precision), but it might also suffer from the noise being introduced in the variant generation phase (e.g. overlapping variants).1 Van Halteren and Rem (2013) have reported the successful application of such an approach, which is in many ways reminiscent of the generation of search query variants in Information Retrieval research.
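As a rough illustration of the 'online' strategy described above, the sketch below maps an unseen token to its closest known training token with a plain Levenshtein distance. This is not the authors' implementation; the function names and the brute-force lookup are only meant to make the idea (and its cost) concrete.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_known_token(unknown, known_tokens):
    """Return the training token with the smallest edit distance (brute force,
    hence potentially expensive for large training vocabularies)."""
    return min(known_tokens, key=lambda t: levenshtein(unknown, t))

# e.g. closest_known_token('ghelik', {'gelik', 'coninghinne', 'seide'}) -> 'gelik'
```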

3 'Deep' Representation Learning

In this article, we describe a novel system for lemmatization using 'deep' representation learning, which has barely been applied to the problem of lemmatization for historic and/or low-resource languages specifically. Importantly, we will show that a single system reaches state-of-the-art tagging performance across a variety of data sets, even without the problem-specific fine-tuning of hyper-parameters. Deep learning is a popular paradigm in machine learning (LeCun et al., 2015). It is based on the general idea of multi-layer neural networks: software systems in which information is propagated through a layered architecture of inter-connected neurons (or information units) that iteratively transform the input information and feed it to subsequent layers in the networks. While theoretically the idea of neural networks has been around for a long time, it has only recently become feasible to train more complex architectures, because of the millions of parameters that typically have to be optimized during training.

Deep learning is part of a broader field called representation learning or feature learning (Bengio et al., 2013). Deep learning can indeed be contrasted with more traditional approaches in machine learning, in that it is very much geared towards finding good representations of the input data, or extracting 'features' from it. Roughly put, traditional learning algorithms would be very much dependent on a researcher's representation of the input data, and would try to solve a task on the basis of the particular features suggested and manually engineered by the researchers. In Deep Learning, the process of feature engineering is to a larger extent outsourced to the learning algorithm itself. Instead of solely relying on the pre-specified features in the input data (i.e. hand-crafted by humans), deep neural networks will attempt to independently recombine the existing input features into new combinations of features, which typically offer a better representation of the input data. They do so by iteratively projecting the input onto subsequent layers of information units, which are typically said to increasingly capture more abstract and complex patterns in the input data (Bengio et al., 2013).

The intuition behind deep learning is typically illustrated using an example from computer vision, the field where some of the earliest breakthroughs in Deep Learning have been reported (e.g. handwritten digit recognition; LeCun et al., 1990). In computer vision, images are typically represented as a two-dimensional raster of pixels, which can take certain values across a series of input channels (e.g. one for every color in the RGB spectrum). When propagating an input image through a layered neural network, it has been shown that the earliest layers in the network will be receptive to very raw, primitive shapes in the data, such as a corner, parts of a curve, or a stark light-dark contrast, called 'edges' (LeCun et al., 2015). Only at subsequent layers in the network are these primitive forms recombined into higher-level units, such as an eye or ear. At still higher layers, the network will eventually become sensitive to complex units such as faces. It can therefore be said that such neural networks are able to learn increasingly abstract (re)combinations of the original features in the input (Bengio et al., 2013). Interestingly, it has been shown that visual perception in mammals works in a highly similar, hierarchical fashion (Cahieu et al., 2014). Because of its hierarchical nature, this form of learning is commonly called 'deep', since it increasingly captures 'deeper' aspects of the problem it is working on.

4 Architecture

The system architecture we describe here primarily builds upon two basic components derived from recent studies in Deep Learning for NLP: (i) one-dimensional or 'temporal' 'convolutions' to model the orthography of input words at the character level, and (ii) word 'embeddings' to model the lexical context surrounding the input tokens for the purpose of disambiguation. We will first introduce the type of 'subnets' associated with both components, before we go on to discuss how our architecture eventually combines these subnets.

4.1 Convolutions

Convolutions are a popular approach in computer vision to battle the problem of 'translations' in images (LeCun et al., 1990, 2015). Consider an object classification system whose task it is to detect the presence of certain objects in an image. If during training the system has come across e.g. a banana in the upper-left corner of a picture, we would like the system to be able to detect the same object in new images, even if the exact location of that object has shifted (e.g. towards the lower-right corner of the image). Convolutions can be thought of as a fixed-size window (e.g. 5 × 5 pixels) which is slid over the entire input image in a stepwise fashion. Convolutions are therefore also said to learn 'filters', in that they learn a filtering mechanism to detect the presence of certain features, irrespective of the exact position in which these features occur in the image. Importantly, each filter will scan the entire image for the local presence of a particular low-level feature (e.g. the yellow-coloured, curved edge of the banana) and detect it irrespective of the exact position of the feature (e.g. lower-right corner versus upper-left corner). Typically, convolutions are applied as the initial layers in a neural network, before passing on the activation of filters to subsequent layers in the network, where the detected features can be recombined into more complex features (e.g. an entire banana). Whereas convolutions in vision research are typically two-dimensional (cf. the raster of pixels), here we apply convolutions over a single dimension, namely the sequence of characters in an input word. The idea to apply convolutions to character series in NLP has recently been pioneered by Zhang et al. (2015), reporting strong results across a variety of higher-end text classification tasks. Such convolutions have also been called 'temporal' convolutions, because they model a linear sequence of items.


Representing input words as a sequence of characters (i.e. at the sub-word level) has a considerable advantage over more traditional approaches in sequence tagging. Here, input words are typically represented using a 'one-hot vocabulary encoding': given an indexed vocabulary of n known tokens, each token is represented as a long vector of n values, in which the index of one token is set to 1, whereas all other values are set to 0. Naturally, this leads to very sparse, high-dimensional word representations. In such a one-hot representation, it is therefore not uncommon to apply a cut-off and only encode the more frequent vocabulary items, to battle the sparsity of the input data. Of course, a one-hot vector representation (especially with a severe cut-off) cannot represent new, unseen tokens at test time, since these do not form part of the indexed training vocabulary. Most systems for tasks like PoS tagging would in those cases back off to additional (largely hand-crafted) features to represent the target token, such as the final character trigram of the token to capture the token's inflection (e.g. Zavrel and Daelemans, 1999).

The character-level system presented here does not struggle to model out-of-vocabulary words. This is a worthwhile characteristic with respect to the present use case: in languages with a rich orthographical and morphological variation, the dimensionality of word-level one-hot vectors dramatically increases, because the token vocabulary is much larger, and thus the effect of applying a cut-off will also be more dramatic (Kestemont et al., 2010). Additionally, the proportion of unknown target tokens at test time is of course much more prominent than in traditional languages, due to spelling variation. Therefore, our architecture aims to complement, or even replace, a conventional one-hot encoding with a convolutional component.

With respect to the convolutional subnet, our system models each input token as a sequence of characters, in which each character is represented as a one-hot vector at the character level. As the example in Table 1 shows, this leads to an at-first-sight primitive, yet much lower-dimensional token representation compared to a traditional token-level one-hot encoding, which additionally is able to represent out-of-vocabulary items. We include all characters in the character index used. We specify a certain threshold t (e.g. 15 characters): words longer or shorter than this threshold are respectively truncated to the right to length t, or are padded with zero-filled vectors. In an analogy to computer vision research, this representation models characters as if they were 'pixels', with a channel dedicated to a single character.
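The sketch below is a minimal illustration of this character-level matrix encoding (cf. Table 1); the function name and variables are hypothetical, not taken from the released code, and the truncation/padding behaviour follows the description above.

```python
import numpy as np

def encode_token(token, char_index, t=15):
    """One-hot encode a token as an n x t matrix (n = alphabet size, t = max length).
    Tokens longer than t are truncated on the right; shorter tokens are padded
    with zero-filled vectors."""
    n = len(char_index)
    matrix = np.zeros((n, t), dtype=np.float32)
    for pos, char in enumerate(token[:t]):
        if char in char_index:          # characters outside the index stay all-zero
            matrix[char_index[char], pos] = 1.0
    return matrix

# Example with a restricted alphabet, as in Table 1:
char_index = {c: i for i, c in enumerate('ceghino')}
m = encode_token('coninghinne', char_index)
print(m.shape)  # (7, 15)
```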

Our model now slides convolutional filters over this two-dimensional representation, with a length of 3 and a stride (or 'step size') of 1. In practice, this will mean that the convolutional layer will iteratively inspect partially overlapping, consecutive character trigrams in the input string. This representation is inspired by the generic approach to sequence modelling in NLP followed by Zhang et al. (2015; character-level convolutions), as well as Dieleman and Schrauwen (2014; one-dimensional convolutions for music) and Kalchbrenner et al. (2014; word-level convolutions on the basis of pretrained embeddings). At the input stage, each target token is represented by an n × t matrix, where n represents the number of distinct characters in a data set and t is the uniform length to which each target token is normalized, through padding with trailing zeroes or through cutting characters.
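A minimal sketch of such a temporal convolution over characters, using the modern Keras API (the paper's released code relied on an earlier keras/theano version): only the filter length (3) and stride (1) come from the text; the alphabet size, token length, and number of filters here are illustrative. Note that Keras expects the transposed layout (t, n), i.e. time steps first, characters as channels.

```python
from keras.layers import Input, Conv1D, Flatten
from keras.models import Model

t, n = 15, 40            # uniform token length and alphabet size (illustrative)
num_filters = 1000       # number of character filters (illustrative)

char_input = Input(shape=(t, n), name='focus_chars')          # (steps, channels)
conv = Conv1D(filters=num_filters, kernel_size=3, strides=1,  # filter length 3, stride 1
              activation='relu')(char_input)
char_subnet = Model(char_input, Flatten()(conv))
char_subnet.summary()
```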

We motivate the choice for a convolutional approach to model orthography on the basis of the following considerations. When using a filter size of e.g. 3, the receptive field size of the learned filters roughly corresponds to that of syllable-like groups in many languages (e.g. consonant-vowel-consonant groups). One would therefore expect the filters to become sensitive to morpheme-like characteristics. The most interesting aspect of convolutions seems to be that the learned filters are applied across all positions in the input string, which is likely to render the detection of morphemes less sensitive to their exact position in a word. This is a valuable property for our corpora, since the non-standard orthography often allows the introduction of silent characters (e.g. 'gh' for 'h'), causing local character shifts in the input, or the kind of 'translations' in computer vision to which we know convolutions are fairly robust ('gelik' vs. 'ghelik').2 Finally, it should be stressed that the learned filters can be expected to offer a flexible string representation: a filter which learned to detect the pattern 'dae' might still be responsive to the pattern 'dai', which is useful since 'i' and 'e' were often used interchangeably as vowel lengthening graphemes in the medieval language considered here.

4.2 Embeddings

Many sequence taggers crucially depend on contextual information for disambiguating tokens. In the classic example 'The old man the boat', contextual information is needed to determine that 'man' in this sentence should be tagged as 'verb' instead of the more common 'noun' tag for this token. Most modern taggers therefore integrate a lexical representation of the immediate context of an input token, e.g. the two tokens preceding the input token to the left and a single token to the right. Roughly put, traditional architectures will typically represent this context using a one-hot encoder at the token level: for each token in the positional skeleton surrounding the input token, a high-dimensional vector will record the presence using the index of the previously indexed training tokens (e.g. Zavrel and Daelemans, 1999). To battle sparsity in the contextual representation too, it is again not uncommon to apply a frequency-based cut-off here, or to only record the presence of a small set of functors surrounding the input token, or even hand-crafted features (such as the last two characters of the word preceding the input string, as they often coincide with morphological suffixes).

Because of the sparsity associated with one-hot encodings, research in NLP has stressed the need for smoother, distributional representations of lexical contexts in this respect. As the size of the context windows considered increases, it becomes increasingly difficult to obtain good, dense representations of contexts, especially if one restricts the model to strict n-gram matching (Bengio et al., 2003). A simple example: in a word sequence like 'the black fat cat' or 'the cute white cat', the PoS category of 'cat' can be partially inferred from the fact that two adjectives precede the target token. When presented with the new examples 'the fat white cat' or 'the cute fat cat', the left contexts 'fat white' and 'cute fat' will fail to find a hard match to one of the training contexts in the case of a strict matching process. Naturally, we would like a model to be robust in the face of such naive transpositions.

An important line of research in current NLP therefore concerns the development of models which can represent words not in terms of a simple one-hot encoder, but by using smooth, lower-dimensional word representations. Much of this research can be situated in the field of distributional semantics, being guided by the so-called 'Distributional Hypothesis': words in themselves do not have real meaning in isolation, but primarily derive meaning from the words they tend to co-occur with, i.e. their lexical context (Harris, 1954; Firth, 1957). While blarf is a non-existing word, its use in the following sentences suggests that it refers to some sort of domestic animal, perhaps a dog: 'I'm letting the blarf out', 'I'm feeding the blarf', 'The blarf ate my homework'. Thus, by modelling patterns of word co-occurrences in large corpora, research has demonstrated that various unsupervised techniques can yield numeric word representations that offer useful approximations of the meaning of words (Baroni and Lenci, 2010). The vectors which make up such word representations are also called 'word embeddings', because they reflect how words are contextually 'embedded' in corpora.

Table 1. Illustration of the character-level representation of input tokens. Dummy example with length threshold t set at 15 and a restricted character vocabulary of size 7, for the Middle Dutch word coninghinne ('queen').

char  c  o  n  i  n  g  h  i  n  n  e  -  -  -  -
c     1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
e     0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
g     0  0  0  0  0  1  0  0  0  0  0  0  0  0  0
h     0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
i     0  0  0  1  0  0  0  1  0  0  0  0  0  0  0
n     0  0  1  0  1  0  0  0  1  1  0  0  0  0  0
o     0  1  0  0  0  0  0  0  0  0  0  0  0  0  0

The past years have witnessed a clear increase in the interest in distributional semantics, in particular in the field of deep learning for NLP. Mikolov et al. (2013) have introduced an influential 'skipgram' method to obtain word vectors, commonly known as 'word2vec'. Because the underlying method attempts to optimize a fairly simple training criterion, it can be easily applied to vast quantities of text, yielding vectors which have shown excellent performance in a variety of tasks. In one popular example, Mikolov et al. (2013) even showed that it is possible to solve relatively advanced analogical problems by applying plain vector arithmetic to the word embeddings obtained from large corpora. The result of the arithmetic expression 'vector(king) - vector(man) + vector(woman)', for instance, yielded a vector which was closest to that of the word 'queen'. Other illustrative examples include 'vector(russia) + vector(river) ≈ vector(wolga)' and 'vector(paris) - vector(france) + vector(belgium) ≈ vector(brussels)'.
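Such analogy queries are typically run directly against a trained skipgram model; the hedged sketch below uses the gensim 3.x-style word2vec API (gensim is one of the dependencies mentioned in Section 4.4), and the model file name is purely hypothetical.

```python
from gensim.models import Word2Vec

# Hypothetical pretrained skipgram model trained on a large English corpus.
model = Word2Vec.load('skipgram_model.w2v')

# vector(king) - vector(man) + vector(woman) is expected to land near vector(queen).
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
```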

Follow-up studies have stressed that other techniques can yield similar results and that the theoretical foundations of the word2vec algorithm still need some clarification (Levy and Goldberg, 2014). Nevertheless, the fact that Mikolov et al. published a highly efficient open-source implementation of the algorithm has greatly contributed to the state of the art in the field of word representation learning.3 An increasing number of applications in NLP includes some form of word embedding strategy in their pipeline as a 'secret sauce' (Manning, 2015). An essential quality of modern word embeddings is that they have been shown not to be restricted to semantic aspects of words, but to also model morpho-syntactic qualities of words, illustrated by the fact that purely syntactic analogies can often also be solved: 'vector(her) - vector(she) + vector(he) ≈ vector(his)'.

The number of recent studies using word embeddings in NLP research is vast. A relevant example for the present research includes, for instance, the type of word embeddings used in the Twitter tagger described by Godin et al. (2014), but the literature abounds in other examples. In our system, we train a vanilla word2vec model using a popular implementation of the original skipgram model.4 This model learns to project the tokens in our training data in a vector space consisting of 150 dimensions. The general intuition underlying the embeddings subnet in our architecture is that the underlying skipgram model will have learned similar embeddings for words that have certain similar qualities, both at the semantic and morpho-syntactic level: in the 'cat' example above, we might expect that a model would learn similar embeddings for 'white' or 'black', because they are both adjectives referring to colours. Additionally, an interesting aspect for our medieval data is that the word embeddings learned might also be sensitive to orthographic or dialectical variation, providing similar embeddings for words that are spelling variants of each other, or words that are typical of specific dialects.

Table 2. Overview of some of the metadata and characteristics of the Middle Dutch corpora used in this study.

CG-LIT (Corpus-Gysseling, Literary texts). Origin: Institute for Dutch Lexicography. Individual texts: 301. Text variety: miscellaneous literary texts (mostly rhymed) which survive from original manuscripts dated before 1300 AD.
CG-ADMIN (Corpus-Gysseling, Administrative texts). Origin: Institute for Dutch Lexicography. Individual texts: 574. Text variety: 13th-century charters.
RELIG (Miscellaneous religious texts). Origin: CLiPS Computational Linguistics Group and Ruusbroec Institute, University of Antwerp. Individual texts: 336. Text variety: miscellaneous religious texts (e.g. Bible translations, mystical visions, sermons) spanning multiple centuries and regions.
ADELHEID (Van Reenen-Mulder-Adelheid-corpus). Origin: Radboud University Nijmegen etc. Individual texts: 792. Text variety: 14th-century charters.

For the subnets representing the lexical context on either side, we first construct a simple one-hot encoding of the context tokens, with a traditional minimum frequency-based token cut-off of two occurrences. We then project these tokens onto a standard dense layer which has the same dimensionality as our skipgram model, here and elsewhere taking the form of a simple matrix multiplication, the addition of a bias, and a nonlinearity (a 'rectified linear unit'; Nair and Hinton, 2010). Importantly, this implies that we can initialize the weights of this layer using the pretrained embeddings from the skipgram model, but that these weights can be further refined during the training phase (see below). This initialization strategy can be expected to speed up the convergence of the model. We simply concatenate the dense vectors obtained for each context word on either side of the target token into a left-context subnet and a right-context subnet. The number of context words is of course a hyper-parameter which can be further tuned.
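A hedged sketch of this pretraining and initialization step, using the gensim 3.x-style skipgram API: the 150 dimensions and the minimum frequency of two come from the text, while the corpus variable, window size, and the way the weight matrix is assembled are illustrative assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

# sentences: list of token lists from the training corpus (illustrative variable)
w2v = Word2Vec(sentences, size=150, sg=1, window=5, min_count=2, workers=4)

# Build an initial weight matrix for a context subnet: one 150-dimensional row per
# vocabulary item, so the dense layer starts from the skipgram vectors and can be
# further refined during supervised training.
vocab = sorted(w2v.wv.vocab)
init_weights = np.vstack([w2v.wv[tok] for tok in vocab])   # shape: (|V|, 150)
```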

4.3 Combining the subnets

In our system (see Figure 1 for an overview), we again use a networked strategy to combine the convolutional subnet used to represent the focus token with the subnets representing the left and right context of the focus token. Next, we feed the output of the convolutional net forward in the network, using a plain dense layer that has a dimensionality of size k (= 1024). At this point, the concatenated embeddings of the left and right context subnets have also been propagated to two vectors of size k. We can also include a one-hot encoding of the target tokens, which is propagated through a dense layer of size k. This final layer might be important to have the network obtain a good fit of known tokens; below, we will explore whether the convolutional layer alone has sufficient capacity to reliably represent the entire vocabulary of training tokens.

Table 3. Overview of the main token counts of the Middle Dutch corpora used in this study. The 'upperbound unknown' column refers to the proportion of tokens that have lemmata which also occur in the corresponding training data set; this is the theoretically maximal score which a lemmatizer could achieve for the unknown tokens.

Corpus        Split   Number of words   Unique tokens   Unique lemmas   Tokens per lemma   Perc. unknown tokens   Upperbound unknown
relig         train   129144            14668           6103            2.65               NA                     NA
              dev     16143             3686            2031            1.93               6.93                   68.78
              test    16143             3767            1999            2.01               7.44                   72.86
crm-adelheid  train   575721            41668           11163           4.1                NA                     NA
              dev     165883            17466           5547            3.42               6.46                   78.05
              test    80015             11645           4057            3.1                6.07                   79.07
cg-lit        train   464706            47460           15454           3.38               NA                     NA
              dev     58661             11557           5004            2.48               6.46                   74.38
              test    58903             11692           5001            2.51               6.65                   74.94
cg-admin      train   518017            42694           13129           3.42               NA                     NA
              dev     54290             8753            3527            2.57               6.39                   74.76
              test    65936             11318           4954            2.39               9.18                   64.31

At this point, we have four subnets in the network, each of dimensionality k: one resulting from the one-hot encoding of the input token, one resulting from the convolutional encoding of the input token, and two resulting from the embeddings subnets for the left and right context. Finally, we concatenate the output of these subnets into a vector of size 4k and project it onto an output layer which has a unique output neuron for each class in the sequence modelling task. A standard 'softmax' normalization is applied to normalize the activations of these output neurons, as if they were probabilities (summing to 1). All the parameter weights in this architecture are first initialized using a random distribution, except for the contextual subnets, which are initialized using the word2vec weights. We train the neural network using gradient descent. Briefly put, this learning algorithm will start predicting a batch of the training examples using the randomly initialized weights. It will then calculate the prediction loss which results with respect to the correct labels, which will initially be extremely high. Then, it will determine how each weight individually contributed to this loss, and subsequently apply a small update to this weight which would have decreased the prediction error which it caused.
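A hedged sketch of how the four subnets could be wired together with the Keras functional API (the released code used an earlier keras version, so layer names and exact wiring differ): only k = 1024, the filter length of 3, the softmax output, and the dropout rate of 0.5 (Section 4.4) come from the text; the remaining sizes are illustrative, and the context subnets here simply take the concatenated pretrained embeddings as input.

```python
from keras.layers import Input, Conv1D, Flatten, Dense, Dropout, concatenate
from keras.models import Model

k = 1024                                                 # subnet dimensionality (from the text)
t, n_chars, n_tokens, n_lemmas = 15, 40, 10000, 5000     # illustrative sizes
emb_dim, n_left, n_right = 150, 2, 1                     # illustrative context sizes

# (i) convolutional subnet over the focus token's characters
chars_in = Input(shape=(t, n_chars))
conv = Flatten()(Conv1D(1000, 3, activation='relu')(chars_in))
conv_out = Dropout(0.5)(Dense(k, activation='relu')(conv))

# (ii) one-hot encoding of the focus token itself
focus_in = Input(shape=(n_tokens,))
focus_out = Dropout(0.5)(Dense(k, activation='relu')(focus_in))

# (iii) + (iv) left and right context subnets (concatenated pretrained embeddings)
left_in = Input(shape=(n_left * emb_dim,))
left_out = Dropout(0.5)(Dense(k, activation='relu')(left_in))
right_in = Input(shape=(n_right * emb_dim,))
right_out = Dropout(0.5)(Dense(k, activation='relu')(right_in))

# concatenate the four k-dimensional subnets and predict the lemma with a softmax
merged = concatenate([conv_out, focus_out, left_out, right_out])   # size 4k
lemma_out = Dense(n_lemmas, activation='softmax')(merged)

model = Model([chars_in, focus_in, left_in, right_in], lemma_out)
```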

During a fixed number of epochs (100 in our case), we feed the training data through this network, divided into a number of randomly jumbled mini-batches, each time tightening the fit of the network with respect to the correct class labels. During training (see Figure 2), the loss of the network is not measured in terms of accuracy, but in terms of cross-entropy, a common loss measure for classification tasks. We adopt a mini-batch size of 50; this is a fairly small batch size compared to the literature, but this restricted batch size proved necessary to avoid the higher-frequency items from dominating the gradient updates. We use the adaptive, so-called Adadelta optimization mechanism (Zeiler, 2012). Importantly, the Adadelta routine prevents the need to empirically set a learning rate at the beginning of training, since the algorithm will independently re-adapt this rate, keeping track of the loss history on a per-weight basis.
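A minimal sketch of this training configuration (cross-entropy loss, Adadelta without a manually set learning rate, mini-batches of 50, 100 epochs), assuming the `model` from the previous sketch; the input and label arrays are hypothetical placeholders.

```python
model.compile(optimizer='adadelta', loss='categorical_crossentropy',
              metrics=['accuracy'])
# X_chars, X_focus, X_left, X_right, y_lemmas: hypothetical training arrays;
# D_*: the corresponding development-set arrays.
model.fit([X_chars, X_focus, X_left, X_right], y_lemmas,
          batch_size=50, epochs=100, shuffle=True,
          validation_data=([D_chars, D_focus, D_left, D_right], d_lemmas))
```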

Fig. 1. Schematic visualization of the network architecture evaluated in this article. Embedding layers are used on both sides of the focus token as 'subnets' to model the lexical context surrounding a token. The token itself is primarily represented using a convolutional subnet at the character level (and potentially also an embedding layer). The results of these subnets are concatenated into a single hidden representation, which is used to predict the correct lemma at the top layer of the network.

4.4 General comments

All dense connections in this network take the form of a matrix multiplication and the addition of a bias (X · W + b) and make use of a 'relu' activation ('rectified linear unit'), which will ignore negative weights outputted by the dense layer by setting them to zero. Additionally, all dense layers make use of the so-called dropout procedure during training (with p = 0.50), meaning that in each mini-batch half of the connections in the dense layers are randomly dropped (Srivastava et al., 2014). This technique has various beneficial effects on training, the most important one being that it forces the network to be less dependent on the specific presence of certain features in the input, thereby avoiding overfitting on specific instances in the training data. An implementation of the proposed architecture is freely available from the public software repository associated with this paper.5 Where possible, this repository also holds the data used below. Our system is implemented in Python using the keras library, built on top of theano (Bastien et al., 2012); the latter library provides automatic differentiation for arbitrary graphs and allows code to be executed on the Graphics Processing Unit (GPU).6 Other major dependencies of our code include scikit-learn (Pedregosa et al., 2011) and gensim.
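To make the dropout idea mentioned above concrete, the snippet below is a purely didactic numpy illustration (not the library code used): with p = 0.5, each hidden unit is independently zeroed out in a given mini-batch.

```python
import numpy as np

rng = np.random.RandomState(0)
activations = rng.rand(8)          # pretend hidden-layer activations
keep_mask = rng.rand(8) > 0.5      # with p = 0.5, each unit is dropped at random
dropped = activations * keep_mask  # dropped units contribute nothing downstream
print(dropped)
```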

5 Data Sets

In this article, we report results on four different Middle Dutch data sets (see Table 2), which offer a representative example of a medieval European language (Van Kerckvoorde, 1992). We should stress that our system is fairly language-independent and could easily be retrained on other types of corpora, as long as they use a character alphabet. In this article, we first of all use two administrative data sets representing charter collections. CG-ADMIN contains the charter collection edited by Gysseling (1977), which has been digitized and annotated by the Institute for Dutch Lexicography (1998). These charters all survive in originals predating ca. 1300 AD. Secondly, we use the CRM-ADELHEID collection, a comparable collection of fourteenth-century Middle Dutch charters, which has been the subject of a study comparable to ours (Van Halteren and Rem, 2013). As literary materials, we first of all use the literary counterpart of CG-ADMIN in the Corpus-Gysseling, a collection of Middle Dutch literary texts that all survive in manuscript copies predating ca. 1300 AD (Gysseling, 1980-1987; Institute for Dutch Lexicography, 1998; Kestemont et al., 2010).

Thirdly, we also use a newly created, smaller corpus of Middle Dutch religious texts (e.g. sermons, bible translations, visions, ...), most of which are at least semi-literary in nature. This last data set (RELIG) is contained in our code repository.

These data sets have all been annotated using common annotation guidelines, although they each show a number of important idiosyncrasies. Table 4 shows an example of the data. Generally speaking, we adopt the annotation practice which has been developed by Piet van Reenen et al. and which has been carefully documented in the context of the release of the Adelheid tagger-lemmatizer for Middle Dutch charters (Van Halteren and Rem, 2013).7

Each token in the corpora has been annotated with a normalized dictionary headform or lemma in a present-day spelling. Broadly speaking, lemmatization allows us to abstract over individual instances of tokens which only differ in inflection or orthography (Knowles and Mohd Don, 2004). For historical corpora, this has interesting applications in the context of database searching, (diachronic) topic modelling, or stylometry. The lemmas used in our corpora all correspond to entries in the Integrated Language Database (GTB) for the Dutch language, which can be consulted online and which is maintained by the Institute for Dutch Lexicography.8 The major advantage of using a present-day lemma in this context is that this allows scholars to normalize texts from multiple historic stages of Dutch to the same variant. This in turn allows interesting diachronic analyses of Dutch corpora, such as the ones currently prepared in the large-scale Nederlab project.9

Table 4. Example of some training data from the RELIG corpus, from the Seven Manieren van Heilige Minnen ('Seven Ways of Holy Love') by the mystical poetess Beatrice of Nazareth ('These are seven ways of loving, which come from the highest'). The first column shows the original input token, the second the lemma label. Note that some forms require a composite lemma (e.g. uit + de for uten).

Token     Lemma
seuen     zeven
maniren   manier
sin       zijn
van       van
minnen    minne
die       die
comen     komen
uten      uit+de
hoegsten  hoogste

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach by Van Halteren and Rem (2013) to make our splitting as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters which have a rank index ending in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
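A small sketch of this rank-based splitting heuristic; the function and data layout are hypothetical, but the rule (ranks ending in 9 go to development, ranks ending in 0 to test, everything else to training) follows the description above.

```python
def split_charters(charters):
    """charters: list of (date, charter) pairs; returns (train, dev, test) lists."""
    train, dev, test = [], [], []
    for rank, (_, charter) in enumerate(sorted(charters, key=lambda c: c[0]), 1):
        if rank % 10 == 9:       # rank index ending in 9 -> development set
            dev.append(charter)
        elif rank % 10 == 0:     # rank index ending in 0 -> held-out test set
            test.append(charter)
        else:
            train.append(charter)
    return train, dev, test
```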

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that greatly vary in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00-1.00. In the development set, we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less interesting. A clear advantage of the evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions in a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of the text and retrain the system before tackling the rest of a text.

7 Results

In Tables 5-8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch) which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest neighbour search, and for unknown words we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1,000 high-frequency function words to be included in the features used for the contextual disambiguation. Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross validation. This is a computationally intensive setup which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases. For the test results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, RELIG and CG-LIT appear to be the most difficult corpora, the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that for large enough corpora, the present system can accurately lemmatize over half of the unknown words in unseen material.

Fig. 2. Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Table 5. Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus/Phase   Train: All (%)   Dev: All (%)   Dev: Known (%)   Dev: Unknown (%)
RELIG          97.63            91.26          94.67            45.44
CG-LIT         96.88            91.78          94.56            51.45
CRM-ADELHEID   98.38            93.61          96.17            56.59
CG-ADMIN       98.89            95.44          97.89            59.48

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is of course especially for the unknown words that the present optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform the performance of a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens, such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters but leave out any contextual features, to be able to zoom in on the convolutional aspects of the model only. The results are somewhat surprising, in that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information is able to be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.

Table 6. Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We include the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens.

                MBT                                   Our system
Corpus          All (%)   Known (%)   Unknown (%)     All (%)   Known (%)   Unknown (%)
RELIG           88.50     94.03       19.65           90.97     94.50       47.04
CG-LIT          88.72     93.93       15.64           91.67     94.45       52.65
CRM-ADELHEID    91.87     95.90       29.44           93.95     96.15       59.86
CG-ADMIN        88.35     95.03       22.30           90.91     95.35       46.99

8 Discussion

What does the presented architecture learn To ad-dress this interpretative question we offer two in-tuitive visualizations In Figure 3 we plot theembeddings which are learned by the initial word2-vec model (Mikolov et al 2013) which is fed into

the embedding layers of the networks before train-ing commences (here for the CG-LIT corpus) We usethe t-SNE algorithm (Van der Maaten and Hinton2008) which is commonly used to visualize suchembeddings in two-dimensional scatterplots Wehave performed a conventional agglomerative clus-ter analysis on top of the scatterplot and we addedcolours to the word labels according to this clusteranalysis as an aid in reading This visualizationstrongly suggests that morpho-syntactic ortho-graphic as well as semantic qualities of wordsappear to be captured by the initial embeddingsmodel In purple to the right we find the namesof the biblical evangelists (mattheus marcus )which clearly cluster together on the basis of seman-tic qualities In light-blue (top-left) we find a se-mantic grouping of the synonyms seide and sprac(lsquo[he] saidrsquo) Orthographic similarity is also cap-tured which is remarkable because the embeddingsmodel does not have access to sub-word level

Table 7 Development results for two variants of the system; both variants have a convolutional subnet (5,000 filters), but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus \ method     Only convolutions (5,000 filters)      Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset              All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG               89.45     92.69       45.97            87.59     91.67       32.83
CG-LIT              88.46     91.20       48.79            82.86     87.42       16.86
CRM-ADELHEID        89.58     92.28       50.57            87.96     91.73       33.38
CG-ADMIN            91.85     94.45       53.76            89.62     93.28       36.03

Table 8 Test results for two variants of the system; both variants have a convolutional subnet (5,000 filters), but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus \ method     Only convolutions (5,000 filters)      Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset              All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG               88.95     92.38       46.38            87.37     91.53       35.55
CG-LIT              88.25     91.03       49.21            82.46     87.11       17.28
CRM-ADELHEID        89.91     92.24       53.88            88.63     92.00       36.40
CG-ADMIN            86.54     90.91       43.39            83.27     88.81       28.46


The spelling variants als and alse ('if') are, for instance, grouped. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light-blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
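For readers who want to reproduce this kind of inspection, the following is a minimal sketch of the visualization pipeline described above (pretrained skipgram embeddings, a t-SNE projection, and an agglomerative clustering used only to colour the labels), written against the gensim 4.x and scikit-learn APIs. The toy corpus, parameter values, and variable names are illustrative placeholders, not the exact setup used for Figure 3.

import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import TSNE

# Placeholder corpus: a list of tokenized sentences from the training data.
sentences = [["ic", "las", "dat", "mattheus", "seide"],
             ["wi", "lasen", "dat", "marcus", "sprac"]]

model = Word2Vec(sentences, vector_size=150, window=5, min_count=1, sg=1)
words = list(model.wv.index_to_key)
vectors = model.wv[words]

# Project the 150-dimensional embeddings to 2D with t-SNE.
coords = TSNE(n_components=2, perplexity=5, init="random").fit_transform(vectors)

# Agglomerative clustering on the 2D coordinates, used only to colour the word labels.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(coords)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()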

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop (c, k, or ck). This filter would be useful to detect the frequent pronoun ick ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter consonant.

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Both morpho-syntactic, orthographic, and semantic qualities of words appear to be captured by the model.


Here too, the filters allow these characters to some extent to be detected in various slots in the filters, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.
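The kind of filter inspection reported in Table 9 can be sketched as follows, assuming access to the trained convolutional weights as an array of shape (filter_length, alphabet_size, n_filters); the variable names and the random stand-in weights below are hypothetical and not part of the released system.

import numpy as np

# Hypothetical stand-ins: `alphabet` is the indexed character vocabulary and
# `conv_weights` the learned convolutional weights (filter_length, n_chars, n_filters).
alphabet = list("abcdefghijklmnopqrstuvwxyz")
conv_weights = np.random.randn(3, len(alphabet), 5000)

def top_chars_per_slot(filter_id, k=3):
    """For one filter, return the k characters with the highest weight in each slot."""
    rows = []
    for slot in range(conv_weights.shape[0]):
        activations = conv_weights[slot, :, filter_id]
        best = np.argsort(activations)[::-1][:k]
        rows.append([(alphabet[i], round(float(activations[i]), 2)) for i in best])
    return rows

for slot, chars in enumerate(top_chars_per_slot(filter_id=17), start=1):
    print(slot, chars)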

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, through which excellent results have recently been obtained in various tasks across NLP. Of course, it would be well feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more corpus-specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which each have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores shown between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                                 Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                         1     i (1.08)    k (1.06)    c (0.81)
                                          2     k (1.71)    c (1.47)    i (0.67)
                                          3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack', 'aak', 'aac', ...)  1     a (0.69)    b (0.33)    l (0.30)
                                          2     a (0.61)    c (0.59)    m (0.56)
                                          3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)             1     u (0.41)    a (0.29)    k (0.25)
                                          2     s (0.59)    v (0.48)    u (0.47)
                                          3     f (1.40)    v (1.33)    u (0.72)


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.

Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A Hybrid Rule/Model-based Finite-state Framework for Normalizing SMS Messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.

Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.

Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828.

Bollmann, M. (2012). (Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 3–14.

Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 27–38.

Cahieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.

Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.

Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.

Chrupala, G. (2014). Normalizing Tweets with Edit Scripts and Recurrent Neural Embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, MD: Association for Computational Linguistics, pp. 680–6.

Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based Active Learning for Lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.

De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.

Dieleman, S. and Schrauwen, B. (2014). End-to-end Learning for Music Audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.

Ernst-Gerlach, A. and Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.

Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.

Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating Manual Feature Engineering for Part-of-speech Tagging of Twitter Microposts Using Distributed Word Representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing, Montreal, 2015, s.p.

Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.

Gysseling, M. (1980-1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.

Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8): 1735–80.

Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–665.

Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words—memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten Digit Recognition with a Back-propagation Network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.

Levy, O. and Golberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.

Ljubesic, N., Erjavec, T. and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.

Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.

Manning, C. D., Prabhaker, P. and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.

Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.

Nair, V. and Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22. New York, NY: ACM.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885-1929) Using Pattern Matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes
1. An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').
2. In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3. https://code.google.com/archive/p/word2vec
4. https://radimrehurek.com/gensim
5. https://github.com/mikekestemont/tag
6. http://keras.io
7. The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8. Online at http://gtb.inl.nl (last accessed 6 November 2014).
9. See https://www.nederlab.nl/home (last accessed 6 November 2014).
10. Based on personal communication with the authors.



of orthographic variation in the input and struggle to reliably identify unknown words in unseen texts. Naturally, this results in a considerable percolation of errors to higher-end layers in a more complex processing pipeline (e.g. full syntactic parsers).

Secondly, there exist many present-day resource-scarce languages for which far fewer resources are available, e.g. in the form of annotated corpora. Both historic and resource-scarce languages make up the focus of many Digital Humanities projects targeting the better exploitation or preservation of (specific cultural artefacts in) these languages. For these too, however, many of the available lemmatizers are not sufficiently equipped to deal with the advanced sub-word level variation, nor to optimally exploit the few resources that are available.

Needless to say, the availability of efficient lemmatizers for historic and/or low-resource languages is therefore a desideratum across various fields in humanities computing. In this article, we describe a novel, language-independent approach to sequence tagging for variation-rich languages. To this end, we apply a set of techniques from the field of 'deep' representation learning, a family of algorithms from Machine Learning, based on neural networks, which has recently led to a dramatic increase in performance across various domains. In this article, we will first survey some of the most relevant related research, before we go on to introduce the field of deep learning. We then describe the architecture of our lemmatizer and discuss the Middle Dutch data sets on which we will evaluate our system. After presenting the results, we offer an interpretative exploration of our trained models. We conclude with a set of pointers for future research, stressing the considerable potential of representation learning for the Humanities at large.

2 Related Research

Lemmatization is closely related to morphological analysis and PoS tagging, which are popular research domains in computational linguistics, with recent studies especially focusing on unsupervised learning and on (low-resource) languages with a rich morphology or inflectional system (e.g. Mitankin et al., 2014). An important boost in NLP research for such languages has been inspired by the increase in computer-mediated communication (CMC; Crystal, 2001), including research into the normalization of Instant Messaging or posts on online blogging platforms such as the popular micro-blogging service Twitter. In CMC too, we find a wild proliferation of language variation in written communication, especially affecting surface phenomena such as spelling. A common solution to the problem of spelling variants in CMC is to apply some form of spelling normalization before subsequent processing. Similar research has been reported in the field of post-Optical Character Recognition (OCR) error correction (Reynaert et al., 2012). Spelling normalization essentially involves replacing a non-canonical surface token by a standard form (e.g. Beaufort et al., 2010; Chrupala, 2014), as would be done with spelling correctors in modern word processing applications. In social media, a variety of approaches has been suggested (Schulz et al., 2016), including e.g. transliteration approaches borrowed from Machine Translation (e.g. Ljubesic et al., 2014).

The field of NLP for Historical Languages has recently been surveyed in a dedicated monograph (Piotrowski, 2012). The problem of orthographical variation naturally plays a major role in this overview. Recent representative contributions in this field include Hendrickx and Marquilhas (2011) and Reynaert et al. (2012) for historical Portuguese, Scherrer and Erjavec (2013) for historical Slovene, Bollmann (2012) for Early New High German, or Bouma and Hermans (2012) on the automated syllabification of Middle Dutch. One characteristic feature that sets this kind of research into present-day languages apart from historical language research is that, for modern languages, researchers typically are able to normalize noisy language data into an existing canonical standard form. The majority of historical languages, however, lack such a canonical orthographic variant, and therefore this solution can rarely be applied (Souvay and Pierrel, 2009): in the absence of a standard language, words simply do not have a single canonical spelling by which other, non-standard spellings can be replaced. It is therefore also common to annotate words with other sorts of labels, instead of attempting an explicit respelling.


Lemmas are often used in this respect (Van der Voort van der Kleij, 2005): a lemma is a normalized label which unambiguously links words to the same entry (headword) in a lexical resource, such as a dictionary, if they only differ in inflection or spelling (Knowles and Mohd Don, 2004).

In the present article, we conceive of lemmatization as a high-dimensional classification task, where lemmas are assigned to tokens as class labels (i.e. dictionary headwords are the system's output). Therefore, our lemmatizer does not perform any explicit PoS disambiguation (although the system could easily be trained on PoS labels too). Because of this classification setting, the proposed lemmatizer will not be able to predict new, unseen lemmas at test time, because it has not seen these labels during training (cf. Chrupala et al., 2008). However, this also prevents the lemmatizer from outputting non-existent 'gibberish' labels, which is desirable given the limited size of our data sets. Lemmatization too has been an active area of research in computational linguistics, especially for languages with a rich inflectional system (Daelemans et al., 2009; Scherrer and Erjavec, 2013; De Pauw and De Schryver, 2008; Lyras et al., 2008). Many studies apply a combination of existing taggers (for contextual disambiguation) with an optimized spelling normalization component to better deal with unknown words.

One rough distinction which could be made is between 'online' and 'offline' approaches. In 'online' approaches, when confronted with an unknown word at the testing stage, systems will explicitly attempt to identify the unknown word as a spelling variant of a known token (i.e. a token which was seen during training), before applying any disambiguation routines. To this end, researchers typically apply a combination of (weighted) string distance measures (e.g. Levenshtein distance; Levenshtein, 1966). While the online strategy might result in a high recall, it can also be expensive to apply, because the unknown token has to be matched with all known tokens, and some string distance measures can be costly to apply. The offline strategy follows a different road: at training time, it will attempt to generate new, plausible spelling variants of known tokens, through aligning words and applying common edit patterns to generate new forms. At the testing stage, new words can simply be matched against the newly generated forms. Interestingly, this approach will be fast to apply during testing (and might result in a high precision), but it might also suffer from the noise being introduced in the variant generation phase (e.g. overlapping variants).1 Van Halteren and Rem (2013) have reported the successful application of such an approach, which is in many ways reminiscent of the generation of search query variants in Information Retrieval research.
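To make the contrast concrete, the following is a minimal sketch of the 'online' strategy only (it is not the system proposed below, nor any of the implementations cited above): an unknown test token is matched against the known training vocabulary with a plain Levenshtein distance, and the lemma of the closest known token is reused. The toy vocabulary and function names are placeholders.

def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Placeholder mapping from known training tokens to their lemmas.
known_vocabulary = {"ghelik": "gelijk", "seide": "zeggen", "coninghinne": "koningin"}

def lemmatize_unknown(token):
    # Reuse the lemma of the known token that is closest in edit distance.
    closest = min(known_vocabulary, key=lambda known: levenshtein(token, known))
    return known_vocabulary[closest]

print(lemmatize_unknown("gelik"))  # -> 'gelijk'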

3 'Deep' Representation Learning

In this article, we describe a novel system for lemmatization using 'deep' representation learning, which has barely been applied to the problem of lemmatization for historic and/or low-resource languages specifically. Importantly, we will show that a single system reaches state-of-the-art tagging performance across a variety of data sets, even without problem-specific fine-tuning of hyper-parameters. Deep learning is a popular paradigm in machine learning (LeCun et al., 2015). It is based on the general idea of multi-layer neural networks: software systems in which information is propagated through a layered architecture of inter-connected neurons (or information units) that iteratively transform the input information and feed it to subsequent layers in the networks. While theoretically the idea of neural networks has been around for a long time, it has only recently become feasible to train more complex architectures, because of the millions of parameters that typically have to be optimized during training.

Deep learning is part of a broader field called representation learning or feature learning (Bengio et al., 2013). Deep learning can indeed be contrasted with more traditional approaches in machine learning in that it is very much geared towards finding good representations of the input data, or extracting 'features' from it. Roughly put, traditional learning algorithms would be very much dependent on a researcher's representation of the input data and would try to solve a task on the basis of the particular features suggested and manually engineered by the researchers. In Deep Learning, the process of feature engineering is to a larger extent outsourced to the learning algorithm itself.


Instead of solely relying on the pre-specified features in the input data (i.e. hand-crafted by humans), deep neural networks will attempt to independently recombine the existing input features into new combinations of features, which typically offer a better representation of the input data. They do so by iteratively projecting the input onto subsequent layers of information units, which are typically said to increasingly capture more abstract and complex patterns in the input data (Bengio et al., 2013).

The intuition behind deep learning is typically illustrated using an example from computer vision, the field where some of the earliest breakthroughs in Deep Learning have been reported (e.g. handwritten digit recognition; LeCun et al., 1990). In computer vision, images are typically represented as a two-dimensional raster of pixels, which can take certain values across a series of input channels (e.g. one for every colour in the RGB spectrum). When propagating an input image through a layered neural network, it has been shown that the earliest layers in the network will be receptive to very raw, primitive shapes in the data, such as a corner, parts of a curve, or a stark light-dark contrast, called 'edges' (LeCun et al., 2015). Only at subsequent layers in the network are these primitive forms recombined into higher-level units, such as an eye or ear. At still higher layers, the network will eventually become sensitive to complex units, such as faces. It can therefore be said that such neural networks are able to learn increasingly abstract (re)combinations of the original features in the input (Bengio et al., 2013). Interestingly, it has been shown that visual perception in mammals works in a highly similar, hierarchical fashion (Cahieu et al., 2014). Because of its hierarchical nature, this form of learning is commonly called 'deep', since it increasingly captures 'deeper' aspects of the problem it is working on.

4 Architecture

The system architecture we describe here primarily builds upon two basic components derived from recent studies in Deep Learning for NLP: (i) one-dimensional or 'temporal' 'convolutions' to model the orthography of input words at the character level, and (ii) word 'embeddings' to model the lexical context surrounding the input tokens for the purpose of disambiguation. We will first introduce the type of 'subnets' associated with both components, before we go on to discuss how our architecture eventually combines these subnets.

4.1 Convolutions
Convolutions are a popular approach in computer vision to battle the problem of 'translations' in images (LeCun et al., 1990, 2015). Consider an object classification system whose task it is to detect the presence of certain objects in an image. If, during training, the system has come across e.g. a banana in the upper-left corner of a picture, we would like the system to be able to detect the same object in new images, even if the exact location of that object has shifted (e.g. towards the lower-right corner of the image). Convolutions can be thought of as a fixed-size window (e.g. 5 × 5 pixels) which is slid over the entire input image in a stepwise fashion. Convolutions are therefore also said to learn 'filters', in that they learn a filtering mechanism to detect the presence of certain features, irrespective of the exact position in which these features occur in the image. Importantly, each filter will scan the entire image for the local presence of a particular low-level feature (e.g. the yellow-coloured, curved edge of the banana) and detect it irrespective of the exact position of the feature (e.g. lower-right corner versus upper-left corner). Typically, convolutions are applied as the initial layers in a neural network, before passing on the activation of filters to subsequent layers in the network, where the detected features can be recombined into more complex features (e.g. an entire banana). Whereas convolutions in vision research are typically two-dimensional (cf. a raster of pixels), here we apply convolutions over a single dimension, namely the sequence of characters in an input word. The idea to apply convolutions to character series in NLP has recently been pioneered by Zhang et al. (2015), reporting strong results across a variety of higher-end text classification tasks. Such convolutions have also been called 'temporal' convolutions, because they model a linear sequence of items.


Representing input words as a sequence of characters (i.e. at the sub-word level) has a considerable advantage over more traditional approaches in sequence tagging. Here, input words are typically represented using a 'one-hot vocabulary encoding': given an indexed vocabulary of n known tokens, each token is represented as a long vector of n values, in which the index of one token is set to 1, whereas all other values are set to 0. Naturally, this leads to very sparse, high-dimensional word representations. In such a one-hot representation, it is therefore not uncommon to apply a cut-off and only encode the more frequent vocabulary items, to battle the sparsity of the input data. Of course, a one-hot vector representation (especially with a severe cut-off) cannot represent new, unseen tokens at test time, since these do not form part of the indexed training vocabulary. Most systems for tasks like PoS tagging would in those cases back off to additional (largely hand-crafted) features to represent the target token, such as the final character trigram of the token, to capture the token's inflection (e.g. Zavrel and Daelemans, 1999).

The character-level system presented here does not struggle to model out-of-vocabulary words. This is a worthwhile characteristic with respect to the present use case: in languages with a rich orthographical and morphological variation, the dimensionality of word-level one-hot vectors dramatically increases, because the token vocabulary is much larger, and thus the effect of applying a cut-off will also be more dramatic (Kestemont et al., 2010). Additionally, the proportion of unknown target tokens at test time is of course much more prominent than in traditional languages, due to spelling variation. Therefore, our architecture aims to complement or even replace a conventional one-hot encoding with a convolutional component.

With respect to the convolutional subnet, our system models each input token as a sequence of characters, in which each character is represented as a one-hot vector at the character level. As the example in Table 1 shows, this leads to an at-first-sight primitive, yet much lower-dimensional token representation compared to a traditional token-level one-hot encoding, which additionally is able to represent out-of-vocabulary items. We include all characters in the character index used. We specify a certain threshold t (e.g. 15 characters): words longer or shorter than this threshold are respectively truncated to the right to length t, or are padded with zero-filled vectors. In an analogy to computer vision research, this representation models characters as if they were 'pixels', with a channel dedicated to a single character.

Our model now slides convolutional filters over this two-dimensional representation, with a length of 3 and a stride (or 'step size') of 1. In practice, this will mean that the convolutional layer will iteratively inspect partially overlapping, consecutive character trigrams in the input string. This representation is inspired by the generic approach to sequence modelling in NLP followed by Zhang et al. (2015; character-level convolutions), as well as Dieleman and Schrauwen (2014; one-dimensional convolutions for music) and Kalchbrenner et al. (2014; word-level convolutions on the basis of pretrained embeddings). At the input stage, each target token is represented by an n × t matrix, where n represents the number of distinct characters in a data set and t is the uniform length to which each target token is normalized, through padding with trailing zeroes or through cutting characters.
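As an illustration, the following is a minimal sketch of this character-level input representation (cf. Table 1); the alphabet, the threshold value, and the helper name encode_token are illustrative assumptions rather than the exact implementation released with the paper.

import numpy as np

# Sketch of the character-level representation: each token becomes a t x n matrix
# (t = padded/truncated token length, n = alphabet size), over which filters of
# length 3 and stride 1 are subsequently slid by the convolutional layer.
alphabet = sorted(set("abcdefghijklmnopqrstuvwxyz"))
char_index = {c: i for i, c in enumerate(alphabet)}
t = 15  # uniform token length (threshold)

def encode_token(token):
    matrix = np.zeros((t, len(alphabet)), dtype=np.float32)
    for pos, char in enumerate(token[:t]):       # truncate overly long tokens to the right
        if char in char_index:
            matrix[pos, char_index[char]] = 1.0  # one-hot 'pixel' per character
    return matrix                                # shorter tokens keep their zero padding

print(encode_token("coninghinne").shape)  # (15, 26)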

We motivate the choice for a convolutional approach to model orthography on the basis of the following considerations. When using a filter size of e.g. 3, the receptive field of the learned filters roughly corresponds to that of syllable-like groups in many languages (e.g. consonant-vowel-consonant groups). One would therefore expect the filters to become sensitive to morpheme-like characteristics. The most interesting aspect of convolutions seems to be that the learned filters are detected across all positions in the input string, which is likely to render the detection of morphemes less sensitive to their exact position in a word. This is a valuable property for our corpora, since the non-standard orthography often allows the introduction of silent characters (e.g. 'gh' for 'h'), causing local character shifts in the input, or the kind of 'translations' in computer vision to which we know convolutions are fairly robust ('gelik' vs. 'ghelik').2


Finally, it should be stressed that the learned filters can be expected to offer a flexible string representation: a filter which learned to detect the pattern 'dae' might still be responsive to the pattern 'dai', which is useful since 'i' and 'e' were often used interchangeably as vowel-lengthening graphemes in the medieval language considered here.

4.2 Embeddings
Many sequence taggers crucially depend on contextual information for disambiguating tokens. In the classic example 'The old man the boat', contextual information is needed to determine that 'man' in this sentence should be tagged as 'verb' instead of the more common 'noun' tag for this token. Most modern taggers therefore integrate a lexical representation of the immediate context of an input token, e.g. the two tokens preceding the input token to the left and a single token to the right. Roughly put, traditional architectures will typically represent this context using a one-hot encoder at the token level: for each token in the positional skeleton surrounding the input token, a high-dimensional vector will record its presence using the index of the previously indexed training tokens (e.g. Zavrel and Daelemans, 1999). To battle sparsity in the contextual representation too, it is again not uncommon to apply a frequency-based cut-off here, or to only record the presence of a small set of functors surrounding the input token, or even hand-crafted features (such as the last two characters of the word preceding the input string, as they often coincide with morphological suffixes).

Because of the sparsity associated with one-hot encodings, research in NLP has stressed the need for smoother, distributional representations of lexical contexts in this respect. As the size of the context windows considered increases, it becomes increasingly difficult to obtain good, dense representations of contexts, especially if one restricts the model to strict n-gram matching (Bengio et al., 2003). A simple example: in a word sequence like 'the black fat cat' or 'the cute white cat', the PoS category of 'cat' can be partially inferred from the fact that two adjectives precede the target token. When presented with the new examples 'the fat white cat' or 'the cute fat cat', the left contexts 'fat white' and 'cute fat' will fail to find a hard match to one of the training contexts in the case of a strict matching process. Naturally, we would like a model to be robust in the face of such naive transpositions.

An important line of research in current NLP therefore concerns the development of models which can represent words not in terms of a simple one-hot encoder, but by using smooth, lower-dimensional word representations. Much of this research can be situated in the field of distributional semantics, being guided by the so-called 'Distributional Hypothesis' that words in themselves do not have real meaning in isolation, but that words primarily derive meaning from the words they tend to co-occur with, i.e. their lexical context (Harris, 1954; Firth, 1957). While blarf is a non-existing word, its use in the following sentences suggests that it refers to some sort of domestic animal, perhaps a dog: 'I'm letting the blarf out', 'I'm feeding the blarf', 'The blarf ate my homework'. Thus, by modelling patterns of word co-occurrences in large corpora, research has demonstrated that various unsupervised techniques can yield numeric word representations that offer useful approximations of the meaning of words (Baroni and Lenci, 2010).

Table 1 Illustration of the character-level representation of input tokens. Dummy example with length threshold t set at 15 and a restricted character vocabulary of size 7, for the Middle Dutch word coninghinne ('queen').

char   c   o   n   i   n   g   h   i   n   n   e   -   -   -   -
c      1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
e      0   0   0   0   0   0   0   0   0   0   1   0   0   0   0
g      0   0   0   0   0   1   0   0   0   0   0   0   0   0   0
h      0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
i      0   0   0   1   0   0   0   1   0   0   0   0   0   0   0
n      0   0   1   0   1   0   0   0   1   1   0   0   0   0   0
o      0   1   0   0   0   0   0   0   0   0   0   0   0   0   0


The vectors which make up such word representations are also called 'word embeddings', because they reflect how words are contextually 'embedded' in corpora.

The past years have witnessed a clear increase in the interest in distributional semantics, in particular in the field of deep learning for NLP. Mikolov et al. (2013) have introduced an influential 'skipgram' method to obtain word vectors, commonly known as 'word2vec'. Because the underlying method attempts to optimize a fairly simple training criterion, it can easily be applied to vast quantities of text, yielding vectors which have shown excellent performance in a variety of tasks. In one popular example, Mikolov et al. (2013) even showed that it is possible to solve relatively advanced analogical problems by applying plain vector arithmetic to the word embeddings obtained from large corpora. The result of the arithmetic expression 'vector(king) - vector(man) + vector(woman)', for instance, yielded a vector which was closest to that of the word 'queen'. Other illustrative examples include 'vector(russia) + vector(river) ≈ vector(wolga)' and 'vector(paris) - vector(france) + vector(belgium) ≈ vector(brussels)'.

Follow-up studies have stressed that other techniques can yield similar results and that the theoretical foundations of the word2vec algorithm still need some clarification (Levy and Golberg, 2014). Nevertheless, the fact that Mikolov et al. published a highly efficient, open-source implementation of the algorithm has greatly contributed to the state of the art in the field of word representation learning.3 An increasing number of applications in NLP include some form of word embedding strategy in their pipeline as a 'secret sauce' (Manning, 2015). An essential quality of modern word embeddings is that they have been shown not to be restricted to semantic aspects of words, but to also model morpho-syntactic qualities of words, illustrated by the fact that purely syntactic analogies often can also be solved: 'vector(her) - vector(she) + vector(he) ≈ vector(his)'.
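The analogy arithmetic mentioned above can be reproduced with the gensim implementation of the skipgram model (see note 4); the toy corpus below is only a placeholder to keep the sketch self-contained, and with such a small corpus the nearest neighbours are of course not meaningful.

from gensim.models import Word2Vec

# Placeholder corpus; with a realistically large corpus the same call recovers
# analogies of the 'king - man + woman ~ queen' type discussed above.
sentences = [["the", "king", "rules", "the", "land"],
             ["the", "queen", "rules", "the", "land"],
             ["a", "man", "and", "a", "woman"]]
model = Word2Vec(sentences, vector_size=150, window=2, min_count=1, sg=1, seed=1)

# vector(king) - vector(man) + vector(woman) -> nearest neighbours
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))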

The number of recent studies using word embeddings in NLP research is vast. A relevant example for the present research includes, for instance, the type of word embeddings used in the Twitter tagger described by Godin et al. (2014), but the literature abounds in other examples.

Table 2 Overview of some of the metadata and characteristics of the Middle Dutch corpora used in this study.

Name: Corpus-Gysseling, Literary texts
  Abbreviation: CG-LIT
  Origin: Institute for Dutch Lexicography
  Number of individual texts: 301
  Text variety: Miscellaneous literary texts (mostly rhymed) which survive from original manuscripts dated before 1300 AD

Name: Corpus-Gysseling, Administrative texts
  Abbreviation: CG-ADMIN
  Origin: Institute for Dutch Lexicography
  Number of individual texts: 574
  Text variety: 13th-century charters

Name: Miscellaneous religious texts
  Abbreviation: RELIG
  Origin: CLiPS Computational Linguistics Group and Ruusbroec Institute, University of Antwerp
  Number of individual texts: 336
  Text variety: Miscellaneous religious texts (e.g. Bible translations, mystical visions, sermons) spanning multiple centuries and regions

Name: Van Reenen-Mulder-Adelheid-corpus
  Abbreviation: ADELHEID
  Origin: Radboud University Nijmegen, etc.
  Number of individual texts: 792
  Text variety: 14th-century charters


In our system, we train a vanilla word2vec model, using a popular implementation of the original skipgram model.4 This model learns to project the tokens in our training data into a vector space consisting of 150 dimensions. The general intuition underlying the embeddings subnet in our architecture is that the underlying skipgram model will have learned similar embeddings for words that have certain similar qualities, both at the semantic and the morpho-syntactic level; in the 'cat' example above, we might expect that a model would learn similar embeddings for 'white' and 'black', because they are both adjectives referring to colours. Additionally, an interesting aspect for our medieval data is that the word embeddings learned might also be sensitive to orthographic or dialectical variation, providing similar embeddings for words that are spelling variants of each other, or words that are typical of specific dialects.

For the subnets representing the lexical context on either side, we first construct a simple one-hot encoding of the context tokens, with a traditional minimum frequency-based token cut-off of two occurrences. We then project these tokens onto a standard dense layer which has the same dimensionality as our skipgram model, here and elsewhere taking the form of a simple matrix multiplication, the addition of a bias, and a nonlinearity (a 'rectified linear unit'; Nair and Hinton, 2010). Importantly, this implies that we can initialize the weights of this layer using the pretrained embeddings from the skipgram model, but that these weights can be further refined during the training phase (see below). This initialization strategy can be expected to speed up the convergence of the model. We simply concatenate the dense vectors obtained for each context word on either side of the target token into a left-context subnet and a right-context subnet. The number of context words is of course a hyper-parameter which can be further tuned.
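The pretraining and initialization step described here could be sketched along the following lines, using the gensim implementation mentioned above (note 5 points to the code actually released with the paper); the corpus variable and the zero-padding convention are illustrative assumptions.

import numpy as np
from gensim.models import Word2Vec

# Pretrain 150-dimensional skipgram embeddings on the tokenized training data and
# collect them in a weight matrix that can seed the (trainable) context layers.
train_sentences = [["seuen", "maniren", "sin", "van", "minnen"]]  # placeholder corpus
w2v = Word2Vec(train_sentences, vector_size=150, window=5, min_count=1, sg=1)

vocab = ["<PAD>"] + list(w2v.wv.index_to_key)      # reserve index 0 for padding
weights = np.zeros((len(vocab), 150), dtype=np.float32)
for i, word in enumerate(vocab[1:], start=1):
    weights[i] = w2v.wv[word]

# In keras, this matrix can initialize an embedding layer that is further fine-tuned
# during training, e.g.:
# Embedding(input_dim=len(vocab), output_dim=150, weights=[weights], trainable=True)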

4.3 Combining the subnets
In our system (see Figure 1 for an overview), we again use a networked strategy to combine the convolutional subnet used to represent the focus token with the subnets representing the left and right context of the focus token. Next, we feed the output of the convolutional net forward in the network, using a plain dense layer that has a dimensionality of size k = 1024. At this point, the concatenated embeddings of the left and right context subnets have also been propagated to two vectors of size k. We can also include a one-hot encoding of the target tokens, which is propagated through a dense layer of size k.

Table 3 Tabular overview of some of the main token counts etc. of the Middle Dutch corpora used in this study. The 'upperbound unknown' column refers to the proportion of tokens that have lemmata which also occur in the corresponding training data set; this is the theoretically maximal score which a lemmatizer could achieve for the unknown tokens.

Corpus          Split   Number of words   Unique tokens   Unique lemmas   Tokens per lemma   % unknown tokens   Upperbound unknown
relig           train   129144            14668           6103            2.65               NA                 NA
                dev     16143             3686            2031            1.93               6.93               68.78
                test    16143             3767            1999            2.01               7.44               72.86
crm-adelheid    train   575721            41668           11163           4.1                NA                 NA
                dev     165883            17466           5547            3.42               6.46               78.05
                test    80015             11645           4057            3.1                6.07               79.07
cg-lit          train   464706            47460           15454           3.38               NA                 NA
                dev     58661             11557           5004            2.48               6.46               74.38
                test    58903             11692           5001            2.51               6.65               74.94
cg-admin        train   518017            42694           13129           3.42               NA                 NA
                dev     54290             8753            3527            2.57               6.39               74.76
                test    65936             11318           4954            2.39               9.18               64.31


This final layer might be important to have the network obtain a good fit of known tokens; below, we will explore whether the convolutional layer alone has sufficient capacity to reliably represent the entire vocabulary of training tokens.

At this point, we have four subnets in the network, each of dimensionality k: one resulting from the one-hot encoding of the input token, one resulting from the convolutional encoding of the input token, and two resulting from the embeddings subnets for the left and right context. Finally, we concatenate the output of these subnets into a vector of size 4 × k and project it onto an output layer which has a unique output neuron for each class in the sequence modelling task. A standard 'softmax' normalization is applied to normalize the activations of these output neurons, as if they were probabilities (summing to 1). All the parameter weights in this architecture are first initialized using a random distribution, except for the contextual subnets, which are initialized using the word2vec weights. We train the neural network using gradient descent. Briefly put, this learning algorithm will start predicting a batch of the training examples using the randomly initialized weights. It will then calculate the prediction loss which this results in with respect to the correct labels, which will initially be extremely high. Then, it will determine how each weight individually contributed to this loss and subsequently apply a small update to this weight, which would have decreased the prediction error which it caused.

During a fixed number of epochs (100 in our case), we feed the training data through this network, divided into a number of randomly jumbled mini-batches, each time tightening the fit of the network with respect to the correct class labels. During training (see Figure 2), the loss of the network is not measured in terms of accuracy, but in terms of cross-entropy, a common loss measure for classification tasks. We adopt a mini-batch size of 50; this is a fairly small batch size compared to the literature, but this restricted batch size proved necessary to prevent the higher-frequency items from dominating the gradient updates. We use the adaptive, so-called Adadelta optimization mechanism (Zeiler, 2012). Importantly, the Adadelta routine prevents the need to empirically set a learning rate at the beginning of training, since the algorithm will independently re-adapt this rate, keeping track of the loss history on a per-weight basis.

4.4 General comments
All dense connections in this network take the form of a matrix multiplication and the addition of a bias (X · W + b) and make use of a 'relu' activation ('rectified linear unit'), which will ignore negative weights outputted by the dense layer by setting them to zero.

Fig. 1 Schematic visualization of the network architecture evaluated in this article. Embedding layers are used on both sides of the focus token as 'subnets' to model the lexical context surrounding a token. The token itself is primarily represented using a convolutional subnet at the character level (and potentially also an embedding layer). The results of these subnets are concatenated into a single hidden representation, which is used to predict the correct lemma at the top layer of the network.


Additionally, all dense layers make use of the so-called dropout procedure during training (with p = 0.50), meaning that in each mini-batch, half of the connections in the dense layers are randomly dropped (Srivastava et al., 2014). This technique has various beneficial effects on training, the most important one being that it forces the network to be less dependent on the specific presence of certain features in the input, thereby avoiding overfitting on specific instances in the training data. An implementation of the proposed architecture is freely available from the public software repository associated with this paper.5 Where possible, this repository also holds the data used below. Our system is implemented in Python, using the keras library built on top of theano (Bastien et al., 2012); the latter library provides automatic differentiation for arbitrary graphs and allows code to be executed on the Graphics Processing Unit (GPU).6 Other major dependencies of our code include scikit-learn (Pedregosa et al., 2011) and gensim.
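To make the description in Sections 4.1–4.4 more tangible, the following condensed sketch re-expresses the architecture of Figure 1 in the modern tf.keras API rather than the original keras/theano setup; the sizes below (alphabet, vocabulary, number of filters, context length) are illustrative placeholders and not the values used in the experiments reported here.

from tensorflow import keras
from tensorflow.keras import layers

n_chars, t = 30, 15                 # character alphabet size, padded token length
vocab_size, n_lemmas = 5000, 2000   # token vocabulary and lemma classes (placeholders)
n_context = 2                       # context tokens per side
k = 1024                            # size of each subnet's output

# (i) convolutional subnet over the character-level representation of the focus token
char_in = keras.Input(shape=(t, n_chars), name="focus_chars")
conv = layers.Conv1D(filters=256, kernel_size=3, strides=1, activation="relu")(char_in)
conv = layers.Dropout(0.5)(layers.Dense(k, activation="relu")(layers.Flatten()(conv)))

# (ii) optional one-hot subnet for the focus token itself
token_in = keras.Input(shape=(vocab_size,), name="focus_onehot")
token = layers.Dropout(0.5)(layers.Dense(k, activation="relu")(token_in))

# (iii) + (iv) embedding subnets for the left and right lexical context
left_in = keras.Input(shape=(n_context,), dtype="int32", name="left_context")
right_in = keras.Input(shape=(n_context,), dtype="int32", name="right_context")
embed = layers.Embedding(input_dim=vocab_size, output_dim=150)  # can be seeded with word2vec
left = layers.Dropout(0.5)(layers.Dense(k, activation="relu")(layers.Flatten()(embed(left_in))))
right = layers.Dropout(0.5)(layers.Dense(k, activation="relu")(layers.Flatten()(embed(right_in))))

# concatenate the four subnets and predict the lemma with a softmax output layer
merged = layers.concatenate([conv, token, left, right])
output = layers.Dense(n_lemmas, activation="softmax")(merged)

model = keras.Model(inputs=[char_in, token_in, left_in, right_in], outputs=output)
model.compile(optimizer="adadelta", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(..., batch_size=50, epochs=100)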

5 Data Sets

In this article, we report results on four different Middle Dutch data sets (see Table 2), which offer a representative example of a medieval European language (Van Kerckvoorde, 1992). We should stress that our system is fairly language-independent and could easily be retrained on other types of corpora, as long as they use a character alphabet. In this article, we first of all use two administrative data sets, representing charter collections. The CG-ADMIN contains the charter collection edited by Gysseling (1977), which has been digitized and annotated by the Institute for Dutch Lexicography (1998). These charters all survive in originals predating ca. 1300 AD. Secondly, we use the CRM-ADELHEID collection, a comparable collection of fourteenth-century Middle Dutch charters, which has been the subject of a study comparable to ours (Van Halteren and Rem, 2013). As literary materials, we first of all use the literary counterpart of CG-ADMIN in the Corpus-Gysseling: a collection of Middle Dutch literary texts that all survive in manuscript copies predating ca. 1300 AD (Gysseling, 1980-1987; Institute for Dutch Lexicography, 1998; Kestemont et al., 2010).

Thirdly, we also use a newly created, smaller corpus of Middle Dutch religious texts (e.g. sermons, bible translations, visions, ...), most of which are at least semi-literary in nature. This last data set (RELIG) is contained in our code repository.

These data sets have all been annotated using common annotation guidelines, although they each show a number of important idiosyncrasies. Table 4 shows an example of the data. Generally speaking, we adopt the annotation practice which has been developed by Piet van Reenen et al. and which has been carefully documented in the context of the release of the Adelheid tagger-lemmatizer for Middle Dutch charters (Van Halteren and Rem, 2013).7

Each token in the corpora has been annotated with a normalized dictionary headform or lemma in a present-day spelling. Broadly speaking, lemmatization allows us to abstract over individual instances of tokens which only differ in inflection or orthography (Knowles and Mohd Don, 2004). For historical corpora, this has interesting applications in the context of database searching, (diachronic) topic modelling, or stylometry.

Table 4 Example of some training data from the RELIG corpus, from the Seven Manieren van Heilige Minnen ('Seven Ways of Holy Love') by the mystical poetess Beatrice of Nazareth ('These are seven ways of loving, which come from the highest'). The first column shows the original input token, the second the lemma label. Note that some forms require a composite lemma, e.g. uit+de for uten.

Token      Lemma
seuen      zeven
maniren    manier
sin        zijn
van        van
minnen     minne
die        die
comen      komen
uten       uit+de
hoegsten   hoogste


Dutch language, which can be consulted online and which is maintained by the Institute for Dutch Lexicography.8 The major advantage of using a present-day lemma in this context is that this allows scholars to normalize texts from multiple historic stages of Dutch to the same variant. This in turn allows interesting diachronic analyses of Dutch corpora, such as the ones currently prepared in the large-scale Nederlab project.9

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach by Van Halteren and Rem (2013) to make our splitting as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters which have a rank index ending in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
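As a rough illustration of this procedure (a sketch, not the authors' code), the date-based split can be expressed as follows, assuming each charter comes paired with its date:

```python
# Sketch of the date-based charter split described above: sort charters by date,
# rank them, and send ranks ending in 9 to the development set and ranks ending
# in 0 to the held-out test set; everything else remains training material.
def split_charters(charters):
    """charters: iterable of (date, charter) pairs."""
    train, dev, test = [], [], []
    for rank, (_, charter) in enumerate(sorted(charters, key=lambda pair: pair[0]), start=1):
        last_digit = rank % 10
        if last_digit == 9:
            dev.append(charter)
        elif last_digit == 0:
            test.append(charter)
        else:
            train.append(charter)
    return train, dev, test
```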

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that greatly vary in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00–1.00. In the development set, we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less interesting. A clear advantage of the evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions in a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of the text and retrain the system before tackling the rest of a text.
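The item-index split for the literary corpora can be sketched along the same lines (again a hypothetical helper, not the published implementation):

```python
# Sketch of the normalized-index split described above: items whose relative
# position falls in [0.60, 0.70) go to the development set, items in [0.70, 0.80)
# to the test set, and the remaining items to the training set.
def split_literary(items):
    train, dev, test = [], [], []
    for i, item in enumerate(items):
        position = i / len(items)       # normalized to the range 0.00-1.00
        if 0.60 <= position < 0.70:
            dev.append(item)
        elif 0.70 <= position < 0.80:
            test.append(item)
        else:
            train.append(item)
    return train, dev, test
```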

7 Results

In Tables 5–8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch), which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest-neighbour search, and for unknown words we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1,000 high-frequency function words to be included in the


features used for the contextual disambiguation. Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases. For the test

Fig. 2 Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Table 5 Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus / Phase   Train            Dev
Subset           All (%)          All (%)   Known (%)   Unknown (%)
RELIG            97.63            91.26     94.67       45.44
CG-LIT           96.88            91.78     94.56       51.45
CRM-ADELHEID     98.38            93.61     96.17       56.59
CG-ADMIN         98.89            95.44     97.89       59.48


results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, the RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that for large enough corpora the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is of course especially for the unknown words that the present optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on the CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform the performance of a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens, such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters, but leave out any contextual features, to be able to zoom in on the convolutional aspects of the

Table 6 Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We included the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens.

Corpus / Method   MBT                                 Our system
Subset            All (%)   Known (%)   Unknown (%)   All (%)   Known (%)   Unknown (%)
RELIG             88.50     94.03       19.65         90.97     94.50       47.04
CG-LIT            88.72     93.93       15.64         91.67     94.45       52.65
CRM-ADELHEID      91.87     95.90       29.44         93.95     96.15       59.86
CG-ADMIN          88.35     95.03       22.30         90.91     95.35       46.99


model only. The results are somewhat surprising, in that it appears that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information is able to be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the networks before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we added colours to the word labels according to this cluster analysis as an aid in reading. This visualization strongly suggests that morpho-syntactic, orthographic, as well as semantic qualities of words appear to be captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light-blue (top-left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word level

Table 7 Development results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus / Method   Only convolutions (5,000 filters)     Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset            All (%)   Known (%)   Unknown (%)     All (%)   Known (%)   Unknown (%)
RELIG             89.45     92.69       45.97           87.59     91.67       32.83
CG-LIT            88.46     91.20       48.79           82.86     87.42       16.86
CRM-ADELHEID      89.58     92.28       50.57           87.96     91.73       33.38
CG-ADMIN          91.85     94.45       53.76           89.62     93.28       36.03

Table 8 Test results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus / Method   Only convolutions (5,000 filters)     Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset            All (%)   Known (%)   Unknown (%)     All (%)   Known (%)   Unknown (%)
RELIG             88.95     92.38       46.38           87.37     91.53       35.55
CG-LIT            88.25     91.03       49.21           82.46     87.11       17.28
CRM-ADELHEID      89.91     92.24       53.88           88.63     92.00       36.40
CG-ADMIN          86.54     90.91       43.39           83.27     88.81       28.46


information: the spelling variants als and alse ('if'), for instance, are grouped. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light-blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
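The visualization in Figure 3 can be approximated with standard components from scikit-learn and matplotlib; the sketch below is an assumed workflow rather than the authors' plotting script, projecting the pretrained embeddings with t-SNE and colouring the word labels by an agglomerative clustering of the same vectors.

```python
# Sketch: 2-D t-SNE projection of word embeddings, with word labels coloured
# according to an agglomerative cluster analysis, as in Figure 3.
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

def plot_embeddings(words, vectors, n_clusters=10):
    """words: list of strings; vectors: numpy array of shape (len(words), dim)."""
    coords = TSNE(n_components=2).fit_transform(vectors)
    clusters = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(vectors)
    plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='tab10', s=5)
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y), fontsize=6)
    plt.show()
```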

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun 'ick' ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, and semantic qualities of words all appear to be captured by the model.


consonant. Here too, the filters allow these characters to some extent to be detected in various slots in the filters, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, through which recently excellent results have been obtained in various tasks across NLP. Of course, it would be feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered here), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores shown between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                                     Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                             1     i (1.08)    k (1.06)    c (0.81)
                                              2     k (1.71)    c (1.47)    i (0.67)
                                              3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack', 'aak', 'aac', ...)    1     a (0.69)    b (0.33)    l (0.30)
                                              2     a (0.61)    c (0.59)    m (0.56)
                                              3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)                 1     u (0.41)    a (0.29)    k (0.25)
                                              2     s (0.59)    v (0.48)    u (0.47)
                                              3     f (1.40)    v (1.33)    u (0.72)


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.

Beaufort, R., Roekhaut, S., Cougnon, L., and Fairon, C. (2010). A Hybrid Rule/Model-based Finite-state Framework for Normalizing SMS Messages. In Hajic, J., Carberry, S., Clark, S., and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828.

Bollmann, M. (2012). (Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 3–14.

Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 27–38.

Cahieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J., and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.

Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.

Chrupala, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation 2010. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.

Chrupala, G. (2014). Normalizing Tweets with Edit Scripts and Recurrent Neural Embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 680–6.

Daelemans, W., Groenewald, H. J., and Van Huyssteen, G. B. (2009). Prototype-based Active Learning for Lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R., and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.

De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.

Dieleman, S. and Schrauwen, B. (2014). End-to-end Learning for Music Audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.

Ernst-Gerlach, A. and Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.

Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.

Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W., and Van de Walle, R. (2014). Alleviating Manual Feature Engineering for Part-of-speech Tagging of Twitter Microposts Using Distributed Word Representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, s.p.

Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.

Gysseling, M. (1980-1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.

Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8): 1735–80.

Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.

Kestemont, M., Daelemans, W., and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., and Henderson, D. (1990). Handwritten Digit Recognition with a Back-propagation Network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.

Levy, O. and Golberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.

Ljubesic, N., Erjavec, T., and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.

Lyras, D. P., Sgarbas, K. N., and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.

Manning, C. D., Prabhaker, P., and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.

Mitankin, P., Gerjikov, S., and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.

Nair, V. and Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I., and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H., and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W., and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C., and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885-1929) Using Pattern Matching. In Kiefer, F., Kiss, G., and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes

1 An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.



label which unambiguously links words to the same entry (headword) in a lexical resource, such as a dictionary, if they only differ in inflection or spelling (Knowles and Mohd Don, 2004).

In the present article, we conceive of lemmatization as a high-dimensional classification task, where lemmas are assigned to tokens as class labels (i.e. dictionary headwords are the system's output). Therefore, our lemmatizer does not perform any explicit PoS disambiguation (although the system could easily be trained on PoS labels too). Because of this classification setting, the proposed lemmatizer will not be able to predict new, unseen lemmas at test time, because it has not seen these labels during training (cf. Chrupala et al., 2008). However, this also prevents the lemmatizer from outputting non-existent 'gibberish' labels, which is desirable given the limited size of our data sets. Lemmatization too has been an active area of research in computational linguistics, especially for languages with a rich inflectional system (Daelemans et al., 2009; Scherrer and Erjavec, 2013; De Pauw and De Schryver, 2008; Lyras et al., 2008). Many studies apply a combination of existing taggers (for contextual disambiguation) with an optimized spelling normalization component to better deal with unknown words.

One rough distinction which could be made is between 'online' and 'offline' approaches. In 'online' approaches, when confronted with an unknown word at the testing stage, systems will explicitly attempt to identify the unknown word as a spelling variant of a known token (i.e. a token which was seen during training), before applying any disambiguation routines. To this end, researchers typically apply a combination of (weighted) string distance measures (e.g. Levenshtein distance, Levenshtein 1966). While the online strategy might result in a high recall, it can also be expensive to apply, because the unknown token has to be matched with all known tokens, and some string distance measures can be costly to apply. The offline strategy follows a different road: at training time, it will attempt to generate new, plausible spelling variants of known tokens, through aligning words and applying common edit patterns to generate new forms. At the testing stage, new words can simply be matched against the newly generated forms. Interestingly, this approach will be fast to apply during testing (and might result in a high precision), but it might also suffer from the noise being introduced in the variant generation phase (e.g. overlapping variants).1 Van Halteren and Rem (2013) have reported the successful application of such an approach, which is in many ways reminiscent of the generation of search query variants in Information Retrieval research.

3 'Deep' Representation Learning

In this article, we describe a novel system for lemmatization using 'deep' representation learning, which has barely been applied to the problem of lemmatization for historic and/or low-resource languages specifically. Importantly, we will show that a single system reaches state-of-the-art tagging performance across a variety of data sets, even without the problem-specific fine-tuning of hyper-parameters. Deep learning is a popular paradigm in machine learning (LeCun et al., 2015). It is based on the general idea of multi-layer neural networks: software systems in which information is propagated through a layered architecture of inter-connected neurons (or information units) that iteratively transform the input information and feed it to subsequent layers in the networks. While theoretically the idea of neural networks has been around for a long time, it has only recently become feasible to train more complex architectures, because of the millions of parameters that typically have to be optimized during training.

Deep learning is part of a broader field called representation learning or feature learning (Bengio et al., 2013). Deep learning can indeed be contrasted with more traditional approaches in machine learning, in that it is very much geared towards finding good representations of the input data, or extracting 'features' from it. Roughly put, traditional learning algorithms would be very much dependent on a researcher's representation of the input data and would try to solve a task on the basis of the particular features suggested and manually engineered by the researchers. In Deep Learning, the process of feature engineering is to a larger extent outsourced to the learning algorithm itself. Instead of solely relying on the pre-specified features in the input


data (i.e. hand-crafted by humans), deep neural networks will attempt to independently recombine the existing input features into new combinations of features, which typically offer a better representation of the input data. They do so by iteratively projecting the input onto subsequent layers of information units, which are typically said to increasingly capture more abstract and complex patterns in the input data (Bengio et al., 2013).

The intuition behind deep learning is typically illustrated using an example from computer vision, the field where some of the earliest breakthroughs in Deep Learning have been reported (e.g. handwritten digit recognition, LeCun et al., 1990). In computer vision, images are typically represented as a two-dimensional raster of pixels, which can take certain values across a series of input channels (e.g. one for every colour in the RGB spectrum). When propagating an input image through a layered neural network, it has been shown that the earliest layers in the network will be receptive to very raw, primitive shapes in the data, such as a corner, parts of a curve, or a stark light-dark contrast, called 'edges' (LeCun et al., 2015). Only at subsequent layers in the network are these primitive forms recombined into higher-level units, such as an eye or ear. At still higher layers, the network will eventually become sensitive to complex units, such as faces. It can therefore be said that such neural networks are able to learn increasingly abstract (re)combinations of the original features in the input (Bengio et al., 2013). Interestingly, it has been shown that visual perception in mammals works in a highly similar, hierarchical fashion (Cahieu et al., 2014). Because of its hierarchical nature, this form of learning is commonly called 'deep', since it increasingly captures 'deeper' aspects of the problem it is working on.

4 Architecture

The system architecture we describe here primarily builds upon two basic components derived from recent studies in Deep Learning for NLP: (i) one-dimensional or 'temporal' 'convolutions' to model the orthography of input words at the character level, and (ii) word 'embeddings' to model the lexical context surrounding the input tokens for the purpose of disambiguation. We will first introduce the type of 'subnets' associated with both components, before we go on to discuss how our architecture eventually combines these subnets.

4.1 Convolutions

Convolutions are a popular approach in computer vision to battle the problem of 'translations' in images (LeCun et al., 1990, 2015). Consider an object classification system whose task it is to detect the presence of certain objects in an image. If, during training, the system has come across e.g. a banana in the upper-left corner of a picture, we would like the system to be able to detect the same object in new images, even if the exact location of that object has shifted (e.g. towards the lower-right corner of the image). Convolutions can be thought of as a fixed-size window (e.g. 5 × 5 pixels) which is slid over the entire input image in a stepwise fashion. Convolutions are therefore also said to learn 'filters', in that they learn a filtering mechanism to detect the presence of certain features, irrespective of the exact position in which these features occur in the image. Importantly, each filter will scan the entire image for the local presence of a particular low-level feature (e.g. the yellow-coloured, curved edge of the banana) and detect it irrespective of the exact position of the feature (e.g. lower-right corner versus upper-left corner). Typically, convolutions are applied as the initial layers in a neural network, before passing on the activation of filters to subsequent layers in the network, where the detected features can be recombined into more complex features (e.g. an entire banana). Whereas convolutions in vision research are typically two-dimensional (cf. the raster of pixels), here we apply convolutions over a single dimension, namely the sequence of characters in an input word. The idea of applying convolutions to character series in NLP has recently been pioneered by Zhang et al. (2015), reporting strong results across a variety of higher-end text classification tasks. Such convolutions have also been called 'temporal' convolutions, because they model a linear sequence of items.


Representing input words as a sequence of characters (i.e. at the sub-word level) has a considerable advantage over more traditional approaches in sequence tagging. Here, input words are typically represented using a 'one-hot vocabulary encoding': given an indexed vocabulary of n known tokens, each token is represented as a long vector of n values, in which the index of one token is set to 1, whereas all other values are set to 0. Naturally, this leads to very sparse, high-dimensional word representations. In such a one-hot representation, it is therefore not uncommon to apply a cut-off and only encode the more frequent vocabulary items, to battle the sparsity of the input data. Of course, a one-hot vector representation (especially with a severe cut-off) cannot represent new, unseen tokens at test time, since these do not form part of the indexed training vocabulary. Most systems for tasks like PoS tagging would in those cases back off to additional (largely hand-crafted) features to represent the target token, such as the final character trigram of the token, to capture the token's inflection (e.g. Zavrel and Daelemans, 1999).

The character-level system presented here does not struggle to model out-of-vocabulary words. This is a worthwhile characteristic with respect to the present use case: in languages with a rich orthographical and morphological variation, the dimensionality of word-level one-hot vectors dramatically increases, because the token vocabulary is much larger, and thus the effect of applying a cut-off will also be more dramatic (Kestemont et al., 2010). Additionally, the proportion of unknown target tokens at test time is of course much more prominent than in traditional languages, due to spelling variation. Therefore, our architecture aims to complement or even replace a conventional one-hot encoding with a convolutional component.

With respect to the convolutional subnet, our system models each input token as a sequence of characters, in which each character is represented as a one-hot vector at the character level. As the example in Table 1 shows, this leads to an at-first-sight primitive, yet much lower-dimensional token representation compared to a traditional token-level one-hot encoding, which additionally is able to represent out-of-vocabulary items. We include all characters in the character index used. We specify a certain threshold t (e.g. 15 characters): words longer than this threshold are truncated to the right to length t, and shorter words are padded with zero-filled vectors. In an analogy to computer vision research, this representation models characters as if they were 'pixels', with a channel dedicated to a single character.

Our model now slides convolutional filters over this two-dimensional representation, with a length of 3 and a stride (or 'step size') of 1. In practice, this means that the convolutional layer will iteratively inspect partially overlapping, consecutive character trigrams in the input string. This representation is inspired by the generic approach to sequence modelling in NLP followed by Zhang et al. (2015; character-level convolutions), as well as Dieleman and Schrauwen (2014; one-dimensional convolutions for music) and Kalchbrenner et al. (2014; word-level convolutions on the basis of pretrained embeddings). At the input stage, each target token is represented by an n × t matrix, where n represents the number of distinct characters in a data set and t is the uniform length to which each target token is normalized, through padding with trailing zeroes or through cutting characters.
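A minimal sketch of this character-level representation and the temporal convolution over it is given below; the sizes are illustrative and the keras layer names reflect current releases rather than the authors' exact code. The token matrix is laid out as t time steps by n character channels, which is the input convention of a one-dimensional convolution layer.

```python
# Sketch: one-hot character encoding of a token (t x n matrix) and a 1-D
# convolutional layer with filters of length 3 and stride 1 sliding over it.
import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

def encode_token(token, char_index, t=15):
    matrix = np.zeros((t, len(char_index)), dtype='float32')
    for pos, char in enumerate(token[:t]):      # truncate tokens longer than t
        matrix[pos, char_index[char]] = 1.0     # one channel per character
    return matrix                               # shorter tokens keep zero padding

n_chars = 30                                    # corpus-dependent alphabet size (illustrative)
model = Sequential()
model.add(Conv1D(filters=1000, kernel_size=3, strides=1, activation='relu',
                 input_shape=(15, n_chars)))    # character-trigram filters
model.add(Flatten())
model.add(Dense(1024, activation='relu'))       # dense layer on top of the convolutions
```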

We motivate the choice for a convolutional approach to model orthography on the basis of the following considerations. When using a filter size of e.g. 3, the receptive field size of the learned filters roughly corresponds to that of syllable-like groups in many languages (e.g. consonant-vowel-consonant groups). One would therefore expect the filters to become sensitive to morpheme-like characteristics. The most interesting aspect of convolutions seems to be that the learned filters are detected across all positions in the input string, which is likely to render the detection of morphemes less sensitive to their exact position in a word. This is a valuable property for our corpora, since the non-standard orthography often allows the introduction of silent characters (e.g. 'gh' for 'h'), causing local character shifts in the input, or the kind of 'translations' in computer vision to which we know convolutions are fairly robust ('gelik' vs 'ghelik').2 Finally, it should be stressed that the learned filters can be expected to offer a flexible string representation: a


filter which learned to detect the pattern 'dae' might still be responsive to the pattern 'dai', which is useful since 'i' and 'e' were often used interchangeably as vowel-lengthening graphemes in the medieval language considered here.

4.2 Embeddings

Many sequence taggers crucially depend on contextual information for disambiguating tokens. In the classic example 'The old man the boat', contextual information is needed to determine that 'man' in this sentence should be tagged as 'verb' instead of the more common 'noun' tag for this token. Most modern taggers therefore integrate a lexical representation of the immediate context of an input token, e.g. the two tokens preceding the input token to the left and a single token to the right. Roughly put, traditional architectures will typically represent this context using a one-hot encoding at the token level: for each token in the positional skeleton surrounding the input token, a high-dimensional vector will record its presence using the index of the previously indexed training tokens (e.g. Zavrel and Daelemans, 1999). To battle sparsity in the contextual representation too, it is again not uncommon to apply a frequency-based cut-off here, or to only record the presence of a small set of functors surrounding the input token, or even hand-crafted features (such as the last two characters of the word preceding the input string, as they often coincide with morphological suffixes).

Because of the sparsity associated with one-hot encodings, research in NLP has stressed the need for smoother, distributional representations of lexical contexts in this respect. As the size of the context windows considered increases, it becomes increasingly difficult to obtain good, dense representations of contexts, especially if one restricts the model to strict n-gram matching (Bengio et al., 2003). A simple example: in a word sequence like 'the black fat cat' or 'the cute white cat', the PoS category of 'cat' can be partially inferred from the fact that two adjectives precede the target token. When presented with the new examples 'the fat white cat' or 'the cute fat cat', the left contexts 'fat white' and 'cute fat' will fail to find a hard match to one of the training contexts in the case of a strict matching process. Naturally, we would like a model to be robust in the face of such naive transpositions.

An important line of research in current NLP therefore concerns the development of models which can represent words not in terms of a simple one-hot encoding, but by using smooth, lower-dimensional word representations. Much of this research can be situated in the field of distributional semantics, being guided by the so-called 'Distributional Hypothesis': that words in themselves do not have real meaning in isolation, but that words primarily derive meaning from the words they tend to co-occur with, i.e. their lexical context (Harris, 1954; Firth, 1957). While blarf is a non-existing word, its use in the following sentences suggests that it refers to some sort of domestic animal, perhaps a dog: 'I'm letting the blarf out', 'I'm feeding the blarf', 'The blarf ate my homework'. Thus, by modelling patterns of word co-occurrences in large corpora, research has demonstrated that various unsupervised techniques can yield numeric word representations that offer useful approximations of the meaning of words (Baroni and Lenci,

Table 1 Illustration of the character-level representation of input tokens. Dummy example with length threshold t set at 15 and a restricted character vocabulary of size 7, for the Middle Dutch word coninghinne ('queen').

char   c o n i n g h i n n e - - - -
c      1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
e      0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
g      0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
h      0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
i      0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
n      0 0 1 0 1 0 0 0 1 1 0 0 0 0 0
o      0 1 0 0 0 0 0 0 0 0 0 0 0 0 0


2010). The vectors which make up such word representations are also called 'word embeddings', because they reflect how words are contextually 'embedded' in corpora.

The past years have witnessed a clear increase in the interest in distributional semantics, in particular in the field of deep learning for NLP. Mikolov et al. (2013) have introduced an influential 'skipgram' method to obtain word vectors, commonly known as 'word2vec'. Because the underlying method attempts to optimize a fairly simple training criterion, it can easily be applied to vast quantities of text, yielding vectors which have shown excellent performance in a variety of tasks. In one popular example, Mikolov et al. (2013) even showed that it is possible to solve relatively advanced analogical problems by applying plain vector arithmetic to the word embeddings obtained from large corpora. The result of the arithmetic expression 'vector(king) - vector(man) + vector(woman)', for instance, yielded a vector which was closest to that of the word 'queen'. Other illustrative examples include 'vector(russia) + vector(river) ≈ vector(wolga)' and 'vector(paris) - vector(france) + vector(belgium) ≈ vector(brussels)'.
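With a pretrained model loaded in gensim, such analogies reduce to a single call; the file name below is only a placeholder for whichever pretrained vectors happen to be at hand.

```python
# Sketch: solving 'king - man + woman' by vector arithmetic over pretrained embeddings.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)  # placeholder path
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# expected to return something close to [('queen', ...)]
```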

Follow-up studies have stressed that other techniques can yield similar results, and that the theoretical foundations of the word2vec algorithm still need some clarification (Levy and Golberg, 2014). Nevertheless, the fact that Mikolov et al. published a highly efficient open-source implementation of the algorithm has greatly contributed to the state of the art in the field of word representation learning.3 An increasing number of applications in NLP include some form of word embedding strategy in their pipeline as a 'secret sauce' (Manning, 2015). An essential quality of modern word embeddings is that it has been shown that they are not restricted to semantic aspects of words, but that they also model morpho-syntactic qualities of words, illustrated by the fact that purely syntactic analogies can often also be solved: 'vector(her) - vector(she) + vector(he) ≈ vector(his)'.

The number of recent studies using word embeddings in NLP research is vast. A relevant example for the present research is, for instance, the type of word embeddings used in the Twitter tagger

Table 2 Overview of some of the metadata and characteristics of the Middle Dutch corpora used in this study.

Corpus-Gysseling, Literary texts (CG-LIT). Origin: Institute for Dutch Lexicography. Number of individual texts: 301. Text variety: miscellaneous literary texts (mostly rhymed) which survive from original manuscripts dated before 1300 AD.

Corpus-Gysseling, Administrative texts (CG-ADMIN). Origin: Institute for Dutch Lexicography. Number of individual texts: 574. Text variety: 13th-century charters.

Miscellaneous religious texts (RELIG). Origin: CLiPS Computational Linguistics Group and Ruusbroec Institute, University of Antwerp. Number of individual texts: 336. Text variety: miscellaneous religious texts (e.g. Bible translations, mystical visions, sermons), spanning multiple centuries and regions.

Van Reenen-Mulder-Adelheid-corpus (ADELHEID). Origin: Radboud University Nijmegen etc. Number of individual texts: 792. Text variety: 14th-century charters.


described by Godin et al. (2014), but the literature abounds in other examples. In our system, we train a vanilla word2vec model using a popular implementation of the original skipgram model.4 This model learns to project the tokens in our training data into a vector space consisting of 150 dimensions. The general intuition underlying the embeddings subnet in our architecture is that the underlying skipgram model will have learned similar embeddings for words that have certain similar qualities, both at the semantic and the morpho-syntactic level. In the 'cat' example above, we might expect that a model would learn similar embeddings for 'white' and 'black', because they are both adjectives referring to colours. Additionally, an interesting aspect for our medieval data is that the word embeddings learned might also be sensitive to orthographic or dialectical variation, providing similar embeddings for words that are spelling variants of each other, or words that are typical of specific dialects.
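Pretraining such a skipgram model with gensim might look as follows; apart from the 150 dimensions mentioned in the text, the parameter values are illustrative, and newer gensim releases use vector_size instead of size.

```python
# Sketch: pretraining 150-dimensional skipgram embeddings on the tokenized training corpus.
from gensim.models import Word2Vec

# sentences: a list of token lists from the training corpus (assumed to exist)
w2v = Word2Vec(sentences, size=150, sg=1, window=5, min_count=2, workers=4)
w2v.save('skipgram_middle_dutch.model')
```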

For the subnets representing the lexical context on either side, we first construct a simple one-hot encoding of the context tokens, with a traditional minimum frequency-based token cut-off of two occurrences. We then project these tokens onto a standard dense layer which has the same dimensionality as our skipgram model; here and elsewhere, such a dense layer takes the form of a simple matrix multiplication, the addition of a bias, and a nonlinearity (a 'rectified linear unit', Nair and Hinton, 2010). Importantly, this implies that we can initialize the weights of this layer using the pretrained embeddings from the skipgram model, but that these weights can be further refined during the training phase (see below). This initialization strategy can be expected to speed up the convergence of the model. We simply concatenate the dense vectors obtained for each context word on either side of the target token into a left-context subnet and a right-context subnet. The number of context words is of course a hyper-parameter which can be further tuned.
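One way to realize such an initialization (a sketch of an equivalent formulation, not the repository code) is to build the dense context layer first and then overwrite its weights with the pretrained skipgram matrix:

```python
# Sketch: a dense context subnet over a one-hot context token, initialized with the
# pretrained word2vec matrix so that the weights can be refined further during training.
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

vocab_size, dim = 20000, 150                                   # illustrative sizes
pretrained = np.zeros((vocab_size, dim), dtype='float32')      # stand-in for the word2vec matrix

context_in = Input(shape=(vocab_size,))                        # one-hot encoding of one context token
dense = Dense(dim, activation='relu')                          # same dimensionality as the skipgram model
context_out = dense(context_in)
dense.set_weights([pretrained, np.zeros(dim, dtype='float32')])  # kernel + bias initialization
context_subnet = Model(context_in, context_out)
```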

4.3 Combining the subnets

In our system (see Figure 1 for an overview), we again use a networked strategy to combine the convolutional subnet used to represent the focus token with the subnets representing the left and right context of the focus token. Next, we feed the output of the convolutional net forward in the network using a plain dense layer that has a dimensionality of size k (k = 1,024). At this point, the concatenated embeddings of the left and right context subnets have also been propagated to two vectors of size k. We can also include a one-hot encoding of the target tokens, which is propagated through a dense layer

Table 3 Tabular overview of some of the main token counts etc. of the Middle Dutch corpora used in this study. The 'upperbound unknown' column refers to the proportion of tokens that have lemmata which also occur in the corresponding training data set; this is the theoretically maximal score which a lemmatizer could achieve for the unknown tokens.

Corpus         Split   Words     Unique tokens   Unique lemmas   Tokens per lemma   % unknown tokens   Upperbound unknown (%)
relig          train   129144    14668           6103            2.65               NA                 NA
relig          dev     16143     3686            2031            1.93               6.93               68.78
relig          test    16143     3767            1999            2.01               7.44               72.86
crm-adelheid   train   575721    41668           11163           4.1                NA                 NA
crm-adelheid   dev     165883    17466           5547            3.42               6.46               78.05
crm-adelheid   test    80015     11645           4057            3.1                6.07               79.07
cg-lit         train   464706    47460           15454           3.38               NA                 NA
cg-lit         dev     58661     11557           5004            2.48               6.46               74.38
cg-lit         test    58903     11692           5001            2.51               6.65               74.94
cg-admin       train   518017    42694           13129           3.42               NA                 NA
cg-admin       dev     54290     8753            3527            2.57               6.39               74.76
cg-admin       test    65936     11318           4954            2.39               9.18               64.31


of size k. This final layer might be important to have the network obtain a good fit of known tokens; below, we will explore whether the convolutional layer alone has sufficient capacity to reliably represent the entire vocabulary of training tokens.

At this point, we have four subnets in the network, each of dimensionality k: one resulting from the one-hot encoding of the input token, one resulting from the convolutional encoding of the input token, and two resulting from the embeddings subnets for the left and right context. Finally, we concatenate the output of these subnets into a vector of size 4k and project it onto an output layer which has a unique output neuron for each class in the sequence modelling task. A standard 'softmax' normalization is applied to normalize the activations of these output neurons as if they were probabilities (summing to 1). All the parameter weights in this architecture are first initialized using a random distribution, except for the contextual subnets, which are initialized using the word2vec weights. We train the neural network using gradient descent. Briefly put, this learning algorithm will start by predicting a batch of the training examples using the randomly initialized weights. It will then calculate the prediction loss in which this results with respect to the correct labels, which will initially be extremely high. Then it will determine how each weight individually contributed to this loss and subsequently apply a small update to each weight which would have decreased the prediction error it caused.
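The following sketch shows how the four subnets could be wired together with the Keras functional API; apart from k = 1024, the filter settings, and the 150-dimensional embeddings mentioned in the text, all sizes are placeholder assumptions rather than the authors' exact configuration:

    import tensorflow as tf

    n_chars, max_len = 40, 15          # character alphabet size and padded token length (assumed)
    vocab_size, emb_dim = 5000, 150    # token vocabulary size (assumed) and embedding dimensionality
    n_context = 2                      # context words per side (a tunable hyper-parameter)
    n_lemmas = 10000                   # number of output classes, i.e. lemmas (assumed)
    k = 1024

    # (1) Convolutional subnet over the character-level one-hot matrix of the focus token.
    char_in = tf.keras.Input(shape=(max_len, n_chars))
    conv = tf.keras.layers.Conv1D(filters=5000, kernel_size=3, strides=1, activation="relu")(char_in)
    conv = tf.keras.layers.Flatten()(conv)
    conv = tf.keras.layers.Dense(k, activation="relu")(conv)

    # (2) Optional one-hot subnet for the focus token itself.
    tok_in = tf.keras.Input(shape=(vocab_size,))
    tok = tf.keras.layers.Dense(k, activation="relu")(tok_in)

    # (3) and (4) Left and right context subnets: the concatenated context embeddings,
    # here simplified to one pre-concatenated vector per side, projected to size k.
    left_in = tf.keras.Input(shape=(n_context * emb_dim,))
    right_in = tf.keras.Input(shape=(n_context * emb_dim,))
    left = tf.keras.layers.Dense(k, activation="relu")(left_in)
    right = tf.keras.layers.Dense(k, activation="relu")(right_in)

    # Concatenate the four subnets into a vector of size 4k and predict the lemma
    # with a softmax output layer (one neuron per lemma class).
    merged = tf.keras.layers.Concatenate()([conv, tok, left, right])
    output = tf.keras.layers.Dense(n_lemmas, activation="softmax")(merged)
    model = tf.keras.Model(inputs=[char_in, tok_in, left_in, right_in], outputs=output)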

During a fixed number of epochs (100 in our case), we feed the training data through this network, divided into a number of randomly jumbled mini-batches, each time tightening the fit of the network with respect to the correct class labels. During training (see Figure 2), the loss of the network is not measured in terms of accuracy but in terms of cross-entropy, a common loss measure for classification tasks. We adopt a mini-batch size of 50; this is a fairly small batch size compared to the literature, but this restricted batch size proved necessary to prevent the higher-frequency items from dominating the gradient updates. We use the adaptive, so-called Adadelta optimization mechanism (Zeiler, 2012). Importantly, the Adadelta routine removes the need to empirically set a learning rate at the beginning of training, since the algorithm will independently re-adapt this rate, keeping track of the loss history on a per-weight basis.
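A small, self-contained illustration of this training regime (Adadelta, cross-entropy loss, mini-batches of 50, and a fixed number of 100 epochs) is sketched below; the random toy data and the tiny stand-in model are assumptions that merely take the place of the real encoded corpus and architecture:

    import numpy as np
    import tensorflow as tf

    X = np.random.rand(500, 64).astype("float32")                          # placeholder features
    y = tf.keras.utils.to_categorical(np.random.randint(0, 10, 500), 10)   # placeholder labels

    toy_model = tf.keras.Sequential([
        tf.keras.Input(shape=(64,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    toy_model.compile(optimizer=tf.keras.optimizers.Adadelta(),   # per-weight adaptive learning rates
                      loss="categorical_crossentropy",            # the cross-entropy loss used above
                      metrics=["accuracy"])
    toy_model.fit(X, y, batch_size=50, epochs=100, shuffle=True, verbose=0)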

4.4 General comments
All dense connections in this network take the form of a matrix multiplication and the addition of a bias (X·W + b) and make use of a 'relu' activation ('rectified linear unit'), which will ignore negative values outputted by the dense layer by setting them to zero. Additionally, all dense layers make use of

Fig. 1 Schematic visualization of the network architecture evaluated in this article. Embedding layers are used on both sides of the focus token as 'subnets' to model the lexical context surrounding a token. The token itself is primarily represented using a convolutional subnet at the character level (and potentially also an embedding layer). The results of these subnets are concatenated into a single hidden representation, which is used to predict the correct lemma at the top layer of the network.


the so-called dropout procedure during training (with p = 0.50), meaning that in each mini-batch half of the connections in the dense layers are randomly dropped (Srivastava et al., 2014). This technique has various beneficial effects on training, the most important one being that it forces the network to be less dependent on the specific presence of certain features in the input, thereby avoiding overfitting on specific instances in the training data. An implementation of the proposed architecture is freely available from the public software repository associated with this paper.5 Where possible, this repository also holds the data used below. Our system is implemented in Python using the keras library, built on top of theano (Bastien et al., 2012); the latter library provides automatic differentiation for arbitrary graphs and allows code to be executed on the Graphics Processing Unit (GPU).6 Other major dependencies of our code include scikit-learn (Pedregosa et al., 2011) and gensim.

5 Data Sets

In this article, we report results on four different Middle Dutch data sets (see Table 2), which offer a representative example of a medieval European language (Van Kerckvoorde, 1992). We should stress that our system is fairly language-independent and could easily be retrained on other types of corpora, as long as they use a character alphabet. In this article, we first of all use two administrative data sets representing charter collections. The CG-ADMIN

contains the charter collection edited by Gysseling (1977), which has been digitized and annotated by the Institute for Dutch Lexicography (1998). These charters all survive in originals predating ca. 1300 AD. Secondly, we use the CRM-ADELHEID collection, a comparable collection of fourteenth-century Middle Dutch charters, which has been the subject of a study comparable to ours (Van Halteren and Rem, 2013). As literary materials, we first of all use the literary counterpart of CG-ADMIN in the Corpus-Gysseling, a collection of Middle Dutch literary texts that all survive in manuscript copies predating ca. 1300 AD (Gysseling, 1980-1987; Institute for Dutch Lexicography, 1998; Kestemont et al., 2010).

Thirdly, we also use a newly created, smaller corpus of Middle Dutch religious texts (e.g. sermons, Bible translations, visions, ...), most of which are at least semi-literary in nature. This last data set (RELIG) is contained in our code repository.

These data sets have all been annotated using common annotation guidelines, although they each show a number of important idiosyncrasies. Table 4 shows an example of the data. Generally speaking, we adopt the annotation practice which has been developed by Piet van Reenen et al. and which has been carefully documented in the context of the release of the Adelheid tagger-lemmatizer for Middle Dutch charters (Van Halteren and Rem, 2013).7

Each token in the corpora has been annotated with a normalized dictionary headform or lemma in a present-day spelling. Broadly speaking, lemmatization allows us to abstract over individual instances of tokens which only differ in inflection or orthography (Knowles and Mohd Don, 2004). For historical corpora, this has interesting applications in the context of database searching, (diachronic) topic modelling, or stylometry. The lemmas used in our corpora all correspond to entries in the Integrated Language Database (GTB) for the

Table 4 Example of some training data from the RELIG corpus, from the Seven Manieren van Heilige Minnen ('Seven Ways of Holy Love') by the mystical poetess Beatrice of Nazareth ('These are seven ways of loving, which come from the highest'). The first column shows the original input token, the second the lemma label. Note that some forms require a composite lemma (e.g. uit+de for uten).

Token      Lemma
seuen      zeven
maniren    manier
sin        zijn
van        van
minnen     minne
die        die
comen      komen
uten       uit+de
hoegsten   hoogste


Dutch language, which can be consulted online and which is maintained by the Institute for Dutch Lexicography.8 The major advantage of using a present-day lemma in this context is that this allows scholars to normalize texts from multiple historic stages of Dutch to the same variant. This in turn allows interesting diachronic analyses of Dutch corpora, such as the ones currently prepared in the large-scale Nederlab project.9

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach by Van Halteren and Rem (2013), to make our splitting as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters which have a rank index ending in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
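A sketch of this date-based splitting procedure, assuming a hypothetical list of (date, charter) pairs, could look as follows:

    def split_charters(charters):
        """charters: list of (date, charter) pairs; returns (train, dev, test) lists."""
        ranked = sorted(charters, key=lambda pair: pair[0])    # sort charters by historic date
        train, dev, test = [], [], []
        for rank, (_, charter) in enumerate(ranked, start=1):
            if rank % 10 == 9:        # rank index ending in 9 -> development set
                dev.append(charter)
            elif rank % 10 == 0:      # rank index ending in 0 -> held-out test set
                test.append(charter)
            else:                     # all remaining charters -> training set
                train.append(charter)
        return train, dev, test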

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that greatly vary in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00–1.00. In the development set we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less interesting. A clear advantage of this evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions of a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of the text and retrain the system before tackling the rest of the text.
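The positional split for the literary corpora can be sketched along the same lines (items stands for a hypothetical list of annotated items in text order):

    def split_literary(items):
        """Assign items to train/dev/test based on their normalized position in the text."""
        n = len(items)
        train, dev, test = [], [], []
        for i, item in enumerate(items):
            pos = i / (n - 1) if n > 1 else 0.0      # normalize the index to the range 0.00-1.00
            if 0.60 <= pos < 0.70:                   # indices between 0.60 and 0.70 -> development set
                dev.append(item)
            elif 0.70 <= pos < 0.80:                 # indices between 0.70 and 0.80 -> test set
                test.append(item)
            else:                                    # remaining items -> training material
                train.append(item)
        return train, dev, test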

7 Results

In Tables 5–8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch), which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest-neighbour search and, for unknown words, we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1,000 high-frequency function words to be included in the


features used for the contextual disambiguation. Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases. For the test

Fig. 2 Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Table 5 Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus          Train: All (%)   Dev: All (%)   Dev: Known (%)   Dev: Unknown (%)
RELIG           97.63            91.26          94.67            45.44
CG-LIT          96.88            91.78          94.56            51.45
CRM-ADELHEID    98.38            93.61          96.17            56.59
CG-ADMIN        98.89            95.44          97.89            59.48


results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known-word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown-word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that for large enough corpora the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is of course especially for the unknown words that the present optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens, such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5000 filters, but leave out any contextual features, to be able to zoom in on the convolutional aspects of the

Table 6 Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We included the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens.

                MBT                                    Our system
Corpus          All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG           88.50     94.03       19.65            90.97     94.50       47.04
CG-LIT          88.72     93.93       15.64            91.67     94.45       52.65
CRM-ADELHEID    91.87     95.90       29.44            93.95     96.15       59.86
CG-ADMIN        88.35     95.03       22.30            90.91     95.35       46.99


model only. The results are somewhat surprising, in that it appears that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information is able to be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the networks before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we added colours to the word labels according to this cluster analysis as an aid in reading. This visualization strongly suggests that morpho-syntactic, orthographic, as well as semantic qualities of words appear to be captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light-blue (top-left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word level

Table 7 Development results for two variants of the system; both variants have a convolutional subnet (5000 filters), but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

                Only convolutions (5000 filters)       Convolutions (5000 filters) and a one-hot encoding of the focus token
Corpus          All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG           89.45     92.69       45.97            87.59     91.67       32.83
CG-LIT          88.46     91.20       48.79            82.86     87.42       16.86
CRM-ADELHEID    89.58     92.28       50.57            87.96     91.73       33.38
CG-ADMIN        91.85     94.45       53.76            89.62     93.28       36.03

Table 8 Test results for two variants of the system; both variants have a convolutional subnet (5000 filters), but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

                Only convolutions (5000 filters)       Convolutions (5000 filters) and a one-hot encoding of the focus token
Corpus          All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG           88.95     92.38       46.38            87.37     91.53       35.55
CG-LIT          88.25     91.03       49.21            82.46     87.11       17.28
CRM-ADELHEID    89.91     92.24       53.88            88.63     92.00       36.40
CG-ADMIN        86.54     90.91       43.39            83.27     88.81       28.46


information: the spelling variants als and alse ('if'), for instance, are grouped. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light-blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
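The visualization workflow itself is straightforward to reproduce; a minimal sketch with scikit-learn and matplotlib, in which random vectors stand in for the learned embeddings, could look as follows:

    import numpy as np
    from sklearn.manifold import TSNE
    from sklearn.cluster import AgglomerativeClustering
    import matplotlib.pyplot as plt

    words = ["mattheus", "marcus", "seide", "sprac", "als", "alse"]      # toy vocabulary
    embeddings = np.random.rand(len(words), 150)                         # stand-in for word2vec vectors

    coords = TSNE(n_components=2, perplexity=2, random_state=1).fit_transform(embeddings)
    clusters = AgglomerativeClustering(n_clusters=3).fit_predict(coords)  # colour labels per cluster

    plt.scatter(coords[:, 0], coords[:, 1], c=clusters)
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y))
    plt.show()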

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun 'ick' ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Both morpho-syntactic, orthographic, and semantic qualities of words appear to be captured by the model.


consonant. Here too, the filters allow these characters to be detected, to some extent, in various slots, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, with which excellent results have recently been obtained in various tasks across NLP. Of course, it would be well feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores shown between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                                    Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                            1     i (1.08)    k (1.06)    c (0.81)
                                             2     k (1.71)    c (1.47)    i (0.67)
                                             3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack', 'aak', 'aac', ...)   1     a (0.69)    b (0.33)    l (0.30)
                                             2     a (0.61)    c (0.59)    m (0.56)
                                             3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)                1     u (0.41)    a (0.29)    k (0.25)
                                             2     s (0.59)    v (0.48)    u (0.47)
                                             3     f (1.40)    v (1.33)    u (0.72)


References
Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.
Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.
Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.
Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828.
Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edições Colibri, pp. 3–14.
Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edições Colibri, pp. 27–38.
Cahieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.
Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.
Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation 2010. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.
Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, MD: Association for Computational Linguistics, pp. 680–6.
Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based active learning for lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.
De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.
Dieleman, S. and Schrauwen, B. (2014). End-to-end learning for music audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.
Ernst-Gerlach, A. and Fuhr, N. (2006). Generating search term variants for text collections with historic spellings. In Lalmas, M., MacFarlane, A., Ruger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.
Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating manual feature engineering for part-of-speech tagging of Twitter microposts using distributed word representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, s.p.
Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden, 9 vols. The Hague: Nijhoff.
Gysseling, M. (1980-1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten, 6 vols. The Hague: Nijhoff.
Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.
Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8): 1735–80.
Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.
Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.
Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.
Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.
Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.
LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.
Levy, O. and Golberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.
Ljubesic, N., Erjavec, T. and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.
Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.
Manning, C. D., Prabhaker, P. and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.
Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.
Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.
Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In Furnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.
Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edições Colibri, pp. 87–98.
Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia: Association for Computational Linguistics, pp. 58–62.
Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.
Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.
Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.
Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.
Van der Voort van der Kleij, J. (2005). Reverse lemmatization of the Dictionary of Middle Dutch (1885-1929) using pattern matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.
Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.
Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.
Zavrel, J. and Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.
Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv: 1212.5701v1.
Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes
1 An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother'), or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.



data (i.e. hand-crafted by humans), deep neural networks will attempt to independently recombine the existing input features into new combinations of features, which typically offer a better representation of the input data. It does so by iteratively projecting the input onto subsequent layers of information units, which are typically said to capture increasingly abstract and complex patterns in the input data (Bengio et al., 2013).

The intuition behind deep learning is typically illustrated using an example from computer vision, the field where some of the earliest breakthroughs in Deep Learning have been reported (e.g. handwritten digit recognition, LeCun et al., 1990). In computer vision, images are typically represented as a two-dimensional raster of pixels, which can take certain values across a series of input channels (e.g. one for every color in the RGB spectrum). When propagating an input image through a layered neural network, it has been shown that the earliest layers in the network will be receptive to very raw, primitive shapes in the data, such as a corner, parts of a curve, or a stark light-dark contrast, called 'edges' (LeCun et al., 2015). Only at subsequent layers in the network are these primitive forms recombined into higher-level units, such as an eye or ear. At still higher layers, the network will eventually become sensitive to complex units, such as faces. It can therefore be said that such neural networks are able to learn increasingly abstract (re)combinations of the original features in the input (Bengio et al., 2013). Interestingly, it has been shown that visual perception in mammals works in a highly similar, hierarchical fashion (Cahieu et al., 2014). Because of its hierarchical nature, this form of learning is commonly called 'deep', since it increasingly captures 'deeper' aspects of the problem it is working on.

4 Architecture

The system architecture we describe here primarily builds upon two basic components derived from recent studies in Deep Learning for NLP: (i) one-dimensional or 'temporal' 'convolutions' to model the orthography of input words at the character level, and (ii) word 'embeddings' to model the lexical context surrounding the input tokens for the purpose of disambiguation. We will first introduce the type of 'subnets' associated with both components, before we go on to discuss how our architecture eventually combines these subnets.

4.1 Convolutions
Convolutions are a popular approach in computer vision to battle the problem of 'translations' in images (LeCun et al., 1990, 2015). Consider an object classification system whose task it is to detect the presence of certain objects in an image. If, during training, the system has come across e.g. a banana in the upper-left corner of a picture, we would like the system to be able to detect the same object in new images, even if the exact location of that object has shifted (e.g. towards the lower-right corner of the image). Convolutions can be thought of as a fixed-size window (e.g. 5 x 5 pixels) which gets slid over the entire input image in a stepwise fashion. Convolutions are therefore also said to learn 'filters', in that they learn a filtering mechanism to detect the presence of certain features, irrespective of the exact position in which these features occur in the image. Importantly, each filter will scan the entire image for the local presence of a particular low-level feature (e.g. the yellow-coloured, curved edge of the banana) and detect it irrespective of the exact position of the feature (e.g. lower-right corner versus upper-left corner). Typically, convolutions are applied as the initial layers in a neural network, before passing on the activation of filters to subsequent layers in the network, where the detected features can be recombined into more complex features (e.g. an entire banana). Whereas convolutions in vision research are typically two-dimensional (cf. the raster of pixels), here we apply convolutions over a single dimension, namely the sequence of characters in an input word. The idea to apply convolutions to character series in NLP has recently been pioneered by Zhang et al. (2015), reporting strong results across a variety of higher-end text classification tasks. Such convolutions have also been called 'temporal' convolutions, because they model a linear sequence of items.
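A minimal sketch of such a one-dimensional convolution over a single character sequence, with purely illustrative (assumed) sizes, is given below:

    import numpy as np
    import tensorflow as tf

    n_chars, max_len, n_filters = 30, 15, 100                  # assumed alphabet size, length, filters
    word = np.zeros((1, max_len, n_chars), dtype="float32")    # one word as a one-hot character matrix

    conv = tf.keras.layers.Conv1D(filters=n_filters, kernel_size=3, strides=1, activation="relu")
    features = conv(word)   # shape (1, 13, 100): every filter scans each consecutive character trigram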


Representing input words as a sequence of characters (i.e. at the sub-word level) has a considerable advantage over more traditional approaches in sequence tagging. Here, input words are typically represented using a 'one-hot vocabulary encoding': given an indexed vocabulary of n known tokens, each token is represented as a long vector of n values, in which the index of one token is set to 1, whereas all other values are set to 0. Naturally, this leads to very sparse, high-dimensional word representations. In such a one-hot representation, it is therefore not uncommon to apply a cut-off and only encode the more frequent vocabulary items, to battle the sparsity of the input data. Of course, a one-hot vector representation (especially with a severe cut-off) cannot represent new, unseen tokens at test time, since these do not form part of the indexed training vocabulary. Most systems for tasks like PoS tagging would in those cases back off to additional (largely hand-crafted) features to represent the target token, such as the final character trigram of the token, to capture the token's inflection (e.g. Zavrel and Daelemans, 1999).

The character-level system presented here does not struggle to model out-of-vocabulary words. This is a worthwhile characteristic with respect to the present use case: in languages with rich orthographical and morphological variation, the dimensionality of word-level one-hot vectors dramatically increases, because the token vocabulary is much larger, and thus the effect of applying a cut-off will also be more dramatic (Kestemont et al., 2010). Additionally, the proportion of unknown target tokens at test time is of course much more prominent than in traditional languages, due to spelling variation. Therefore, our architecture aims to complement or even replace a conventional one-hot encoding with a convolutional component.

With respect to the convolutional subnet, our system models each input token as a sequence of characters, in which each character is represented as a one-hot vector at the character level. As the example in Table 1 shows, this leads to an at-first-sight primitive, yet much lower-dimensional token representation compared to a traditional token-level one-hot encoding, which additionally is able to represent out-of-vocabulary items. We include all characters in the character index used. We specify a certain threshold t (e.g. 15 characters); words longer or shorter than this threshold are respectively truncated to the right to length t, or are padded with zero-filled vectors. In analogy to computer vision research, this representation models characters as if they were 'pixels', with a channel dedicated to a single character.
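Such an encoding can be sketched in a few lines; the toy alphabet mirrors the restricted character vocabulary of Table 1:

    import numpy as np

    def encode_token(token, alphabet, t=15):
        """Encode a token as a one-hot character matrix of shape (len(alphabet), t)."""
        char_idx = {char: i for i, char in enumerate(alphabet)}
        matrix = np.zeros((len(alphabet), t), dtype="float32")
        for pos, char in enumerate(token[:t]):      # truncate tokens longer than t to the right
            if char in char_idx:                    # characters outside the index stay all-zero
                matrix[char_idx[char], pos] = 1.0
        return matrix                               # shorter tokens keep trailing zero-filled columns

    m = encode_token("coninghinne", alphabet="ceghino")   # reproduces the layout of Table 1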

Our model now slides convolutional filters over this two-dimensional representation, with a length of 3 and a stride (or 'step size') of 1. In practice, this will mean that the convolutional layer will iteratively inspect partially overlapping, consecutive character trigrams in the input string. This representation is inspired by the generic approach to sequence modelling in NLP followed by Zhang et al. (2015, character-level convolutions), as well as Dieleman and Schrauwen (2014, one-dimensional convolutions for music) and Kalchbrenner et al. (2014, word-level convolutions on the basis of pretrained embeddings). At the input stage, each target token is represented by an n x t matrix, where n represents the number of distinct characters in a data set and t is the uniform length to which each target token is normalized, through padding with trailing zeroes or through cutting characters.

We motivate the choice for a convolutional approach to model orthography on the basis of the following considerations. When using a filter size of e.g. 3, the receptive field of the learned filters roughly corresponds to that of syllable-like groups in many languages (e.g. consonant-vowel-consonant groups). One would therefore expect the filters to become sensitive to morpheme-like characteristics. The most interesting aspect of convolutions seems to be that the learned filters are detected across all positions in the input string, which is likely to render the detection of morphemes less sensitive to their exact position in a word. This is a valuable property for our corpora, since the non-standard orthography often allows the introduction of silent characters (e.g. 'gh' for 'h'), causing local character shifts in the input, or the kind of 'translations' in computer vision to which we know convolutions are fairly robust ('gelik' vs 'ghelik').2 Finally, it should be stressed that the learned filters can be expected to offer a flexible string representation: a


filter which learned to detect the pattern 'dae' might still be responsive to the pattern 'dai', which is useful since 'i' and 'e' were often used interchangeably as vowel-lengthening graphemes in the medieval language considered here.

4.2 Embeddings
Many sequence taggers crucially depend on contextual information for disambiguating tokens. In the classic example 'The old man the boat', contextual information is needed to determine that 'man' in this sentence should be tagged as 'verb', instead of the more common 'noun' tag for this token. Most modern taggers therefore integrate a lexical representation of the immediate context of an input token, e.g. the two tokens preceding the input token to the left and a single token to the right. Roughly put, traditional architectures will typically represent this context using a one-hot encoder at the token level: for each token in the positional skeleton surrounding the input token, a high-dimensional vector will record its presence using the index of the previously indexed training tokens (e.g. Zavrel and Daelemans, 1999). To battle sparsity in the contextual representation too, it is again not uncommon to apply a frequency-based cut-off here, or to only record the presence of a small set of functors surrounding the input token, or even hand-crafted features (such as the last two characters of the word preceding the input string, as they often coincide with morphological suffixes).

Because of the sparsity associated with one-hot encodings, research in NLP has stressed the need for smoother, distributional representations of lexical contexts in this respect. As the size of the context windows considered increases, it becomes increasingly difficult to obtain good, dense representations of contexts, especially if one restricts the model to strict n-gram matching (Bengio et al., 2003). A simple example: in a word sequence like 'the black fat cat' or 'the cute white cat', the PoS category of 'cat' can be partially inferred from the fact that two adjectives precede the target token. When presented with the new examples 'the fat white cat' or 'the cute fat cat', the left contexts 'fat white' and 'cute fat' will fail to find a hard match to one of the training contexts in the case of a strict matching process. Naturally, we would like a model to be robust in the face of such naive transpositions.

An important line of research in current NLP therefore concerns the development of models which can represent words not in terms of a simple one-hot encoder, but by using smooth, lower-dimensional word representations. Much of this research can be situated in the field of distributional semantics, being guided by the so-called 'Distributional Hypothesis': that words in themselves do not have real meaning in isolation, but that words primarily derive meaning from the words they tend to co-occur with, i.e. their lexical context (Harris, 1954; Firth, 1957). While blarf is a non-existing word, its use in the following sentences suggests that it refers to some sort of domestic animal, perhaps a dog: 'I'm letting the blarf out', 'I'm feeding the blarf', 'The blarf ate my homework'. Thus, by modelling patterns of word co-occurrences in large corpora, research has demonstrated that various unsupervised techniques can yield numeric word representations that offer useful approximations of the meaning of words (Baroni and Lenci,

Table 1 Illustration of the character-level representation of input tokens. Dummy example with length threshold t set at 15 and a restricted character vocabulary of size 7, for the Middle Dutch word coninghinne ('queen').

char   c  o  n  i  n  g  h  i  n  n  e  -  -  -  -
c      1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
e      0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
g      0  0  0  0  0  1  0  0  0  0  0  0  0  0  0
h      0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
i      0  0  0  1  0  0  0  1  0  0  0  0  0  0  0
n      0  0  1  0  1  0  0  0  1  1  0  0  0  0  0
o      0  1  0  0  0  0  0  0  0  0  0  0  0  0  0


2010). The vectors which make up such word representations are also called 'word embeddings', because they reflect how words are contextually 'embedded' in corpora.

The past years have witnessed a clear increase in the interest in distributional semantics, in particular in the field of deep learning for NLP. Mikolov et al. (2013) have introduced an influential 'skipgram' method to obtain word vectors, commonly known as 'word2vec'. Because the underlying method attempts to optimize a fairly simple training criterion, it can easily be applied to vast quantities of text, yielding vectors which have shown excellent performance in a variety of tasks. In one popular example, Mikolov et al. (2013) even showed that it is possible to solve relatively advanced analogical problems by applying plain vector arithmetic to the word embeddings obtained from large corpora. The result of the arithmetic expression 'vector(king) - vector(man) + vector(woman)', for instance, yielded a vector which was closest to that of the word 'queen'. Other illustrative examples include 'vector(russia) + vector(river) ≈ vector(wolga)' and 'vector(paris) - vector(france) + vector(belgium) ≈ vector(brussels)'.
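This arithmetic is easy to reproduce with pretrained vectors; in the sketch below, a small publicly downloadable embedding set from gensim's downloader stands in for a word2vec model (an assumption: any vectors exposing most_similar behave the same way):

    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")     # small pretrained vectors (downloaded on first use)
    result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
    print(result)                                    # typically returns a word close to 'queen'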

Follow-up studies have stressed that other techniques can yield similar results and that the theoretical foundations of the word2vec algorithm still need some clarification (Levy and Golberg, 2014). Nevertheless, the fact that Mikolov et al. published a highly efficient, open-source implementation of the algorithm has greatly contributed to the state of the art in the field of word representation learning.3 An increasing number of applications in NLP include some form of word embedding strategy in their pipeline as a 'secret sauce' (Manning, 2015). An essential quality of modern word embeddings is that it has been shown that they are not restricted to semantic aspects of words, but that they also model morpho-syntactic qualities of words, illustrated by the fact that purely syntactic analogies often can also be solved: 'vector(her) - vector(she) + vector(he) ≈ vector(his)'.

The number of recent studies using word embeddings in NLP research is vast. A relevant example for the present research includes, for instance, the type of word embeddings used in the Twitter tagger described by Godin et al. (2014).

Table 2 Overview of some of the metadata and characteristics of the Middle Dutch corpora used in this study.

Name: Corpus-Gysseling, Literary texts | Corpus-Gysseling, Administrative texts | Miscellaneous religious texts | Van Reenen-Mulder-Adelheid-corpus
Abbreviation: CG-LIT | CG-ADMIN | RELIG | ADELHEID
Origin: Institute for Dutch Lexicography | Institute for Dutch Lexicography | CLiPS Computational Linguistics Group and Ruusbroec Institute, University of Antwerp | Radboud University Nijmegen etc.
Individual texts: 301 | 574 | 336 | 792
Text variety: Miscellaneous literary texts (mostly rhymed) which survive from original manuscripts dated before 1300 AD | 13th century charters | Miscellaneous religious texts (e.g. Bible translations, mystical visions, sermons) spanning multiple centuries and regions | 14th century charters


In our system, we train a vanilla word2vec model using a popular implementation of the original skipgram model.4 This model learns to project the tokens in our training data into a vector space consisting of 150 dimensions. The general intuition underlying the embeddings subnet in our architecture is that the underlying skipgram model will have learned similar embeddings for words that have certain similar qualities, both at the semantic and the morpho-syntactic level: in the 'cat' example above, we might expect that a model would learn similar embeddings for 'white' and 'black', because they are both adjectives referring to colours. Additionally, an interesting aspect for our medieval data is that the word embeddings learned might also be sensitive to orthographic or dialectical variation, providing similar embeddings for words that are spelling variants of each other, or words that are typical of specific dialects.
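
A minimal sketch of this pretraining step with gensim (the implementation referred to in note 4) might look as follows; the tokenized sentences, window size, and frequency cut-off are assumptions, and the keyword names follow recent gensim releases rather than the version used at the time.

```python
from gensim.models import Word2Vec

# `sentences` is assumed to be the tokenized training corpus, one token list per line.
sentences = [["ende", "doe", "sprac", "die", "coninghinne"],
             ["doe", "seide", "die", "coninc"]]

w2v = Word2Vec(sentences,
               vector_size=150,   # dimensionality used in the article
               sg=1,              # skipgram variant
               window=5,          # assumed context window
               min_count=1)       # assumed frequency cut-off

pretrained = w2v.wv               # vectors used to initialise the context subnets
```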

For the subnets representing the lexical context on either side, we first construct a simple one-hot encoding of the context tokens, with a traditional minimum frequency-based token cut-off of two occurrences. We then project these tokens onto a standard dense layer which has the same dimensionality as our skipgram model; here and elsewhere, this projection takes the form of a simple matrix multiplication, the addition of a bias, and a nonlinearity (a 'rectified linear unit', Nair and Hinton, 2010). Importantly, this implies that we can initialize the weights of this layer using the pretrained embeddings from the skipgram model, but that these weights can be further refined during the training phase (see below). This initialization strategy can be expected to speed up the convergence of the model. We simply concatenate the dense vectors obtained for each context word on either side of the target token into a left-context subnet and a right-context subnet. The number of context words is of course a hyper-parameter which can be further tuned.
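
The sketch below spells out that projection for a single context token: a one-hot vector times a weight matrix (initialised from the skipgram vectors) plus a bias, followed by a rectified linear unit. Sizes and variable names are illustrative assumptions only.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

vocab_size, emb_dim = 8, 150                  # toy vocabulary, 150-dimensional projection
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, emb_dim))    # stand-in for the pretrained word2vec rows
b = np.zeros(emb_dim)

def project(token_id):
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0
    return relu(one_hot @ W + b)              # one-hot times W is simply a row lookup

# Two left-context tokens concatenated into the left-context subnet representation.
left_context = np.concatenate([project(3), project(5)])
```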

4.3 Combining the subnets

In our system (see Figure 1 for an overview), we again use a networked strategy to combine the convolutional subnet used to represent the focus token with the subnets representing the left and right context of the focus token. Next, we feed the output of the convolutional net forward in the network, using a plain dense layer that has a dimensionality of size k = 1024. At this point, the concatenated embeddings of the left and right context subnets have also been propagated to two vectors of size k. We can also include a one-hot encoding of the target tokens, which is propagated through a dense layer of size k.

Table 3 Tabular overview of some of the main token counts etc. of the Middle Dutch corpora used in this study. The 'upperbound unknown' column refers to the proportion of tokens that have lemmata which also occur in the corresponding training data set; this is the theoretically maximal score which a lemmatizer could achieve for the unknown tokens

Corpus         Split  Number of words  Unique tokens  Unique lemmas  Tokens per lemma  Perc. unknown tokens  Upperbound unknown
relig          train  129144           14668          6103           2.65              NA                    NA
relig          dev    16143            3686           2031           1.93              6.93                  68.78
relig          test   16143            3767           1999           2.01              7.44                  72.86
crm-adelheid   train  575721           41668          11163          4.1               NA                    NA
crm-adelheid   dev    165883           17466          5547           3.42              6.46                  78.05
crm-adelheid   test   80015            11645          4057           3.1               6.07                  79.07
cg-lit         train  464706           47460          15454          3.38              NA                    NA
cg-lit         dev    58661            11557          5004           2.48              6.46                  74.38
cg-lit         test   58903            11692          5001           2.51              6.65                  74.94
cg-admin       train  518017           42694          13129          3.42              NA                    NA
cg-admin       dev    54290            8753           3527           2.57              6.39                  74.76
cg-admin       test   65936            11318          4954           2.39              9.18                  64.31


This final layer might be important to have the network obtain a good fit of the known tokens; below, we will explore whether the convolutional layer alone has sufficient capacity to reliably represent the entire vocabulary of training tokens.

At this point, we have four subnets in the network, each of dimensionality k: one resulting from the one-hot encoding of the input token, one resulting from the convolutional encoding of the input token, and two resulting from the embeddings subnets for the left and right context. Finally, we concatenate the output of these subnets into a vector of size 4k and project it onto an output layer, which has a unique output neuron for each class in the sequence modelling task. A standard 'softmax' normalization is applied to normalize the activations of these output neurons as if they were probabilities (summing to 1). All the parameter weights in this architecture are first initialized using a random distribution, except for the contextual subnets, which are initialized using the word2vec weights. We train the neural network using gradient descent. Briefly put, this learning algorithm will start predicting a batch of the training examples using the randomly initialized weights. It will then calculate the prediction loss which these predictions incur with respect to the correct labels, which will initially be extremely high. Then it will determine how each weight individually contributed to this loss, and subsequently apply a small update to this weight which would have decreased the prediction error it caused.
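
Read together with Figure 1, a compact Keras sketch of this combination step could look as follows. All layer sizes, input shapes, and variable names are assumptions made for illustration (the context inputs stand in for the concatenated context-word projections of Section 4.2); the sketch does not reproduce the released implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed sizes: token vocabulary, character alphabet, token length, context width, lemma classes.
vocab_size, n_chars, t, ctx_dim, n_lemmas, k = 20000, 30, 15, 300, 10000, 1024

focus_onehot = keras.Input(shape=(vocab_size,), name="focus_onehot")
focus_chars  = keras.Input(shape=(t, n_chars), name="focus_chars")
left_ctx     = keras.Input(shape=(ctx_dim,),   name="left_context")
right_ctx    = keras.Input(shape=(ctx_dim,),   name="right_context")

# Character-level convolutions over the focus token (filter length 3, stride 1).
conv = layers.Conv1D(filters=5000, kernel_size=3, strides=1, activation="relu")(focus_chars)
conv = layers.Flatten()(conv)

# Each subnet is propagated to a dense vector of size k.
sub_onehot = layers.Dense(k, activation="relu")(focus_onehot)
sub_conv   = layers.Dense(k, activation="relu")(conv)
sub_left   = layers.Dense(k, activation="relu")(left_ctx)
sub_right  = layers.Dense(k, activation="relu")(right_ctx)

# Concatenate the four subnets (4 * k) and predict one lemma class with a softmax.
merged = layers.Concatenate()([sub_onehot, sub_conv, sub_left, sub_right])
output = layers.Dense(n_lemmas, activation="softmax")(merged)

model = keras.Model([focus_onehot, focus_chars, left_ctx, right_ctx], output)
```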

During a fixed number of epochs (100 in our case), we feed the training data through this network, divided in a number of randomly jumbled mini-batches, each time tightening the fit of the network with respect to the correct class labels. During training (see Figure 2), the loss of the network is not measured in terms of accuracy, but in terms of cross-entropy, a common loss measure for classification tasks. We adopt a mini-batch size of 50; this is a fairly small batch size compared to the literature, but this restricted batch size proved necessary to prevent the higher-frequency items from dominating the gradient updates. We use the adaptive, so-called Adadelta optimization mechanism (Zeiler, 2012). Importantly, the Adadelta routine prevents the need to empirically set a learning rate at the beginning of training, since the algorithm will independently re-adapt this rate, keeping track of the loss history on a per-weight basis.
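
Continuing the sketch above (with the same assumed `model` and placeholder data arrays), the training configuration described here (cross-entropy loss, the Adadelta optimizer, mini-batches of 50, and 100 epochs) might be expressed as follows.

```python
from tensorflow import keras

model.compile(optimizer=keras.optimizers.Adadelta(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# X_* and Y_* are assumed, preprocessed arrays; shuffling yields the
# randomly jumbled mini-batches mentioned above.
model.fit([X_onehot, X_chars, X_left, X_right], Y_train,
          batch_size=50, epochs=100, shuffle=True,
          validation_data=([D_onehot, D_chars, D_left, D_right], Y_dev))
```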

4.4 General comments

All dense connections in this network take the form of a matrix multiplication and the addition of a bias (X · W + b) and make use of a 'relu' activation ('rectified linear unit'), which will ignore negative values output by the dense layer by setting them to zero.

Fig. 1 Schematic visualization of the network architecture evaluated in this article. Embedding layers are used on both sides of the focus token as 'subnets' to model the lexical context surrounding a token. The token itself is primarily represented using a convolutional subnet at the character level (and potentially also an embedding layer). The results of these subnets are concatenated into a single hidden representation, which is used to predict the correct lemma at the top layer of the network


Additionally, all dense layers make use of the so-called dropout procedure during training (with p = 0.50), meaning that in each mini-batch half of the connections in the dense layers are randomly dropped (Srivastava et al., 2014). This technique has various beneficial effects on training, the most important one being that it forces the network to be less dependent on the specific presence of certain features in the input, thereby avoiding overfitting on specific instances in the training data. An implementation of the proposed architecture is freely available from the public software repository associated with this paper.5 Where possible, this repository also holds the data used below. Our system is implemented in Python, using the keras library built on top of theano (Bastien et al., 2012); the latter library provides automatic differentiation for arbitrary graphs and allows code to be executed on the Graphics Processing Unit (GPU).6 Other major dependencies of our code include scikit-learn (Pedregosa et al., 2011) and gensim.
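
In the sketch given earlier, each dense connection would accordingly be wrapped as in the hypothetical helper below (dropout probability 0.5, as above); this is our own illustration, not a function from the released code.

```python
from tensorflow.keras import layers

def dense_block(x, units):
    """Dense connection (X.W + b), 'relu' activation, then dropout with p = 0.5."""
    x = layers.Dense(units, activation="relu")(x)
    return layers.Dropout(0.5)(x)   # randomly zeroes half of the outputs per mini-batch
```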

5 Data Sets

In this article, we report results on four different Middle Dutch data sets (see Table 2), which offer a representative example of a medieval European language (Van Kerckvoorde, 1992). We should stress that our system is fairly language-independent and could easily be retrained on other types of corpora, as long as they use a character alphabet. In this article, we first of all use two administrative data sets, representing charter collections. The CG-ADMIN contains the charter collection edited by Gysseling (1977), which has been digitized and annotated by the Institute for Dutch Lexicography (1998). These charters all survive in originals predating ca. 1300 AD. Secondly, we use the CRM-ADELHEID collection, a comparable collection of fourteenth-century Middle Dutch charters, which has been the subject of a study comparable to ours (Van Halteren and Rem, 2013). As literary materials, we first of all use the literary counterpart of CG-ADMIN in the Corpus-Gysseling, a collection of Middle Dutch literary texts that all survive in manuscript copies predating ca. 1300 AD (Gysseling, 1980-1987; Institute for Dutch Lexicography, 1998; Kestemont et al., 2010).

Thirdly, we also use a newly created, smaller corpus of Middle Dutch religious texts (e.g. sermons, bible translations, visions ...), most of which are at least semi-literary in nature. This last data set (RELIG) is contained in our code repository.

These data sets have all been annotated using common annotation guidelines, although they each show a number of important idiosyncrasies. Table 4 shows an example of the data. Generally speaking, we adopt the annotation practice which has been developed by Piet van Reenen et al. and which has been carefully documented in the context of the release of the Adelheid tagger-lemmatizer for Middle Dutch charters (Van Halteren and Rem, 2013).7

Each token in the corpora has been annotated with a normalized dictionary headform or lemma in a present-day spelling. Broadly speaking, lemmatization allows us to abstract over individual instances of tokens which only differ in inflection or orthography (Knowles and Mohd Don, 2004). For historical corpora, this has interesting applications in the context of database searching, (diachronic) topic modelling, or stylometry. The lemmas used in our corpora all correspond to entries in the Integrated Language Database (GTB) for the Dutch language, which can be consulted online and which is maintained by the Institute for Dutch Lexicography.8

Table 4 Example of some training data from the RELIG corpus, from the Seven Manieren van Heilige Minnen ('Seven Ways of Holy Love') by the mystical poetess Beatrice of Nazareth ('These are seven ways of loving, which come from the highest'). The first column shows the original input token, the second the lemma label. Note that some forms require a composite lemma, e.g. uit + de for uten

Token Lemma

seuen zeven

maniren manier

sin zijn

van van

minnen minne

die die

comen komen

uten uit+de

hoegsten hoogste


The major advantage of using a present-day lemma in this context is that this allows scholars to normalize texts from multiple historic stages of Dutch to the same variant. This in turn allows interesting diachronic analyses of Dutch corpora, such as the ones currently prepared in the large-scale Nederlab project.9

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach by Van Halteren and Rem (2013), to make our splitting as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters which have a rank index ending in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
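
A minimal sketch of this date-based splitting procedure is given below; `charters` is assumed to be a list of (date, charter) pairs, and the function is ours rather than the authors' script.

```python
def split_charters(charters):
    """Assign each charter to train/dev/test by the final digit of its date rank."""
    train, dev, test = [], [], []
    for rank, (_, charter) in enumerate(sorted(charters, key=lambda c: c[0]), start=1):
        if rank % 10 == 9:        # rank index ending in 9
            dev.append(charter)
        elif rank % 10 == 0:      # rank index ending in 0
            test.append(charter)
        else:
            train.append(charter)
    return train, dev, test
```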

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that greatly vary in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00-1.00. In the development set, we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less interesting. A clear advantage of the evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions in a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of a text and retrain the system before tackling the rest of the text.
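
The corresponding item-level split for the literary corpora can be sketched in the same spirit; the function name and the normalisation step are ours.

```python
def assign_split(item_index, n_items):
    """Map a text-internal item index, normalised to 0.00-1.00, onto a data split."""
    position = item_index / n_items
    if 0.60 <= position < 0.70:
        return "dev"
    if 0.70 <= position < 0.80:
        return "test"
    return "train"
```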

7 Results

In Tables 5-8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch) which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest-neighbour search, and for unknown words we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1000 high-frequency function words to be included in the features used for the contextual disambiguation.


Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup, which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Fig. 2 Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases

Table 5 Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens

Corpus/Phase    Train    Dev
Subset          All (%)  All (%)  Known (%)  Unknown (%)
RELIG           97.63    91.26    94.67      45.44
CG-LIT          96.88    91.78    94.56      51.45
CRM-ADELHEID    98.38    93.61    96.17      56.59
CG-ADMIN        98.89    95.44    97.89      59.48


For the test results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, the RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that, for large enough corpora, the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is of course especially for the unknown words that the present optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on the CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform the performance of a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens, such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters, but leave out any contextual features, to be able to zoom in on the convolutional aspects of the model only.

Table 6 Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We included the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens

Corpus/Method   MBT                            Our system
Subset          All (%)  Known (%)  Unknown (%)  All (%)  Known (%)  Unknown (%)
RELIG           88.50    94.03      19.65        90.97    94.50      47.04
CG-LIT          88.72    93.93      15.64        91.67    94.45      52.65
CRM-ADELHEID    91.87    95.90      29.44        93.95    96.15      59.86
CG-ADMIN        88.35    95.03      22.30        90.91    95.35      46.99


The results are somewhat surprising, in that it appears that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information is able to be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013) and which are fed into the embedding layers of the network before training commences (here for the CG-LIT corpus).

We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we added colours to the word labels according to this cluster analysis as an aid in reading. This visualization strongly suggests that morpho-syntactic, orthographic, as well as semantic qualities of words are captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus ...), which clearly cluster together on the basis of semantic qualities. In light blue (top left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word-level information.

Table 7 Development results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant

Corpus/Method   Only convolutions (5,000 filters)        Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset          All (%)  Known (%)  Unknown (%)          All (%)  Known (%)  Unknown (%)
RELIG           89.45    92.69      45.97                87.59    91.67      32.83
CG-LIT          88.46    91.20      48.79                82.86    87.42      16.86
CRM-ADELHEID    89.58    92.28      50.57                87.96    91.73      33.38
CG-ADMIN        91.85    94.45      53.76                89.62    93.28      36.03

Table 8 Test results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant

Corpus/Method   Only convolutions (5,000 filters)        Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset          All (%)  Known (%)  Unknown (%)          All (%)  Known (%)  Unknown (%)
RELIG           88.95    92.38      46.38                87.37    91.53      35.55
CG-LIT          88.25    91.03      49.21                82.46    87.11      17.28
CRM-ADELHEID    89.91    92.24      53.88                88.63    92.00      36.40
CG-ADMIN        86.54    90.91      43.39                83.27    88.81      28.46


The spelling variants als and alse ('if'), for instance, are grouped together. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
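
A visualization of this kind can be sketched with scikit-learn, one of the dependencies listed above; the embedding matrix below is a random stand-in for the pretrained word2vec vectors, and the number of clusters is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering

# Stand-in for the pretrained word2vec matrix: one 150-dimensional row per word type.
emb = np.random.default_rng(0).normal(size=(200, 150))

# Project the embeddings onto two dimensions for plotting.
coords = TSNE(n_components=2, random_state=1).fit_transform(emb)

# Agglomerative clustering on top of the 2D scatterplot, used only to colour the labels.
clusters = AgglomerativeClustering(n_clusters=8).fit_predict(coords)
# each word label can now be plotted at coords[i] and coloured by clusters[i]
```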

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun 'ick' ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter consonant.

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, as well as semantic qualities of words appear to be captured by the model


Here too, the filters allow these characters, to some extent, to be detected in various slots in the filters, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, through which excellent results have recently been obtained in various tasks across NLP. Of course, it would be well feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered here), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores shown between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots

Filter ID                               Pos  1st char    2nd char    3rd char
Filter 17 ('ick')                       1    i (1.08)    k (1.06)    c (0.81)
                                        2    k (1.71)    c (1.47)    i (0.67)
                                        3    k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack', 'aak', 'aac')   1    a (0.69)    b (0.33)    l (0.30)
                                        2    a (0.61)    c (0.59)    m (0.56)
                                        3    a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)           1    u (0.41)    a (0.29)    k (0.25)
                                        2    s (0.59)    v (0.48)    u (0.47)
                                        3    f (1.40)    v (1.33)    u (0.72)


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.

Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.

Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.

Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–828.

Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edicoes Colibri, pp. 3–14.

Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edicoes Colibri, pp. 27–38.

Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.

Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.

Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.

Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 680–6.

Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based active learning for lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.

De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.

Dieleman, S. and Schrauwen, B. (2014). End-to-end learning for music audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.

Ernst-Gerlach, A. and Fuhr, N. (2006). Generating search term variants for text collections with historic spellings. In Lalmas, M., MacFarlane, A., Ruger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.

Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.

Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating manual feature engineering for part-of-speech tagging of Twitter microposts using distributed word representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing, Montreal, 2015, s.p.

Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.

Gysseling, M. (1980-1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.

Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8): 1735–80.

Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands. Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.

Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.

Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.

Ljubesic, N., Erjavec, T. and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.

Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.

Manning, C. D., Prabhaker, P. and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.

Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.

Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In Furnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edicoes Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Francais. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse lemmatization of the Dictionary of Middle Dutch (1885-1929) using pattern matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. In Actas del VI Simposio Internacional de Comunicacion Social. Santiago de Cuba: Centro de Linguistica Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes

1 An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother'), or 'modder' ('mud').

2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.

3 https://code.google.com/archive/p/word2vec

4 https://radimrehurek.com/gensim

5 https://github.com/mikekestemont/tag

6 http://keras.io

7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).

8 Online at http://gtb.inl.nl (last accessed 6 November 2014).

9 See https://www.nederlab.nl/home (last accessed 6 November 2014).

10 Based on personal communication with the authors.



An important line of research in current NLPtherefore concerns the development of modelswhich can represent words not in terms of asimple one-hot encoder but by using smoothlower-dimensional word representations Much ofthis research can be situated in the field of distribu-tional semantics being guided by the so-calledlsquoDistributional Hypothesisrsquo that words in themselvesdo not have real meaning in isolation but thatwords primarily derive meaning from the wordsthey tend to co-occur with ie their lexical context(Harris 1954 Firth 1957) While blarf is a non-existing word its use in the following sentences sug-gests that it refers to some sort of domestic animalperhaps a dog lsquoIrsquom letting the blarf outrsquo lsquoIrsquom feed-ing the blarfrsquo lsquoThe blarf ate my homeworkrsquo Thusby modelling patterns of word co-occurrences inlarge corpora research has demonstrated that vari-ous unsupervised techniques can yield numericword representations that offer useful approxima-tions of the meaning of words (Baroni and Lenci

Table 1 Illustration of the character-level representation of input tokens Dummy example with length threshold t set

at 15 and a restricted character vocabulary of size 7 for the Middle Dutch word coninghinne (lsquoqueenrsquo)

char c o n i n g h i n n e - - - -

c 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

e 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

g 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

h 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

i 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0

n 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0

o 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

M Kestemont et al

6 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

2010) The vectors which make up such word rep-resentations are also called lsquoword embeddingsrsquo be-cause they reflect how words are contextuallylsquoembeddedrsquo in corpora

The past years have witnessed a clear increase inthe interest in distributional semantics in particularin the field of deep learning for NLP Mikolov et al(2013) have introduced an influential lsquoskipgramrsquomethod to obtain word vectors commonly knownas lsquoword2vecrsquo Because the underlying method at-tempts to optimize a fairly simple training criterionit can be easily applied to vast quantities of textyielding vectors which have shown excellent per-formance in a variety of tasks In one popular ex-ample Mikolov et al (2013) even showed that ispossible to solve relatively advanced analogicalproblems applying plain vector arithmetic to theword embeddings obtained from large corporaThe result of the arithmetic expression lsquovec-tor(king)-vector(man)thornvector(woman)rsquo for in-stance yielded a vector which was closest to thatof the word lsquoqueenrsquo Other illustrative examples in-clude lsquovector(russia)thornvector(river) vector(wolga)rsquo and lsquovector(paris)-vector(france)thornvector(belgium) vector(brussels)rsquo

Follow-up studies have stressed that other tech-niques can yield similar results and that the theor-etical foundations of the word2vec algorithm stillneed some clarification (Levy and Golberg 2014)Nevertheless the fact that Mikolov et al published ahighly efficient open-source implementation of thealgorithm has greatly contributed to the state of theart in the field of word representation learning3 Anincreasing number of applications in NLP includesome form of word embedding strategy in theirpipeline as a lsquosecret saucersquo (Manning 2015) An es-sential quality of modern word embeddings is that ithas been shown that they are not restricted to se-mantic aspects of words but that they also modelmorpho-syntactic qualities of words illustrated bythe fact that purely syntactic analogies often can alsobe solved lsquovector(her)-vector(she)thornvector(he) vector(his)rsquo

The number of recent studies using word embed-dings in NLP research is vast A relevant example forthe present research includes for instance the typeof word embeddings used in the Twitter taggerT

able

2O

verv

iew

of

som

eo

fth

em

etad

ata

and

char

acte

rist

ics

of

the

Mid

dle

Du

tch

corp

ora

use

din

this

stu

dy

Nam

eC

orp

us-

Gys

seli

ng

Lit

erar

yte

xts

Co

rpu

s-G

ysse

lin

g

Ad

min

istr

ativ

ete

xts

Mis

cell

aneo

us

reli

gio

us

text

sV

anR

een

en-M

uld

er-

Ad

elh

eid

-co

rpu

s

Ab

bre

viat

ion

CG

-LIT

CG

-AD

MIN

RE

LIG

AD

EL

HE

ID

Ori

gin

Inst

itu

tefo

rD

utc

hL

exic

ogr

aph

yIn

stit

ute

for

Du

tch

Lex

ico

grap

hy

CL

iPS

Co

mp

uta

tio

nal

Lin

guis

tics

Gro

up

and

Ru

usb

roec

Inst

itu

te

Un

iver

sity

of

An

twer

p

Rad

bo

ud

Un

iver

sity

Nij

meg

enet

c

in

div

idu

alte

xts

301

574

336

792

Tex

tva

riet

yM

isce

llan

eou

sli

tera

ryte

xts

(mo

stly

rhym

ed)

wh

ich

surv

ive

fro

mo

rigi

nal

man

usc

rip

ts

dat

edb

efo

re13

00A

D

13th

cen

tury

char

ters

Mis

cell

aneo

us

reli

gio

us

text

s

(eg

B

ible

tran

slat

ion

sm

ysti

cal

visi

on

sse

rmo

ns)

span

nin

gm

ult

iple

cen

turi

esan

dre

gio

ns

14th

cen

tury

char

ters

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 7 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

described by Godin et al (2014) but the literatureabounds in other examples In our system we traina vanilla word2vec model using a popular imple-mentation of the original skipgram model4 Thismodel learns to project the tokens in our trainingdata in a vector space consisting of 150 dimensionsThe general intuition underlying the embeddingssubnet in our architecture is that the underlyingskipgram model will have learned similar embed-dings for words that have certain similar qualitiesboth at the semantic and morpho-syntactic levelmdashin the lsquocatrsquo example above we might expect that amodel would learn similar embeddings for thelsquowhitersquo or lsquoblackrsquo because they are both adjectivesreferring to colours Additionally an interestingaspect for our medieval data is that the wordembeddings learned might also be sensitive toorthographic or dialectical variation providingsimilar embeddings from words that are spellingvariants of each other or words that are typical ofspecific dialects

For the subnets representing the lexical context on either side, we first construct a simple one-hot encoding of the context tokens, with a traditional minimum frequency-based token cut-off of two occurrences. We then project these tokens onto a standard dense layer which has the same dimensionality as our skipgram model; here and elsewhere, this takes the form of a simple matrix multiplication, the addition of a bias, and a nonlinearity (a 'rectified linear unit'; Nair and Hinton, 2010). Importantly, this implies that we can initialize the weights of this layer using the pretrained embeddings from the skipgram model, but that these weights can be further refined during the training phase (see below). This initialization strategy can be expected to speed up the convergence of the model. We simply concatenate the dense vectors obtained for each context word on either side of the target token into a left-context subnet and a right-context subnet. The number of context words is of course a hyper-parameter which can be further tuned.
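The sketch below illustrates one way to set up such a context subnet in a modern tf.keras idiom (not the original keras/theano code); the vocabulary size, the number of context positions, and the random stand-in for the pretrained skipgram matrix are assumptions:

```python
# Hedged sketch of a context subnet: one-hot context tokens projected through a
# dense layer whose weights are initialized with the pretrained skipgram matrix.
import numpy as np
import tensorflow as tf

vocab_size, emb_dim, n_ctx = 5000, 150, 2                              # assumed sizes
w2v_weights = np.random.randn(vocab_size, emb_dim).astype("float32")   # stand-in for real vectors

ctx_in = tf.keras.Input(shape=(n_ctx, vocab_size))     # one-hot encoded context tokens
proj_layer = tf.keras.layers.Dense(emb_dim, activation="relu")
proj = proj_layer(ctx_in)                              # one 150-d vector per context position
ctx_vec = tf.keras.layers.Flatten()(proj)              # concatenation over the context window
ctx_subnet = tf.keras.Model(ctx_in, ctx_vec)

# Initialize the projection with the pretrained matrix; it remains trainable afterwards.
proj_layer.set_weights([w2v_weights, np.zeros(emb_dim, dtype="float32")])
```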

4.3 Combining the subnets
In our system (see Figure 1 for an overview), we again use a networked strategy to combine the convolutional subnet used to represent the focus token with the subnets representing the left and right context of the focus token. Next, we feed the output of the convolutional net forward in the network, using a plain dense layer that has a dimensionality of size k = 1024. At this point, the concatenated embeddings of the left and right context subnets have also been propagated to two vectors of size k. We can also include a one-hot encoding of the target tokens, which is propagated through a dense layer of size k.

Table 3. Tabular overview of some of the main token counts etc. of the Middle Dutch corpora used in this study. The 'upperbound unknown' column refers to the proportion of tokens that have lemmata which also occur in the corresponding training data set; this is the theoretically maximal score which a lemmatizer could achieve for the unknown tokens.

Corpus         Split   Number of words   Unique tokens   Unique lemmas   Tokens per lemma   Perc. unknown tokens   Upperbound unknown
relig          train   129144            14668           6103            2.65               NA                     NA
relig          dev     16143             3686            2031            1.93               6.93                   68.78
relig          test    16143             3767            1999            2.01               7.44                   72.86
crm-adelheid   train   575721            41668           11163           4.1                NA                     NA
crm-adelheid   dev     165883            17466           5547            3.42               6.46                   78.05
crm-adelheid   test    80015             11645           4057            3.1                6.07                   79.07
cg-lit         train   464706            47460           15454           3.38               NA                     NA
cg-lit         dev     58661             11557           5004            2.48               6.46                   74.38
cg-lit         test    58903             11692           5001            2.51               6.65                   74.94
cg-admin       train   518017            42694           13129           3.42               NA                     NA
cg-admin       dev     54290             8753            3527            2.57               6.39                   74.76
cg-admin       test    65936             11318           4954            2.39               9.18                   64.31


This final layer might be important to have the network obtain a good fit of known tokens; below, we will explore whether the convolutional layer alone has sufficient capacity to reliably represent the entire vocabulary of training tokens.

At this point, we have four subnets in the network, each of dimensionality k: one resulting from the one-hot encoding of the input token, one resulting from the convolutional encoding of the input token, and two resulting from the embeddings subnets for the left and right context. Finally, we concatenate the output of these subnets into a vector of size 4k and project it onto an output layer which has a unique output neuron for each class in the sequence modelling task. A standard 'softmax' normalization is applied to normalize the activations of these output neurons as if they were probabilities (summing to 1). All the parameter weights in this architecture are first initialized using a random distribution, except for the contextual subnets, which are initialized using the word2vec weights. We train the neural network using gradient descent. Briefly put, this learning algorithm starts by predicting a batch of the training examples using the randomly initialized weights. It then calculates the prediction loss which this results in with respect to the correct labels, which will initially be extremely high. It then determines how each weight individually contributed to this loss and subsequently applies a small update to each weight which would have decreased the prediction error it caused.
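A hedged sketch of this combination step is given below (modern tf.keras, not the original keras/theano implementation; all sizes except k = 1024 are illustrative assumptions, and far fewer convolutional filters are used than in the article):

```python
# Hedged sketch: four subnets of size k concatenated into a 4k vector and
# projected onto a softmax output layer with one neuron per lemma class.
import tensorflow as tf

k, n_lemmas = 1024, 5000             # assumed number of lemma classes
max_len, n_chars = 15, 40            # assumed character-matrix dimensions
vocab_size, ctx_dim = 5000, 300      # assumed token vocabulary and context-vector sizes

char_in = tf.keras.Input(shape=(max_len, n_chars), name="char_onehot")
conv = tf.keras.layers.Conv1D(100, kernel_size=3, activation="relu")(char_in)
conv = tf.keras.layers.Flatten()(conv)
conv_subnet = tf.keras.layers.Dense(k, activation="relu")(conv)        # convolutional encoding

tok_in = tf.keras.Input(shape=(vocab_size,), name="token_onehot")
tok_subnet = tf.keras.layers.Dense(k, activation="relu")(tok_in)       # one-hot focus token

left_in = tf.keras.Input(shape=(ctx_dim,), name="left_context")
left_subnet = tf.keras.layers.Dense(k, activation="relu")(left_in)     # left-context embeddings

right_in = tf.keras.Input(shape=(ctx_dim,), name="right_context")
right_subnet = tf.keras.layers.Dense(k, activation="relu")(right_in)   # right-context embeddings

merged = tf.keras.layers.Concatenate()(
    [conv_subnet, tok_subnet, left_subnet, right_subnet])              # vector of size 4k
output = tf.keras.layers.Dense(n_lemmas, activation="softmax")(merged) # one neuron per lemma
model = tf.keras.Model([char_in, tok_in, left_in, right_in], output)
```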

During a fixed number of epochs (100 in our case), we feed the training data through this network, divided into a number of randomly jumbled mini-batches, each time tightening the fit of the network with respect to the correct class labels. During training (see Figure 2), the loss of the network is not measured in terms of accuracy, but in terms of cross-entropy, a common loss measure for classification tasks. We adopt a mini-batch size of 50; this is a fairly small batch size compared to the literature, but this restricted batch size proved necessary to prevent the higher-frequency items from dominating the gradient updates. We use the adaptive, so-called Adadelta optimization mechanism (Zeiler, 2012). Importantly, the Adadelta routine removes the need to empirically set a learning rate at the beginning of training, since the algorithm will independently re-adapt this rate, keeping track of the loss history on a per-weight basis.
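The corresponding training call might look as follows (a sketch reusing the model from the previous example; the randomly generated arrays merely stand in for the prepared training data):

```python
# Hedged sketch of the training regime: cross-entropy loss, Adadelta updates,
# 100 epochs and shuffled mini-batches of 50 instances.
import numpy as np
import tensorflow as tf

n = 200                                                   # dummy number of training instances
X_chars  = np.random.rand(n, 15, 40)
X_tokens = np.random.rand(n, 5000)
X_left   = np.random.rand(n, 300)
X_right  = np.random.rand(n, 300)
y_lemmas = tf.keras.utils.to_categorical(np.random.randint(0, 5000, size=n), 5000)

model.compile(
    optimizer=tf.keras.optimizers.Adadelta(),             # per-weight adaptive learning rates
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit([X_chars, X_tokens, X_left, X_right], y_lemmas,
          epochs=100, batch_size=50, shuffle=True)
```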

Fig. 1. Schematic visualization of the network architecture evaluated in this article. Embedding layers are used on both sides of the focus token as 'subnets' to model the lexical context surrounding a token. The token itself is primarily represented using a convolutional subnet at the character level (and potentially also an embedding layer). The results of these subnets are concatenated into a single hidden representation, which is used to predict the correct lemma at the top layer of the network.

4.4 General comments
All dense connections in this network take the form of a matrix multiplication and the addition of a bias (X·W + b) and make use of a 'relu' activation ('rectified linear unit'), which sets negative values output by the dense layer to zero. Additionally, all dense layers make use of the so-called dropout procedure during training (with p = 0.50), meaning that in each mini-batch half of the connections in the dense layers are randomly dropped (Srivastava et al., 2014). This technique has various beneficial effects on training, the most important one being that it forces the network to be less dependent on the specific presence of certain features in the input, thereby avoiding overfitting on specific instances in the training data. An implementation of the proposed architecture is freely available from the public software repository associated with this paper.5 Where possible, this repository also holds the data used below. Our system is implemented in Python, using the keras library built on top of theano (Bastien et al., 2012); the latter library provides automatic differentiation for arbitrary graphs and allows code to be executed on the Graphics Processing Unit (GPU).6 Other major dependencies of our code include scikit-learn (Pedregosa et al., 2011) and gensim.
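As a small illustration (again a sketch in modern tf.keras rather than the original code), the dense building block described here amounts to:

```python
# Hedged sketch of the dense building block: X·W + b, relu, dropout with p = 0.5.
import tensorflow as tf

def dense_block(units: int) -> tf.keras.Sequential:
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu"),   # matrix multiplication + bias + relu
        tf.keras.layers.Dropout(0.5),                      # half the connections dropped per mini-batch
    ])
```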

5 Data Sets

In this article, we report results on four different Middle Dutch data sets (see Table 2), which offer a representative example of a medieval European language (Van Kerckvoorde, 1992). We should stress that our system is fairly language-independent and could easily be retrained on other types of corpora, as long as they use a character alphabet. In this article, we first of all use two administrative data sets representing charter collections. The CG-ADMIN contains the charter collection edited by Gysseling (1977), which has been digitized and annotated by the Institute for Dutch Lexicography (1998). These charters all survive in originals predating ca. 1300 AD. Secondly, we use the CRM-ADELHEID collection, a comparable collection of fourteenth-century Middle Dutch charters, which has been the subject of a study comparable to ours (Van Halteren and Rem, 2013). As literary materials, we first of all use the literary counterpart of CG-ADMIN in the Corpus-Gysseling, a collection of Middle Dutch literary texts that all survive in manuscript copies predating ca. 1300 AD (Gysseling, 1980-1987; Institute for Dutch Lexicography, 1998; Kestemont et al., 2010).

Thirdly, we also use a newly created, smaller corpus of Middle Dutch religious texts (e.g. sermons, Bible translations, visions, ...), most of which are at least semi-literary in nature. This last data set (RELIG) is contained in our code repository.

These data sets have all been annotated using common annotation guidelines, although they each show a number of important idiosyncrasies. Table 4 shows an example of the data. Generally speaking, we adopt the annotation practice which has been developed by Piet van Reenen et al. and which has been carefully documented in the context of the release of the Adelheid tagger-lemmatizer for Middle Dutch charters (Van Halteren and Rem, 2013).7

Each token in the corpora has been annotated with a normalized dictionary headform or lemma in a present-day spelling. Broadly speaking, lemmatization allows us to abstract over individual instances of tokens which only differ in inflection or orthography (Knowles and Mohd Don, 2004). For historical corpora, this has interesting applications in the context of database searching, (diachronic) topic modelling, or stylometry. The lemmas used in our corpora all correspond to entries in the Integrated Language Database (GTB) for the Dutch language, which can be consulted online and which is maintained by the Institute for Dutch Lexicography.8

Table 4. Example of some training data from the RELIG corpus, from the Seven Manieren van Heilige Minnen ('Seven Ways of Holy Love') by the mystical poetess Beatrice of Nazareth ('These are seven ways of loving, which come from the highest'). The first column shows the original input token, the second the lemma label. Note that some forms require a composite lemma (e.g. uit + de for uten).

Token Lemma

seuen zeven

maniren manier

sin zijn

van van

minnen minne

die die

comen komen

uten uit+de

hoegsten hoogste


The major advantage of using a present-day lemma in this context is that it allows scholars to normalize texts from multiple historic stages of Dutch to the same variant. This, in turn, allows interesting diachronic analyses of Dutch corpora, such as the ones currently prepared in the large-scale Nederlab project.9

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach of Van Halteren and Rem (2013), to make our splitting as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters which have a rank index ending in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
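A sketch of this splitting heuristic (the charter list and its date field are hypothetical):

```python
# Hedged sketch of the rank-based split for the charter corpora.
def split_charters(charters):
    """charters: list of (date, charter) pairs; returns train, dev, test lists."""
    train, dev, test = [], [], []
    for rank, (_, charter) in enumerate(sorted(charters, key=lambda c: c[0]), start=1):
        if rank % 10 == 9:        # rank index ending in 9: development set
            dev.append(charter)
        elif rank % 10 == 0:      # rank index ending in 0: held-out test set
            test.append(charter)
        else:
            train.append(charter)
    return train, dev, test
```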

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that greatly vary in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00-1.00. In the development set, we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less attractive. A clear advantage of this evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions of a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of a text and retrain the system before tackling the rest of that text.
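A sketch of this index-based split (applied per text; the item list is hypothetical):

```python
# Hedged sketch of the normalized-index split for the literary corpora.
def split_literary(items):
    train, dev, test = [], [], []
    n = len(items)
    for i, item in enumerate(items):
        pos = i / max(n - 1, 1)      # normalized index in the range 0.00-1.00
        bucket = int(pos * 10)       # decile of the normalized index
        if bucket == 6:              # items with an index between 0.60 and 0.70
            dev.append(item)
        elif bucket == 7:            # items with an index between 0.70 and 0.80
            test.append(item)
        else:
            train.append(item)
    return train, dev, test
```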

7 Results

In Tables 5-8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch), which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest-neighbour search, and for unknown words we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1,000 high-frequency function words to be included in the features used for the contextual disambiguation.


Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup, which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases. For the test results, we also include the results for MBT.

Fig. 2. Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Table 5. Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

                 Train             Dev
Corpus           All (%)           All (%)   Known (%)   Unknown (%)
RELIG            97.63             91.26     94.67       45.44
CG-LIT           96.88             91.78     94.56       51.45
CRM-ADELHEID     98.38             93.61     96.17       56.59
CG-ADMIN         98.89             95.44     97.89       59.48


As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity of text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that, for large enough corpora, the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is especially for the unknown words that the present, optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens, such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters, but leave out any contextual features, to be able to zoom in on the convolutional aspects of the model only.

Table 6. Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We include the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens.

                 MBT                                     Our system
Corpus           All (%)   Known (%)   Unknown (%)       All (%)   Known (%)   Unknown (%)
RELIG            88.50     94.03       19.65             90.97     94.50       47.04
CG-LIT           88.72     93.93       15.64             91.67     94.45       52.65
CRM-ADELHEID     91.87     95.90       29.44             93.95     96.15       59.86
CG-ADMIN         88.35     95.03       22.30             90.91     95.35       46.99


The results are somewhat surprising, in that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information can be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the networks before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we have added colours to the word labels according to this cluster analysis, as an aid in reading. This visualization strongly suggests that morpho-syntactic, orthographic, as well as semantic qualities of words are captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light-blue (top-left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word-level information: the spelling variants als and alse ('if'), for instance, are grouped.

Table 7. Development results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

                 Only convolutions (5,000 filters)       Convolutions (5,000 filters) and a one-hot encoding of the focus token
Corpus           All (%)   Known (%)   Unknown (%)       All (%)   Known (%)   Unknown (%)
RELIG            89.45     92.69       45.97             87.59     91.67       32.83
CG-LIT           88.46     91.20       48.79             82.86     87.42       16.86
CRM-ADELHEID     89.58     92.28       50.57             87.96     91.73       33.38
CG-ADMIN         91.85     94.45       53.76             89.62     93.28       36.03

Table 8. Test results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

                 Only convolutions (5,000 filters)       Convolutions (5,000 filters) and a one-hot encoding of the focus token
Corpus           All (%)   Known (%)   Unknown (%)       All (%)   Known (%)   Unknown (%)
RELIG            88.95     92.38       46.38             87.37     91.53       35.55
CG-LIT           88.25     91.03       49.21             82.46     87.11       17.28
CRM-ADELHEID     89.91     92.24       53.88             88.63     92.00       36.40
CG-ADMIN         86.54     90.91       43.39             83.27     88.81       28.46


At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light-blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
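Such a visualization can be reproduced along the following lines (a sketch, not the original plotting code; it assumes the hypothetical embedding file saved earlier and omits the agglomerative colouring):

```python
# Hedged sketch of the t-SNE visualization of pretrained word embeddings.
import matplotlib.pyplot as plt
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

wv = KeyedVectors.load("middle_dutch_embeddings.kv")     # hypothetical pretrained vectors
words = list(wv.index_to_key[:500])                      # a manageable set of frequent words
coords = TSNE(n_components=2, random_state=1).fit_transform(wv[words])

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=6)
plt.savefig("embeddings_tsne.png", dpi=300)
```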

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun 'ick' ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter consonant.

Fig. 3. Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, and semantic qualities of words all appear to be captured by the model.


Here too, the filters allow these characters, to some extent, to be detected in various slots of the filters, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.
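The inspection itself is straightforward to reproduce: for every slot of a chosen filter, one simply ranks the characters by the weight their input channel receives. The sketch below assumes a trained Conv1D layer over one-hot character matrices (as in the architecture sketch above) and an indexed character alphabet:

```python
# Hedged sketch of the filter inspection: per slot, the k characters with the
# highest weight in a given convolutional filter.
import numpy as np

def top_characters(conv_layer, alphabet, filter_id, k=3):
    kernel = conv_layer.get_weights()[0]        # shape: (slots, n_chars, n_filters)
    for slot in range(kernel.shape[0]):
        weights = kernel[slot, :, filter_id]
        best = np.argsort(weights)[::-1][:k]
        chars = ", ".join(f"{alphabet[i]} ({weights[i]:.2f})" for i in best)
        print(f"slot {slot + 1}: {chars}")
```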

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, with which excellent results have recently been obtained in various tasks across NLP. Of course, it would be feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9. Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores shown between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                                 Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                         1     i (1.08)    k (1.06)    c (0.81)
                                          2     k (1.71)    c (1.47)    i (0.67)
                                          3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack', 'aak', 'aac', ...)  1     a (0.69)    b (0.33)    l (0.30)
                                          2     a (0.61)    c (0.59)    m (0.56)
                                          3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)             1     u (0.41)    a (0.29)    k (0.25)
                                          2     s (0.59)    v (0.48)    u (0.47)
                                          3     f (1.40)    v (1.33)    u (0.72)


References
Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673-721.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.
Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770-9.
Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137-55.
Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798-1828.
Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edicoes Colibri, pp. 3-14.
Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edicoes Colibri, pp. 27-38.
Cahieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.
Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.
Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation. Marrakech: European Language Resources Association, pp. 2362-7.
Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, MD: Association for Computational Linguistics, pp. 680-6.
Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based active learning for lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65-70.
De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303-18.
Dieleman, S. and Schrauwen, B. (2014). End-to-end learning for music audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964-8.
Ernst-Gerlach, A. and Fuhr, N. (2006). Generating search term variants for text collections with historic spellings. In Lalmas, M., MacFarlane, A., Ruger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49-60.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1-32.
Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating manual feature engineering for part-of-speech tagging of Twitter microposts using distributed word representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, s.p.
Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.
Gysseling, M. (1980-1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.
Harris, Z. (1954). Distributional structure. Word, 10(23): 146-62.
Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65-76.


Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8): 1735-80.
Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands. Woordenboek en Teksten. The Hague: Sdu.
Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.
Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655-65.
Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words - memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287-301.
Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69-81.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.
LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436-44.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707-10.
Levy, O. and Golberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171-80.
Ljubesic, N., Erjavec, T. and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, 8404. Berlin: Springer, pp. 164-75.
Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043-64.
Manning, C. D., Prabhaker, P. and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.
Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701-7.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111-9.
Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29-34.
Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In Furnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807-14.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825-30.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.
Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edicoes Colibri, pp. 87-98.
Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia: Association for Computational Linguistics, pp. 58-62.
Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1-22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Francais. Traitement Automatique des Langues, 50(2): 149-72.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929-58.
Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173-80.
Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch - author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345-62.
Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579-605.
Van der Voort van der Kleij, J. (2005). Reverse lemmatization of the Dictionary of Middle Dutch (1885-1929) using pattern matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203-10.
Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233-59.
Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.
Zavrel, J. and Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. In Actas del VI Simposio Internacional de Comunicacion Social. Santiago de Cuba: Centro de Linguistica Aplicada, pp. 590-7.
Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.
Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes
1 An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.




The past years have witnessed a clear increase inthe interest in distributional semantics in particularin the field of deep learning for NLP Mikolov et al(2013) have introduced an influential lsquoskipgramrsquomethod to obtain word vectors commonly knownas lsquoword2vecrsquo Because the underlying method at-tempts to optimize a fairly simple training criterionit can be easily applied to vast quantities of textyielding vectors which have shown excellent per-formance in a variety of tasks In one popular ex-ample Mikolov et al (2013) even showed that ispossible to solve relatively advanced analogicalproblems applying plain vector arithmetic to theword embeddings obtained from large corporaThe result of the arithmetic expression lsquovec-tor(king)-vector(man)thornvector(woman)rsquo for in-stance yielded a vector which was closest to thatof the word lsquoqueenrsquo Other illustrative examples in-clude lsquovector(russia)thornvector(river) vector(wolga)rsquo and lsquovector(paris)-vector(france)thornvector(belgium) vector(brussels)rsquo

Follow-up studies have stressed that other tech-niques can yield similar results and that the theor-etical foundations of the word2vec algorithm stillneed some clarification (Levy and Golberg 2014)Nevertheless the fact that Mikolov et al published ahighly efficient open-source implementation of thealgorithm has greatly contributed to the state of theart in the field of word representation learning3 Anincreasing number of applications in NLP includesome form of word embedding strategy in theirpipeline as a lsquosecret saucersquo (Manning 2015) An es-sential quality of modern word embeddings is that ithas been shown that they are not restricted to se-mantic aspects of words but that they also modelmorpho-syntactic qualities of words illustrated bythe fact that purely syntactic analogies often can alsobe solved lsquovector(her)-vector(she)thornvector(he) vector(his)rsquo

The number of recent studies using word embed-dings in NLP research is vast A relevant example forthe present research includes for instance the typeof word embeddings used in the Twitter taggerT

able

2O

verv

iew

of

som

eo

fth

em

etad

ata

and

char

acte

rist

ics

of

the

Mid

dle

Du

tch

corp

ora

use

din

this

stu

dy

Nam

eC

orp

us-

Gys

seli

ng

Lit

erar

yte

xts

Co

rpu

s-G

ysse

lin

g

Ad

min

istr

ativ

ete

xts

Mis

cell

aneo

us

reli

gio

us

text

sV

anR

een

en-M

uld

er-

Ad

elh

eid

-co

rpu

s

Ab

bre

viat

ion

CG

-LIT

CG

-AD

MIN

RE

LIG

AD

EL

HE

ID

Ori

gin

Inst

itu

tefo

rD

utc

hL

exic

ogr

aph

yIn

stit

ute

for

Du

tch

Lex

ico

grap

hy

CL

iPS

Co

mp

uta

tio

nal

Lin

guis

tics

Gro

up

and

Ru

usb

roec

Inst

itu

te

Un

iver

sity

of

An

twer

p

Rad

bo

ud

Un

iver

sity

Nij

meg

enet

c

in

div

idu

alte

xts

301

574

336

792

Tex

tva

riet

yM

isce

llan

eou

sli

tera

ryte

xts

(mo

stly

rhym

ed)

wh

ich

surv

ive

fro

mo

rigi

nal

man

usc

rip

ts

dat

edb

efo

re13

00A

D

13th

cen

tury

char

ters

Mis

cell

aneo

us

reli

gio

us

text

s

(eg

B

ible

tran

slat

ion

sm

ysti

cal

visi

on

sse

rmo

ns)

span

nin

gm

ult

iple

cen

turi

esan

dre

gio

ns

14th

cen

tury

char

ters

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 7 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

described by Godin et al (2014) but the literatureabounds in other examples In our system we traina vanilla word2vec model using a popular imple-mentation of the original skipgram model4 Thismodel learns to project the tokens in our trainingdata in a vector space consisting of 150 dimensionsThe general intuition underlying the embeddingssubnet in our architecture is that the underlyingskipgram model will have learned similar embed-dings for words that have certain similar qualitiesboth at the semantic and morpho-syntactic levelmdashin the lsquocatrsquo example above we might expect that amodel would learn similar embeddings for thelsquowhitersquo or lsquoblackrsquo because they are both adjectivesreferring to colours Additionally an interestingaspect for our medieval data is that the wordembeddings learned might also be sensitive toorthographic or dialectical variation providingsimilar embeddings from words that are spellingvariants of each other or words that are typical ofspecific dialects

For the subnets representing the lexical contexton either side we first construct a simple one-hotencoding of the context tokens with a traditionalminimum frequency-based token cut-off of two oc-currences We then project these tokens onto astandard dense layer which has the same dimension-ality as our skipgram modelmdashhere and elsewhere

taking the form of a simple matrix multiplicationthe addition of a bias and a nonlinearity (a lsquorecti-fied linear unitrsquo Nair and Hinton 2010)Importantly this implies that we can initialize theweights of this layer using the pretrained embed-dings from the skipgram model but that theseweights can be further refined during the trainingphase (see below) This initialization strategy can beexpected to speed up the convergence of the modelWe simply concatenate the dense vectors obtainedfor each context word on either side of the targettoken into a left-context subnet and a right contextsubnet The number of context words is of course ahyper-parameter which can be further tuned

43 Combining the subnetsIn our system (see Figure 1 for an overview) weagain use a networked strategy to combine the con-volutional subnet used to represent the focus tokenwith the subnets representing the left and right con-text of the focus token Next we feed the output ofthe convolutional net forward in the network usinga plain dense layer that has a dimensionality of sizek frac14 1024) At this point the concatenated embed-dings of the left and right context subnets have alsobeen propagated to two vectors of size k We canalso include a one-hot encoding of the targettokens which is propagated through a dense layer

Table 3 Tabular overview of some the main token counts etc of the Middle Dutch corpora used in this study The

lsquoupperbound unknownrsquo column refers the proportion of tokens that have lemmata which also occur in the corres-

ponding training data set this is the theoretically maximal score which a lemmatizer could achieve for the unknown

tokens

Corpus Split Number

of words

Unique

tokens

Unique

lemmas

Number of tokens

per lemma

Perc unknown

tokens

Upperbound

unknown

relig train 129144 14668 6103 265 NA NA

dev 16143 3686 2031 193 693 6878

test 16143 3767 1999 201 744 7286

crm-adelheid train 575721 41668 11163 41 NA NA

dev 165883 17466 5547 342 646 7805

test 80015 11645 4057 31 607 7907

cg-lit train 464706 47460 15454 338 NA NA

dev 58661 11557 5004 248 646 7438

test 58903 11692 5001 251 665 7494

cg-admin train 518017 42694 13129 342 NA NA

dev 54290 8753 3527 257 639 7476

test 65936 11318 4954 239 918 6431

M Kestemont et al

8 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

of size k This final layer might be important to havethe network obtain a good fit of known tokens belowwe will explore whether the convolutional layer alonehas sufficient capacity to reliably represent the entirevocabulary of training tokens

At this point we have four subnets in the networkeach of dimensionality k one resulting from the one-hot encoding of the input token one resulting fromthe convolutional encoding of the input token andtwo resulting from the embeddings subnets for theleft and right context Finally we concatenate theoutput of these subnets into a vector of size 4 kand project it onto an output layer which for eachclass in the sequence modelling task has a uniqueoutput neuron A standard lsquosoftmaxrsquo normalizationis applied to normalize the activations of theseoutput neurons as if they were probabilities (sum-ming to 1) All the parameter weights in this architec-ture are first initialized using a random distributionexcept for the contextual subnets which are initializedusing the word2vec weights We train the neural net-work using gradient descent Briefly put this learningalgorithm will start predicting a batch of the trainingexamples using the randomly initialized weights It willthen calculate the prediction loss in which this resultswith respect to the correct labels which will initially beextremely high Then it will determine how eachweight individually contributed to this loss and sub-sequently apply a small update to this weight which

would have decreased the prediction error which itcaused

During a fixed number of epochs (100 in ourcase) we feed the training data through this net-work divided in a number of randomly jumbledmini-batches each time tightening the fit of thenetwork with respect to the correct class labelsDuring training (see Figure 2) the loss of the net-work is not measured in terms of accuracy but interms of cross-entropy a common loss measure forclassification tasks We adopt a mini-batch size of50 this is a fairly small batch size compared to theliterature but this restricted batch size proved ne-cessary to avoid the higher-frequency items fromdominating the gradient updates We use the adap-tive so-called Adadelta optimization mechanism(Zeiler 2012) Importantly the Adadelta routineprevents the need to empirically set a learning rateat the beginning of training since the algorithm willindependently re-adapt this rate keeping track ofthe loss history on a per-weight basis

4.4 General comments

All dense connections in this network take the form of a matrix multiplication and the addition of a bias (X · W + b), and make use of a 'relu' activation ('rectified linear unit'), which sets negative values outputted by the dense layer to zero. Additionally, all dense layers make use of the so-called dropout procedure during training (with p = 0.50), meaning that in each mini-batch half of the connections in the dense layers are randomly dropped (Srivastava et al., 2014). This technique has various beneficial effects on training, the most important one being that it forces the network to be less dependent on the specific presence of certain features in the input, thereby avoiding overfitting on specific instances in the training data. An implementation of the proposed architecture is freely available from the public software repository associated with this paper.5 Where possible, this repository also holds the data used below. Our system is implemented in Python, using the keras library built on top of theano (Bastien et al., 2012); the latter library provides automatic differentiation for arbitrary graphs and allows code to be executed on the Graphics Processing Unit (GPU).6 Other major dependencies of our code include scikit-learn (Pedregosa et al., 2011) and gensim.

5 Data Sets

In this article, we report results on four different Middle Dutch data sets (see Table 2), which offer a representative example of a medieval European language (Van Kerckvoorde, 1992). We should stress that our system is fairly language-independent and could easily be retrained on other types of corpora, as long as they use a character alphabet. In this article, we first of all use two administrative data sets representing charter collections. The CG-ADMIN contains the charter collection edited by Gysseling (1977), which has been digitized and annotated by the Institute for Dutch Lexicography (1998). These charters all survive in originals predating ca. 1300 AD. Secondly, we use the CRM-ADELHEID collection, a comparable collection of fourteenth-century Middle Dutch charters, which has been the subject of a study comparable to ours (Van Halteren and Rem, 2013). As literary materials, we first of all use the literary counterpart of CG-ADMIN in the Corpus-Gysseling, a collection of Middle Dutch literary texts that all survive in manuscript copies predating ca. 1300 AD (Gysseling, 1980-1987; Institute for Dutch Lexicography, 1998; Kestemont et al., 2010).

Thirdly, we also use a newly created, smaller corpus of Middle Dutch religious texts (e.g. sermons, bible translations, visions, ...), most of which are at least semi-literary in nature. This last data set (RELIG) is contained in our code repository.

These data sets have all been annotated using common annotation guidelines, although they each show a number of important idiosyncrasies. Table 4 shows an example of the data. Generally speaking, we adopt the annotation practice which has been developed by Piet van Reenen et al. and which has been carefully documented in the context of the release of the Adelheid tagger-lemmatizer for Middle Dutch charters (Van Halteren and Rem, 2013).7

Each token in the corpora has been annotated with a normalized dictionary headform or lemma in a present-day spelling. Broadly speaking, lemmatization allows us to abstract over individual instances of tokens which only differ in inflection or orthography (Knowles and Mohd Don, 2004). For historical corpora, this has interesting applications in the context of database searching, (diachronic) topic modelling, or stylometry. The lemmas used in our corpora all correspond to entries in the Integrated Language Database (GTB) for the Dutch language, which can be consulted online and which is maintained by the Institute for Dutch Lexicography.8 The major advantage of using a present-day lemma in this context is that it allows scholars to normalize texts from multiple historic stages of Dutch to the same variant. This in turn allows interesting diachronic analyses of Dutch corpora, such as the ones currently prepared in the large-scale Nederlab project.9

Table 4 Example of some training data from the RELIG corpus, from the Seven Manieren van Heilige Minnen ('Seven Ways of Holy Love') by the mystical poetess Beatrice of Nazareth ('These are seven ways of loving, which come from the highest'). The first column shows the original input token, the second the lemma label. Note that some forms require a composite lemma, e.g. uit + de for uten.

Token      Lemma
seuen      zeven
maniren    manier
sin        zijn
van        van
minnen     minne
die        die
comen      komen
uten       uit+de
hoegsten   hoogste

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.
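As an illustration of this evaluation scheme, the following sketch (our own, with hypothetical variable names) computes the overall, known-word, and unknown-word accuracies, where 'known' means that the test token also occurs in the training data.

```python
# Hedged sketch of the accuracy breakdown used in the evaluation.
def lemmatization_accuracies(test_tokens, gold_lemmas, pred_lemmas, train_vocab):
    known_hits = known_total = unknown_hits = unknown_total = 0
    for tok, gold, pred in zip(test_tokens, gold_lemmas, pred_lemmas):
        if tok in train_vocab:          # token was seen during training
            known_total += 1
            known_hits += (gold == pred)
        else:                           # unseen ('unknown') token
            unknown_total += 1
            unknown_hits += (gold == pred)
    total = known_total + unknown_total
    return {'all': (known_hits + unknown_hits) / total,
            'known': known_hits / known_total,
            'unknown': unknown_hits / unknown_total}
```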

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach of Van Halteren and Rem (2013) to make our split as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters whose rank index ends in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
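A minimal sketch of this rank-based split, assuming a list of charter objects carrying a date attribute (our naming, not the authors' code):

```python
# Hedged sketch: sort charters by date and assign dev/test by the final digit
# of each charter's rank (9 -> development, 0 -> test, rest -> training).
charters = sorted(charters, key=lambda c: c.date)
train, dev, test = [], [], []
for rank, charter in enumerate(charters, start=1):
    if rank % 10 == 9:
        dev.append(charter)
    elif rank % 10 == 0:
        test.append(charter)
    else:
        train.append(charter)
```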

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that vary greatly in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00-1.00. In the development set we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less attractive. A clear advantage of this evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions of a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of a text and retrain the system before tackling the rest of that text.
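The positional split for the literary corpora can be sketched analogously (again with our own, hypothetical names):

```python
# Hedged sketch: each item gets a normalized position in [0.0, 1.0];
# [0.6, 0.7) goes to development, [0.7, 0.8) to test, the rest to training.
train, dev, test = [], [], []
n = len(items)
for i, item in enumerate(items):
    pos = i / (n - 1)                 # normalized position in the corpus
    if 0.6 <= pos < 0.7:
        dev.append(item)
    elif 0.7 <= pos < 0.8:
        test.append(item)
    else:
        train.append(item)
```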

7 Results

In Tables 5–8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch), which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest-neighbour search, and for unknown words we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1,000 high-frequency function words to be included in the features used for the contextual disambiguation.

Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases. For the test results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity of text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known-word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown-word accuracies are much lower, ranging from 45.44% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that for large enough corpora the present system can accurately lemmatize over half of the unknown words in unseen material.

Fig. 2 Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Table 5 Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus          Train: All (%)    Dev: All (%)    Dev: Known (%)    Dev: Unknown (%)
RELIG           97.63             91.26           94.67             45.44
CG-LIT          96.88             91.78           94.56             51.45
CRM-ADELHEID    98.38             93.61           96.17             56.59
CG-ADMIN        98.89             95.44           97.89             59.48

Table 6 Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We include the baseline offered by the Memory-Based Tagger (MBT), which is especially competitive in the case of known tokens.

                MBT                                     Our system
Corpus          All (%)   Known (%)   Unknown (%)       All (%)   Known (%)   Unknown (%)
RELIG           88.50     94.03       19.65             90.97     94.50       47.04
CG-LIT          88.72     93.93       15.64             91.67     94.45       52.65
CRM-ADELHEID    91.87     95.90       29.44             93.95     96.15       59.86
CG-ADMIN        88.35     95.03       22.30             90.91     95.35       46.99

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is, of course, especially for the unknown words that the present optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform a single system.

The present article has placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters but leave out any contextual features, to be able to zoom in on the convolutional aspects of the model only. The results are somewhat surprising, in that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information can be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.

Table 7 Development results for two variants of the system: both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

                Only convolutions (5,000 filters)       Convolutions (5,000 filters) and a one-hot encoding of the focus token
Corpus          All (%)   Known (%)   Unknown (%)       All (%)   Known (%)   Unknown (%)
RELIG           89.45     92.69       45.97             87.59     91.67       32.83
CG-LIT          88.46     91.20       48.79             82.86     87.42       16.86
CRM-ADELHEID    89.58     92.28       50.57             87.96     91.73       33.38
CG-ADMIN        91.85     94.45       53.76             89.62     93.28       36.03

Table 8 Test results for two variants of the system: both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

                Only convolutions (5,000 filters)       Convolutions (5,000 filters) and a one-hot encoding of the focus token
Corpus          All (%)   Known (%)   Unknown (%)       All (%)   Known (%)   Unknown (%)
RELIG           88.95     92.38       46.38             87.37     91.53       35.55
CG-LIT          88.25     91.03       49.21             82.46     87.11       17.28
CRM-ADELHEID    89.91     92.24       53.88             88.63     92.00       36.40
CG-ADMIN        86.54     90.91       43.39             83.27     88.81       28.46

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the network before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we added colours to the word labels according to this cluster analysis as an aid in reading. This visualization strongly suggests that morpho-syntactic, orthographic, as well as semantic qualities of words are captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light blue (top left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word-level information: the spelling variants als and alse ('if') are, for instance, grouped. At the morpho-syntactic level, we see that the conjunctions want and maer cluster (light blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, and semantic qualities of words appear to be captured by the model.
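A plot along the lines of Figure 3 can be produced with the gensim and scikit-learn libraries named above, together with matplotlib. The following is a hedged sketch with our own variable names, a hypothetical corpus loader, and without the agglomerative colouring step; the gensim parameter names follow the library versions contemporary with this article.

```python
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# hypothetical loader returning tokenized sentences (assumption)
sentences = load_tokenized_corpus('cg-lit')

# skipgram embeddings of 150 dimensions, as described earlier in the article
w2v = Word2Vec(sentences, size=150, sg=1, min_count=2)

words = w2v.wv.index2word[:500]               # plot only the most frequent words
vectors = [w2v.wv[w] for w in words]
coords = TSNE(n_components=2, random_state=1).fit_transform(vectors)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y), fontsize=6)
plt.savefig('cg_lit_embeddings_tsne.png')
```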

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun ick ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter consonant. Here too, the filters allow these characters to be detected, to some extent, in various slots, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters we show, for each slot, the three characters with the highest activation across the alphabet (activation scores between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                          Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                  1     i (1.08)    k (1.06)    c (0.81)
                                   2     k (1.71)    c (1.47)    i (0.67)
                                   3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack',            1     a (0.69)    b (0.33)    l (0.30)
  'aak', 'aac', ...)               2     a (0.61)    c (0.59)    m (0.56)
                                   3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)      1     u (0.41)    a (0.29)    k (0.25)
                                   2     s (0.59)    v (0.48)    u (0.47)
                                   3     f (1.40)    v (1.33)    u (0.72)
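The inspection underlying Table 9 can be approximated with a few lines of numpy, assuming the trained convolutional kernel can be retrieved as an array of shape (filter_width, alphabet_size, n_filters), which is how keras stores Conv1D kernels, and that an alphabet list maps input dimensions back to characters. The names and the layer index in the usage comment are our assumptions, not the authors' code.

```python
import numpy as np

def top_chars_per_slot(conv_weights, alphabet, filter_id, top_n=3):
    """For one filter, list the top-N characters per convolutional slot."""
    rows = []
    for slot in range(conv_weights.shape[0]):
        acts = conv_weights[slot, :, filter_id]        # weights over the alphabet
        best = np.argsort(acts)[::-1][:top_n]          # indices of the highest weights
        rows.append([(alphabet[i], round(float(acts[i]), 2)) for i in best])
    return rows

# e.g. (layer index is an assumption about where the Conv1D layer sits):
# top_chars_per_slot(model.layers[2].get_weights()[0], alphabet, 17)
```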

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural network, with which excellent results have recently been obtained in various tasks across NLP. Of course, it would be perfectly feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).
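As a pointer for such follow-up work, one recurrent variant could be sketched as follows: a small convolutional stack over the focus token's characters, read by a bidirectional LSTM instead of a plain dense layer. This is purely illustrative; the layer sizes are not tuned values from this article.

```python
# Hedged sketch of a possible recurrent extension (not evaluated here).
from keras.layers import Input, Embedding, Conv1D, Bidirectional, LSTM, Dense
from keras.models import Model

char_in = Input(shape=(20,))                         # focus token as character indices
x = Embedding(50, 32)(char_in)                       # 50-character alphabet (assumption)
x = Conv1D(100, 3, activation='relu')(x)             # character convolutions
x = Bidirectional(LSTM(256))(x)                      # recurrent layer over the convolutions
lemma_out = Dense(15000, activation='softmax')(x)    # lemma inventory size (assumption)
recurrent_model = Model(inputs=char_in, outputs=lemma_out)
```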

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases could be obtained through more corpus-specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in practice only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.

Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A Hybrid Rule/Model-based Finite-state Framework for Normalizing SMS Messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.

Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.

Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–828.

Bollmann, M. (2012). (Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 3–14.

Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 27–38.

Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.

Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.

Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.

Chrupala, G. (2014). Normalizing Tweets with Edit Scripts and Recurrent Neural Embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 680–6.

Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based Active Learning for Lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.

De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.

Dieleman, S. and Schrauwen, B. (2014). End-to-end Learning for Music Audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.

Ernst-Gerlach, A. and Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.

Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.

Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating Manual Feature Engineering for Part-of-speech Tagging of Twitter Microposts Using Distributed Word Representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing, Montreal, s.p.

Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.

Gysseling, M. (1980–1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.

Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8): 1735–80.

Institute for Dutch Lexicography (1998). CD-ROM Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.

Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten Digit Recognition with a Back-propagation Network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.

Levy, O. and Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.

Ljubešić, N., Erjavec, T. and Fišer, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.

Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.

Manning, C. D., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.

Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.

Nair, V. and Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885–1929) Using Pattern Matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes

1. An example of over-eager expansion might, for instance, create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother'), or 'modder' ('mud').
2. In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3. https://code.google.com/archive/p/word2vec
4. https://radimrehurek.com/gensim
5. https://github.com/mikekestemont/tag
6. http://keras.io
7. The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8. Online at http://gtb.inl.nl (last accessed 6 November 2014).
9. See https://www.nederlab.nl/home (last accessed 6 November 2014).
10. Based on personal communication with the authors.



of size k This final layer might be important to havethe network obtain a good fit of known tokens belowwe will explore whether the convolutional layer alonehas sufficient capacity to reliably represent the entirevocabulary of training tokens

At this point we have four subnets in the networkeach of dimensionality k one resulting from the one-hot encoding of the input token one resulting fromthe convolutional encoding of the input token andtwo resulting from the embeddings subnets for theleft and right context Finally we concatenate theoutput of these subnets into a vector of size 4 kand project it onto an output layer which for eachclass in the sequence modelling task has a uniqueoutput neuron A standard lsquosoftmaxrsquo normalizationis applied to normalize the activations of theseoutput neurons as if they were probabilities (sum-ming to 1) All the parameter weights in this architec-ture are first initialized using a random distributionexcept for the contextual subnets which are initializedusing the word2vec weights We train the neural net-work using gradient descent Briefly put this learningalgorithm will start predicting a batch of the trainingexamples using the randomly initialized weights It willthen calculate the prediction loss in which this resultswith respect to the correct labels which will initially beextremely high Then it will determine how eachweight individually contributed to this loss and sub-sequently apply a small update to this weight which

would have decreased the prediction error which itcaused

During a fixed number of epochs (100 in ourcase) we feed the training data through this net-work divided in a number of randomly jumbledmini-batches each time tightening the fit of thenetwork with respect to the correct class labelsDuring training (see Figure 2) the loss of the net-work is not measured in terms of accuracy but interms of cross-entropy a common loss measure forclassification tasks We adopt a mini-batch size of50 this is a fairly small batch size compared to theliterature but this restricted batch size proved ne-cessary to avoid the higher-frequency items fromdominating the gradient updates We use the adap-tive so-called Adadelta optimization mechanism(Zeiler 2012) Importantly the Adadelta routineprevents the need to empirically set a learning rateat the beginning of training since the algorithm willindependently re-adapt this rate keeping track ofthe loss history on a per-weight basis

44 General commentsAll dense connections in this network take the formof a matrix multiplication and the addition of a bias(X Wthorn b) and make use of a lsquorelursquo activation(lsquorectified linear unitrsquo) which will ignore negativeweights outputted by the dense layer by setting themto zero Additionally all dense layers make use of

Fig 1 Schematic visualization of the network architecture evaluated in this article Embedding layers are used on bothsides of the focus token as lsquosubnetsrsquo to model the lexical context surrounding a token The token itself is primarilyrepresented using a convolutional subnet at the character-level (and potentially also an embedding layer) The results ofthese subnets are concatenated into a single hidden representation which is used to predict the correct lemma at the toplayer of the network

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 9 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

the so-called dropout procedure during training(with pfrac14 050) meaning that in each mini-batchhalf of the connections in the dense layers are ran-domly dropped (Srivastava et al 2014) This tech-nique has various beneficial effects on training themost important one being that it forces the networkto be less dependent on the specific presence of cer-tain features in the input thereby avoiding over-fitting on specific instances in the training dataAn implementation of the proposed architecture isfreely available from the public software repositoryassociated with this paper5 Where possible this re-pository also holds the data used below Our systemis implemented in Python using the keras librarybuilt on top of theano (Bastien et al 2012) thelatter library provides automatic differentiation forarbitrary graphs and allows to execute code on theGraphics Processing Unit (GPU)6 Other majordependencies of our code include scikit-learn(Pedregosa et al 2011) and gensim

5 Data Sets

In this article we report results on four differentMiddle Dutch data sets (see Table 2) which offera representative example of a medieval Europeanlanguage Van Kerckvoorde (1992) We shouldstress that our system is fairly language independentand could easily be retrained on other types of cor-pora as long as they use a character alphabet In thisarticle we first of all use two administrative datasets representing charter collections The CG-ADMIN

contains the charter collection edited by Gysseling(1977) which has been digitized and annotated bythe Institute for Dutch Lexicography (1998) Thesecharters all survive in originals predating ca 1300AD Secondly we use the CRM-ADELHEID collection acomparable collection of fourteenth century MiddleDutch charters which has been the subject of astudy comparable to ours (Van Halteren and Rem2013) As literary materials we first of all use theliterary counterpart of CG-ADMIN in the Corpus-Gysseling a collection of Middle Dutch literarytexts that all survive in manuscript copies predatingca 1300 AD (Gysseling 1980-1987 Institute Dutchfor Lexicography 1998 Kestemont et al 2010)

Thirdly we also use a newly created smallercorpus of Middle Dutch religious texts (eg ser-mons bible translation visions ) most ofwhich are at least semi-literary in nature This lastdata set (RELIG) is contained in our code repository

These data sets have all been annotated usingcommon annotation guidelines although theyeach show a number of important idiosyncrasiesTable 4 shows an example of the data Generallyspeaking we adopt the annotation practice whichhas been developed by Piet van Reenen et al andwhich has been carefully documented in the contextof the release of the Adelheid tagger-lemmatizer forMiddle Dutch charters (Van Halteren and Rem2013)7

Each token in the corpora has been annotatedwith a normalized dictionary headform or lemmain a present-day spelling Broadly speaking lemma-tization allows us to abstract over individual in-stances of tokens which only differ in inflection ororthography (Knowles and Mohd Don 2004) Forhistorical corpora this has interesting applicationsin the context of database searching (diachronic)topic modelling or stylometry The lemmas usedin our corpora all correspond to entries in theIntegrated Language Database (GTB) for the

Table 4 Example of some training data from the RELIG

corpus from the Seven Manieren van Heilige Minnen

(lsquoSeven Ways of Holy Loversquo) by the mystical poetess

Beatrice of Nazareth (lsquoThese are seven ways of loving

which come from the highestrsquo) The first column shows

the original input token the second the lemma label Note

that some forms require a composite lemma eg uitthorn de

for uten)

Token Lemma

seuen zeven

maniren manier

sin zijn

van van

minnen minne

die die

comen komen

uten uitthornde

hoegsten hoogste

M Kestemont et al

10 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Dutch language which can be consulted online andwhich is maintained by the Institute for DutchLexicography8 The major advantage of using a pre-sent-day lemma in this context is that this allowsscholars to normalize texts from multiple historicstages of Dutch to the same variant This in turnallows interesting diachronic analyses of Dutch cor-pora such as the ones currently prepared in thelarge-scale Nederlab project9

6 Evaluation

To train and evaluate our system we apply a con-ventional split of the available data into a trainingset (80) a development set (10) and a held-outtest set (10) see Table 3 While we report the per-formance and loss for the training and developmentdata the final performance of our system is reportedby evaluating a system on the held-out test data (thesystem trained on the training data only ie notincluding the development data) We evaluate thesystemrsquos performance on the development and testdata in terms of overall accuracy as well as the ac-curacy for known and unknown words (at test time)separately

For the literary and administrative data wefollow two different strategies to divide the availabledata into training development and test sets Forthe administrative data (the charter collections CG-ADMIN and CRM-ADELHEID) we follow the approach byVan Halteren and Rem (2013) to make our splittingas comparable as possible to theirs10 Each charter inthese collections is associated with a historic datewe first sort all the available charters in each corpusaccording to their date and assign a rank to eachcharter We then divide the material by assigning tothe development set all charters which have a rankindex ending in 9 and to the held-out test set allthose with an index ending in 0 The training ma-terial consists of the rest of the charters The advan-tage of this splitting approach is that charters fromvarious time periods and regions are evenly distrib-uted over the sets

For the literary data sets (CG-LIT and RELIG) thesituation is more complex since we only have at ourdisposal a limited set of texts that greatly vary in

length (especially for CG-LIT) To handle this situ-ation we adopt the approach suggested byKestemont et al (2010) We first assign an indexto each item in the texts and normalize these tothe range 000ndash100 In the development set weput all items with an index between 060 and 070and in the test set all items with an index between070 and 080 The remaining items are appended tothe training material These splits are not ideal inthe sense that the system operates more lsquoin domainrsquothan would be the case with truly free textUnfortunately the enormous divergence betweenthe length of the literary texts renders other splittingapproaches (eg leave-one-text-out validation) evenless interesting A clear advantage of the evaluationstrategy is nevertheless that it allows to assess howwell the system is able to generalize to other por-tions in a text given annotated material from thesame text This is a valuable property of the evalu-ation procedure because medieval texts also displaya significant amount of intra-text variation (ie thesame text often uses many different spellings for thesame word) Moreover this strategy is insightfulwhen it comes to incremental or active learningwhereby scholars would first tag a part of the textand retrain the system before tackling the rest of atext

7 Results

In Tables 5ndash8 we report the results of variants of ourfull architecture on all four data sets As a baselinewe also report scores for the lsquoMemory-BasedTaggerrsquo (MBT) a popular sequence tagger (espe-cially for modern Dutch) which is based on a near-est-neighbour reasoning approach While this taggeris typically used for PoS tagging it is equally usefulfor lemmatization in the classification frameworkwe adopt here We left most of the parameter set-tings at their default values except for the followingwe set the global matching algorithm to a brutenearest neighbour search and for unknown wordswe set the global metric to the Levenshtein distanceAdditionally to account for the spelling variation inthe material we allowed a set of 1000 high-fre-quency function words to be included in the

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 11 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

features used for the contextual disambiguationBecause of the experimental setup using a separatetrain development and test split of the availabledata our results are not directly comparable to pre-vious studies in this topic which typically used

10-fold cross validation This is a computationallyintensive setup which is not realistic for our neuralapproach given the considerable time required fortraining Finally our training data only covers ca90 of the training data that other systems haveavailable in a 10-fold cross validation setup withouta separate development set This means that oursystem can be expected to have a slightly worse lex-ical coverage so that a minor drop in overall accur-acy can be expected here

The main results are reported in Table 5 and 6where we list the results for the train developmentand test splits for all four corpora In Figure 2 wehave added a visualization of the typical trainingprogress of the neural network (for the training onthe CG-LIT corpus) showing how the prediction losssteadily decreases over epochs whereas the lemma-tization accuracy gradually increases For the test

Fig 2 Visualization of the typical training progress of the neural network (CG-LIT) showing how the prediction losssteadily decreasing over epochs whereas the lemmatization accuracy gradually increases

Table 5 Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus/Phase      Train       Dev
Subset            All (%)     All (%)    Known (%)   Unknown (%)
RELIG             97.63       91.26      94.67       45.44
CG-LIT            96.88       91.78      94.56       51.45
CRM-ADELHEID      98.38       93.61      96.17       56.59
CG-ADMIN          98.89       95.44      97.89       59.48

For the test results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known-word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown-word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that, for large enough corpora, the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is especially for the unknown words that the present, optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens, such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters, but leave out any contextual features, to be able to zoom in on the convolutional aspects of the model only.

Table 6 Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We included the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens.

Corpus/Method     MBT                                   Our system
Subset            All (%)   Known (%)   Unknown (%)     All (%)   Known (%)   Unknown (%)
RELIG             88.50     94.03       19.65           90.97     94.50       47.04
CG-LIT            88.72     93.93       15.64           91.67     94.45       52.65
CRM-ADELHEID      91.87     95.90       29.44           93.95     96.15       59.86
CG-ADMIN          88.35     95.03       22.30           90.91     95.35       46.99

The results are somewhat surprising, in that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information is backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.
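
For readers who want to experiment with this contrast, the following sketch shows what the two focus-token variants could look like in the current Keras functional API (the article used an earlier Keras/Theano version; the maximum token length, alphabet size, vocabulary size, and number of lemma classes are placeholder assumptions):

    from keras.layers import Input, Conv1D, Flatten, Dense, concatenate
    from keras.models import Model

    # Illustrative dimensions; only the 5,000 filters, the three-character
    # filter width and the dense size of 1,024 follow the article.
    MAX_LEN, ALPHABET, VOCAB, N_LEMMAS = 20, 40, 10000, 6000

    # Convolutional subnet over the character-level encoding of the focus token.
    chars = Input(shape=(MAX_LEN, ALPHABET), name='focus_chars')
    conv = Dense(1024, activation='relu')(Flatten()(Conv1D(5000, 3, activation='relu')(chars)))

    # Optional second branch: a dense projection of a one-hot encoding of the token.
    onehot = Input(shape=(VOCAB,), name='focus_onehot')
    token = Dense(1024, activation='relu')(onehot)

    out_conv_only = Dense(N_LEMMAS, activation='softmax')(conv)
    out_combined = Dense(N_LEMMAS, activation='softmax')(concatenate([conv, token]))

    model_conv_only = Model(chars, out_conv_only)
    model_combined = Model([chars, onehot], out_combined)
    for m in (model_conv_only, model_combined):
        m.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy'])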

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the networks before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot and added colours to the word labels according to this cluster analysis, as an aid in reading. This visualization suggests that morpho-syntactic, orthographic, as well as semantic qualities of words are captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light blue (top left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word-level information.

Table 7 Development results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus/Method     Only convolutions (5,000 filters)     Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset            All (%)   Known (%)   Unknown (%)     All (%)   Known (%)   Unknown (%)
RELIG             89.45     92.69       45.97           87.59     91.67       32.83
CG-LIT            88.46     91.20       48.79           82.86     87.42       16.86
CRM-ADELHEID      89.58     92.28       50.57           87.96     91.73       33.38
CG-ADMIN          91.85     94.45       53.76           89.62     93.28       36.03

Table 8 Test results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus/Method     Only convolutions (5,000 filters)     Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset            All (%)   Known (%)   Unknown (%)     All (%)   Known (%)   Unknown (%)
RELIG             88.95     92.38       46.38           87.37     91.53       35.55
CG-LIT            88.25     91.03       49.21           82.46     87.11       17.28
CRM-ADELHEID      89.91     92.24       53.88           88.63     92.00       36.40
CG-ADMIN          86.54     90.91       43.39           83.27     88.81       28.46

The spelling variants als and alse ('if'), for instance, are grouped together. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster together (light blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
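
A figure of this kind can be reproduced along the following lines with gensim, scikit-learn, and matplotlib; apart from the 150-dimensional skipgram setting, all parameter values (number of words plotted, number of clusters, file names) are illustrative assumptions rather than the settings used for Figure 3:

    from gensim.models import Word2Vec
    from sklearn.manifold import TSNE
    from sklearn.cluster import AgglomerativeClustering
    import matplotlib.pyplot as plt

    # 'corpus.txt' stands in for the tokenized training text (assumption).
    sentences = [line.split() for line in open('corpus.txt', encoding='utf-8')]

    # Skipgram embeddings of 150 dimensions, as in the article (recent gensim API).
    model = Word2Vec(sentences, vector_size=150, sg=1, min_count=5)

    words = model.wv.index_to_key[:500]          # restrict the plot to frequent tokens
    vectors = model.wv[words]

    coords = TSNE(n_components=2).fit_transform(vectors)               # 2D projection
    clusters = AgglomerativeClustering(n_clusters=8).fit_predict(vectors)

    plt.figure(figsize=(12, 12))
    plt.scatter(coords[:, 0], coords[:, 1], c=clusters, cmap='tab10', s=5)
    for (x, y), word in zip(coords, words):
        plt.annotate(word, (x, y), fontsize=6)
    plt.savefig('embeddings_tsne.png', dpi=300)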

At the sub-word level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop (c, k, or ck). This filter would be useful to detect the frequent pronoun ick ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter consonant.

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, and semantic qualities of words appear to be captured by the model.

Here too, the filters allow these characters, to some extent, to be detected in various slots of the filter, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.
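
One simple way to approximate such a per-slot inspection is to read the characters with the largest weights directly off a trained convolutional kernel, as in the numpy sketch below (layer names, the alphabet list, and the weight-based reading are assumptions, not the authors' exact procedure):

    import numpy as np

    def top_characters_per_slot(kernel, alphabet, filter_id, k=3):
        """For one convolutional filter, list the k characters with the largest
        weight in each slot (the weight equals the activation contributed by a
        one-hot character input at that position).

        kernel:   array of shape (filter_width, alphabet_size, n_filters),
                  e.g. the first weight tensor of a trained Conv1D layer.
        alphabet: list mapping one-hot indices back to characters (assumption).
        """
        rows = []
        for slot in range(kernel.shape[0]):
            weights = kernel[slot, :, filter_id]
            best = np.argsort(weights)[::-1][:k]
            rows.append([(alphabet[i], round(float(weights[i]), 2)) for i in best])
        return rows

    # Hypothetical usage with a trained Keras model:
    # kernel = model.get_layer('focus_conv').get_weights()[0]
    # print(top_characters_per_slot(kernel, alphabet, filter_id=17))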

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, with which excellent results have recently been obtained in various tasks across NLP. Of course, it would be perfectly feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).
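
To make the LSTM suggestion concrete, a recurrent layer could simply replace the flattened dense projection over the convolutional features, roughly as in the sketch below; this is an illustration with placeholder dimensions, not a configuration evaluated in this article:

    from keras.layers import Input, Conv1D, LSTM, Bidirectional, Dense
    from keras.models import Model

    MAX_LEN, ALPHABET, N_LEMMAS = 20, 40, 6000   # illustrative assumptions

    chars = Input(shape=(MAX_LEN, ALPHABET))
    conv = Conv1D(250, 3, activation='relu')(chars)   # smaller filter bank for brevity
    # A (bidirectional) LSTM consumes the sequence of convolutional feature
    # vectors instead of the flattened dense projection used in the article.
    state = Bidirectional(LSTM(256))(conv)
    out = Dense(N_LEMMAS, activation='softmax')(state)

    model = Model(chars, out)
    model.compile(optimizer='adadelta', loss='categorical_crossentropy')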

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more corpus-specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                        Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                1     i (1.08)    k (1.06)    c (0.81)
                                 2     k (1.71)    c (1.47)    i (0.67)
                                 3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack',          1     a (0.69)    b (0.33)    l (0.30)
  'aak', 'aac', ...)             2     a (0.61)    c (0.59)    m (0.56)
                                 3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)    1     u (0.41)    a (0.29)    k (0.25)
                                 2     s (0.59)    v (0.48)    u (0.47)
                                 3     f (1.40)    v (1.33)    u (0.72)

References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.
Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.
Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.
Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828.
Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 3–14.
Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 27–38.
Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.
Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.
Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco: European Language Resources Association, pp. 2362–7.
Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 680–6.
Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based active learning for lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.
De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.
Dieleman, S. and Schrauwen, B. (2014). End-to-end learning for music audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.
Ernst-Gerlach, A. and Fuhr, N. (2006). Generating search term variants for text collections with historic spellings. In Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.
Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating manual feature engineering for part-of-speech tagging of Twitter microposts using distributed word representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, s.p.
Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.
Gysseling, M. (1980–1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.
Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.
Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8): 1735–80.
Institute for Dutch Lexicography (1998). CD-ROM Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.
Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.
Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.
Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words: memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.
Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.
LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.
Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.
Ljubešić, N., Erjavec, T. and Fišer, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.
Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.
Manning, C. D., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.
Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.
Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.
Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.
Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.
Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.
Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.
Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.
Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.
Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch: author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.
Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.
Van der Voort van der Kleij, J. (2005). Reverse lemmatization of the Dictionary of Middle Dutch (1885–1929) using pattern matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.
Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.
Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.
Zavrel, J. and Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.
Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. ArXiv, 1212.5701v1.
Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes

1 An example of over-eager expansion might, for instance, create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother'), or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.

length (especially for CG-LIT) To handle this situ-ation we adopt the approach suggested byKestemont et al (2010) We first assign an indexto each item in the texts and normalize these tothe range 000ndash100 In the development set weput all items with an index between 060 and 070and in the test set all items with an index between070 and 080 The remaining items are appended tothe training material These splits are not ideal inthe sense that the system operates more lsquoin domainrsquothan would be the case with truly free textUnfortunately the enormous divergence betweenthe length of the literary texts renders other splittingapproaches (eg leave-one-text-out validation) evenless interesting A clear advantage of the evaluationstrategy is nevertheless that it allows to assess howwell the system is able to generalize to other por-tions in a text given annotated material from thesame text This is a valuable property of the evalu-ation procedure because medieval texts also displaya significant amount of intra-text variation (ie thesame text often uses many different spellings for thesame word) Moreover this strategy is insightfulwhen it comes to incremental or active learningwhereby scholars would first tag a part of the textand retrain the system before tackling the rest of atext

7 Results

In Tables 5ndash8 we report the results of variants of ourfull architecture on all four data sets As a baselinewe also report scores for the lsquoMemory-BasedTaggerrsquo (MBT) a popular sequence tagger (espe-cially for modern Dutch) which is based on a near-est-neighbour reasoning approach While this taggeris typically used for PoS tagging it is equally usefulfor lemmatization in the classification frameworkwe adopt here We left most of the parameter set-tings at their default values except for the followingwe set the global matching algorithm to a brutenearest neighbour search and for unknown wordswe set the global metric to the Levenshtein distanceAdditionally to account for the spelling variation inthe material we allowed a set of 1000 high-fre-quency function words to be included in the

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 11 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

features used for the contextual disambiguationBecause of the experimental setup using a separatetrain development and test split of the availabledata our results are not directly comparable to pre-vious studies in this topic which typically used

10-fold cross validation This is a computationallyintensive setup which is not realistic for our neuralapproach given the considerable time required fortraining Finally our training data only covers ca90 of the training data that other systems haveavailable in a 10-fold cross validation setup withouta separate development set This means that oursystem can be expected to have a slightly worse lex-ical coverage so that a minor drop in overall accur-acy can be expected here

The main results are reported in Table 5 and 6where we list the results for the train developmentand test splits for all four corpora In Figure 2 wehave added a visualization of the typical trainingprogress of the neural network (for the training onthe CG-LIT corpus) showing how the prediction losssteadily decreases over epochs whereas the lemma-tization accuracy gradually increases For the test

Fig 2 Visualization of the typical training progress of the neural network (CG-LIT) showing how the prediction losssteadily decreasing over epochs whereas the lemmatization accuracy gradually increases

Table 5 Training and development results (lemmatiza-

tion accuracy for all known and unknown words) of

the full systems for each of the four data sets Note that

for the training data we cannot differentiate between

known and unknown tokens

CorpusPhase Train Dev

SUBSET All () All () Known () Unknown ()

RELIG 9763 9126 9467 4544

CG-LIT 9688 9178 9456 5145

CRM-ADELHEID 9838 9361 9617 5659

CG-ADMIN 9889 9544 9789 5948

M Kestemont et al

12 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

results we include the results for MBT As can beseen the algorithm obtains a reasonably good fit ofthe training data with overall scores above 96Naturally there is a considerable drop when apply-ing the trained system to the development and testsplits Importantly there is no large difference be-tween the development and test scores with the ex-ception of the CG-ADMIN test split which does seemto be much more difficult than the developmentsplit Overall the RELIG and CG-LIT appear to be themost difficult corporamdashthe former arguably be-cause of its limited size the latter because of therich diversity in text varieties which it includesBoth charter corpora yield similar scores withCRM-ADELHEID being slightly more difficult than CG-ADMIN The scores for known word accuracies arefairly solid and always above 94 (in the case ofCG-ADMIN even almost 98) As can be expectedthe unknown word accuracies are much lower ran-ging from 4555 for the smaller RELIG corpus tothe encouraging 5948 which we can report for theCG-ADMIN corpus All in all this suggests that forlarge enough corpora the present system can accur-ately lemmatize over half of the unknown words inunseen material

As can be gleaned from Table 6 MBT yields avery strong baseline for the known words in theevaluation splits and for this category the newsystem only marginally outperforms the baselineIt is of course especially for the unknown wordsthat the present optimized system yields superiorresults where the improvement spans dozens ofpercentage points In comparison to other studiesthe results appear to be strong Kestemont et al

(2010) report a lemmatization of 4552 for theunknown words in the CG-LIT corpus which the pre-sent approach clearly improves upon (5265) not-withstanding the fact that it has access to lesstraining data Van Halteren and Rem (2013) use10-fold cross-validation on the CRM-ADELHEID andreport an overall lemmatization accuracy of9497 the present system is just below that score(9395) Van Halteren and Rem (2013) do notreport scores for known and unknown words sep-arately which makes it difficult to speculate aboutthis issue but apart from the validation procedureitself the fact that the present system has less train-ing data available might be responsible for this dropAdditionally their tagger is in fact based on a meta-learner which combines the output of several inde-pendent taggers an approach which will almostalways outperform the performance of a singlesystem

The present article placed a lot of emphasis onthe use of character-level convolutions as a replace-ment for the traditional one-hot encoding For un-known tokens such an approach seemsadvantageous but the danger of course exists thatsuch a modelling strategy will lack the capacity toeffectively model known words for which a one-hotencoding does seem a viable approach In Tables 7and 8 we present the results of two version of thesystem the first version only includes a convolu-tional model of the focus tokens whereas thesecond version adds a one-hot embedding of thefocus tokens Both versions use 5000 filters butleave out any contextual features to be able tozoom in on the convolutional aspects of the

Table 6 Final test results (lemmatization accuracy for all known and unknown words) of the full systems for each of

the four data sets We included the baseline offered by the Memory-Based Tagger which is especially competitive in the

case of known tokens

Corpusmethod MBT Our system

Subset All Known Unknown All Known Unknown

RELIG 8850 9403 1965 9097 9450 4704

CG-LIT 8872 9393 1564 9167 9445 5265

CRM-ADELHEID 9187 9590 2944 9395 9615 5986

CG-ADMIN 8835 9503 2230 9091 9535 4699

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 13 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

model only The results are somewhat surprising inthat it appears that the first convolutional-only ver-sion consistently outperforms the system that in-cludes a one-hot embedding Interestingly thedrop in performance across all corpora comesfrom the unknown words it seems that the presenceof the one-hot embeddings dominates the optimiza-tion process so that much less information is ableto be backpropagated to the convolutional subnetleading to a much looser fit of the unknown vo-cabulary in the test data

8 Discussion

What does the presented architecture learn To ad-dress this interpretative question we offer two in-tuitive visualizations In Figure 3 we plot theembeddings which are learned by the initial word2-vec model (Mikolov et al 2013) which is fed into

the embedding layers of the networks before train-ing commences (here for the CG-LIT corpus) We usethe t-SNE algorithm (Van der Maaten and Hinton2008) which is commonly used to visualize suchembeddings in two-dimensional scatterplots Wehave performed a conventional agglomerative clus-ter analysis on top of the scatterplot and we addedcolours to the word labels according to this clusteranalysis as an aid in reading This visualizationstrongly suggests that morpho-syntactic ortho-graphic as well as semantic qualities of wordsappear to be captured by the initial embeddingsmodel In purple to the right we find the namesof the biblical evangelists (mattheus marcus )which clearly cluster together on the basis of seman-tic qualities In light-blue (top-left) we find a se-mantic grouping of the synonyms seide and sprac(lsquo[he] saidrsquo) Orthographic similarity is also cap-tured which is remarkable because the embeddingsmodel does not have access to sub-word level

Table 7 Developments results for two variants of the system both variants have a convolutional subnet (5000 filters)

but no contextual subnets Importantly the second system adds an embeddings layer which might allow a tighter fit of

the known words than the convolutional-only variant

Corpusmethod Only convolutions (5000 filters) Convolutions (5000 filters) and a

one-hot encoding of the focus token

Subset All () Known () Unknown () All () Known () Unknown ()

RELIG 8945 9269 4597 8759 9167 3283

CG-LIT 8846 9120 4879 8286 8742 1686

CRM-ADELHEID 8958 9228 5057 8796 9173 3338

CG-ADMIN 9185 9445 5376 8962 9328 3603

Table 8 Test results for two variants of the system both variants have a convolutional subnet (5000 filters) but no

contextual subnets Importantly the second system adds an embeddings layer which might allow a tighter fit of the

known words than the convolutional-only variant

CorpusMethod Only convolutions (5000 filters) Convolutions (5000 filters) and a

one-hot encoding of the focus token

Subset All () Known () Unknown () All () Known () Unknown ()

RELIG 8895 9238 4638 8737 9153 3555

CG-LIT 8825 9103 4921 8246 8711 1728

CRM-ADELHEID 8991 9224 5388 8863 9200 3640

CG-ADMIN 8654 9091 4339 8327 8881 2846

M Kestemont et al

14 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

information the spelling variants als and alse (lsquoifrsquo)are for instance grouped At the morpho-syntacticlevel we see that the subordinating conjunctionswant and maer cluster (light-blue) Even colloca-tional patterns are present eg ict wi and laswhich cluster together in green) a clusteringwhich seems to stem from the collocation lsquoas Iwereadrsquo

At the subword level it is equally interesting toattempt to inspect the filters which have beenlearned at the character level In Table 9 we offera representation of a number of interesting filterswhich have been learned For each position in theconvolutional slots we record the three characterswhich had the highest activation with respect to thisspecific filter While this discussion is necessarilyanecdotal these results do suggest that the

convolutional filters have indeed become sensitiveto both morphemic and orthographic informationFilter 17 for instance is clearly sensitive to the com-bination of the vowel i and a velar stop c k or ckThis filter would be useful to detect the frequentpronoun lsquoickrsquo (lsquoIrsquo) but might be equally useful forsimilar morphemes inside words (eg sticke lsquopiecersquo)Filter 40 on the other hand would be sensitive tothe presence of these velar stops in combinationswith the front vowel a (potentially long aa) Heretoo we clearly notice a certain flexibility in the filteras to the exact spelling of the velar stop or the lengthof the vowel Filter 8 finally does not seem sensitiveto a particular consonant-vowel combination butrather seems to pick up the mere presence of thelabio-dental fricative f or v in the final slot of the fil-ter including the allographic spelling u for the latter

Fig 3 Two-dimensional scatterplot created by the t-SNE algorithm representing the embeddings learned by theword2vec pretrainer (for the CG-LIT corpus) These embeddings are eventually fed into the embedding layers of thenetwork before training commences (so that they can be further optimized in the light of a given task) Both morpho-syntactic orthographic and semantic qualities of words appear to captured by the model

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 15 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

consonant Here too the filters allow these charac-ters to some extent to be detected in various slots inthe filters thereby illustrating how the filters haveobtained a certain translational invariance as is typ-ically achieved in image recognition

9 Future Research

While our system yields excellent results across alldata sets a number of interesting issues remain tobe addressed in future research First of all ourarchitecture does not consider any form of recurrentneural networks through which recently excellentresults have been obtained in various tasks acrossNLP Of course it would be well feasible to runeg an Long Short-Term Memory (LSTM) layer(cf Hochreiter and Schmidhuber 1997) over ourconvolutions or embeddings instead of a traditionaldense layer Preliminary experiments on these spe-cific data sets did not suggest that dramatic per-formance increases can be gained here (perhapsbecause of the limited length of the sequences con-sidered here) but further research is needed in thisrespect Additionally it might be worthwhile tocombine LSTMs with an approach which does not

only slide convolutions over the focus token butalso over the lexical context thus making thesystem more robust towards unknown tokens inthe focus tokenrsquos context as well Bidirectionallayers are another interesting extension in this re-spect since the present model only considers astraightforward left-to-right strategy Another pos-sible extension would be to pre-train certain com-ponents of the model (eg embeddings or characterconvolutions) on larger unannotated data sets (pro-vided these are available) Further research shouldalso overcome the limitation of the present systemthat it cannot predict new unseen lemmas at testtime Here too recurrent network architectures thatare able to generate new strings on the basis of aninput string will prove a valuable approach (LeCunet al 2015)

One strong component of the system is that asingle hyperparameterization yields strong resultsacross a variety of corpora This suggests that fur-ther performance increases can be obtained throughmore specific fine-tuning Additionally our systemreaches high scores although it is only trained on80 of the available data other papers on this fieldhave used 10-fold cross-validation and thus averageover systems which have available 90 of the train-ing data in each run As previously mentioned how-ever the different evaluation procedures make itdifficult to compare these results directly Thegood results obtained are also remarkable in lightof the fact that our system does not have access tomany of the conventional lsquobells and whistlesrsquo whichother taggers offer (such as removing highly uncom-mon tags for certain tokens) Generally speakingone serious drawback is that training this kind ofarchitecture is extremely time-consuming and is inreality only feasible on GPUs which still come withserious memory limitations

AcknowledgementsWe gratefully acknowledge the support of NVIDIACorporation with the donation of the TITAN Xused for this research Additionally the authorswould like to thank Dr Sander Dieleman for hisvaluable feedback on earlier drafts of this article

Table 9 Anecdotal representation of a number of inter-

esting convolutional filters learned by the network after

training on the CG-LIT corpus For three example filters we

show for each slot the three characters with the highest

activation across the alphabet (activation scores shown

between brackets) The patterns suggest that the network

has become sensitive to both morphemic and ortho-

graphic features but is flexible as to where these patterns

should occur in the filterrsquos slots

Filter ID                        Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                 1    i (1.08)    k (1.06)    c (0.81)
                                  2    k (1.71)    c (1.47)    i (0.67)
                                  3    k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack',           1    a (0.69)    b (0.33)    l (0.30)
  'aak', 'aac', ...)              2    a (0.61)    c (0.59)    m (0.56)
                                  3    a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)     1    u (0.41)    a (0.29)    k (0.25)
                                  2    s (0.59)    v (0.48)    u (0.47)
                                  3    f (1.40)    v (1.33)    u (0.72)


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N. and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.
Beaufort, R., Roekhaut, S., Cougnon, L. and Fairon, C. (2010). A Hybrid Rule/Model-based Finite-state Framework for Normalizing SMS Messages. In Hajic, J., Carberry, S., Clark, S. and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.
Bengio, Y., Ducharme, R., Vincent, P. and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.
Bengio, Y., Courville, A. and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828.
Bollmann, M. (2012). (Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 3–14.
Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 27–38.
Cahieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J. and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.
Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.
Chrupala, G., Dinu, G. and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation. Marrakech, Morocco: European Language Resources Association, pp. 2362–7.
Chrupala, G. (2014). Normalizing Tweets with Edit Scripts and Recurrent Neural Embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 680–6.
Daelemans, W., Groenewald, H. J. and Van Huyssteen, G. B. (2009). Prototype-based Active Learning for Lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R. and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.
De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.
Dieleman, S. and Schrauwen, B. (2014). End-to-end Learning for Music Audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.
Ernst-Gerlach, A. and Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T. and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.
Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.
Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W. and Van de Walle, R. (2014). Alleviating Manual Feature Engineering for Part-of-speech Tagging of Twitter Microposts Using Distributed Word Representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, s.p.
Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden, 9 vols. The Hague: Nijhoff.
Gysseling, M. (1980–1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten, 6 vols. The Hague: Nijhoff.
Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.
Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8): 1735–80.
Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.
Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.
Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.
Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.
Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.
LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten Digit Recognition with a Back-propagation Network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.
LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.
Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.
Levy, O. and Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.
Ljubesic, N., Erjavec, T. and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.
Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.
Manning, C. D., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.
Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.
Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.
Nair, V. and Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.
Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.
Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.
Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.
Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.
Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.
Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.
Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885-1929) Using Pattern Matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.
Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.
Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.
Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.
Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.
Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes

1 An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.


Page 10: Lemmatization for variation-rich languages using deep learningpdfs.semanticscholar.org/060c/e6591fc3b95cb7a89643f0d9ae88a9a5ad43.pdftask has long been considered solved for modern

the so-called dropout procedure during training(with pfrac14 050) meaning that in each mini-batchhalf of the connections in the dense layers are ran-domly dropped (Srivastava et al 2014) This tech-nique has various beneficial effects on training themost important one being that it forces the networkto be less dependent on the specific presence of cer-tain features in the input thereby avoiding over-fitting on specific instances in the training dataAn implementation of the proposed architecture isfreely available from the public software repositoryassociated with this paper5 Where possible this re-pository also holds the data used below Our systemis implemented in Python using the keras librarybuilt on top of theano (Bastien et al 2012) thelatter library provides automatic differentiation forarbitrary graphs and allows to execute code on theGraphics Processing Unit (GPU)6 Other majordependencies of our code include scikit-learn(Pedregosa et al 2011) and gensim

5 Data Sets

In this article we report results on four differentMiddle Dutch data sets (see Table 2) which offera representative example of a medieval Europeanlanguage Van Kerckvoorde (1992) We shouldstress that our system is fairly language independentand could easily be retrained on other types of cor-pora as long as they use a character alphabet In thisarticle we first of all use two administrative datasets representing charter collections The CG-ADMIN

contains the charter collection edited by Gysseling(1977) which has been digitized and annotated bythe Institute for Dutch Lexicography (1998) Thesecharters all survive in originals predating ca 1300AD Secondly we use the CRM-ADELHEID collection acomparable collection of fourteenth century MiddleDutch charters which has been the subject of astudy comparable to ours (Van Halteren and Rem2013) As literary materials we first of all use theliterary counterpart of CG-ADMIN in the Corpus-Gysseling a collection of Middle Dutch literarytexts that all survive in manuscript copies predatingca 1300 AD (Gysseling 1980-1987 Institute Dutchfor Lexicography 1998 Kestemont et al 2010)

Thirdly we also use a newly created smallercorpus of Middle Dutch religious texts (eg ser-mons bible translation visions ) most ofwhich are at least semi-literary in nature This lastdata set (RELIG) is contained in our code repository

These data sets have all been annotated usingcommon annotation guidelines although theyeach show a number of important idiosyncrasiesTable 4 shows an example of the data Generallyspeaking we adopt the annotation practice whichhas been developed by Piet van Reenen et al andwhich has been carefully documented in the contextof the release of the Adelheid tagger-lemmatizer forMiddle Dutch charters (Van Halteren and Rem2013)7

Each token in the corpora has been annotatedwith a normalized dictionary headform or lemmain a present-day spelling Broadly speaking lemma-tization allows us to abstract over individual in-stances of tokens which only differ in inflection ororthography (Knowles and Mohd Don 2004) Forhistorical corpora this has interesting applicationsin the context of database searching (diachronic)topic modelling or stylometry The lemmas usedin our corpora all correspond to entries in theIntegrated Language Database (GTB) for the

Table 4 Example of some training data from the RELIG

corpus from the Seven Manieren van Heilige Minnen

(lsquoSeven Ways of Holy Loversquo) by the mystical poetess

Beatrice of Nazareth (lsquoThese are seven ways of loving

which come from the highestrsquo) The first column shows

the original input token the second the lemma label Note

that some forms require a composite lemma eg uitthorn de

for uten)

Token Lemma

seuen zeven

maniren manier

sin zijn

van van

minnen minne

die die

comen komen

uten uitthornde

hoegsten hoogste

M Kestemont et al

10 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Dutch language which can be consulted online andwhich is maintained by the Institute for DutchLexicography8 The major advantage of using a pre-sent-day lemma in this context is that this allowsscholars to normalize texts from multiple historicstages of Dutch to the same variant This in turnallows interesting diachronic analyses of Dutch cor-pora such as the ones currently prepared in thelarge-scale Nederlab project9

6 Evaluation

To train and evaluate our system, we apply a conventional split of the available data into a training set (80%), a development set (10%), and a held-out test set (10%); see Table 3. While we report the performance and loss for the training and development data, the final performance of our system is reported by evaluating a system on the held-out test data (the system trained on the training data only, i.e. not including the development data). We evaluate the system's performance on the development and test data in terms of overall accuracy, as well as the accuracy for known and unknown words (at test time) separately.
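To make this evaluation procedure explicit, the following sketch computes the three accuracies, assuming parallel lists of test tokens, gold lemmas, and predicted lemmas, together with the set of token types observed in the training data; all variable and function names are illustrative:

def lemma_accuracies(test_tokens, gold_lemmas, pred_lemmas, train_vocab):
    # Count correct predictions overall and separately for known and unknown tokens,
    # where 'known' means the token type was seen in the training data.
    counts = {'all': [0, 0], 'known': [0, 0], 'unknown': [0, 0]}
    for token, gold, pred in zip(test_tokens, gold_lemmas, pred_lemmas):
        bucket = 'known' if token in train_vocab else 'unknown'
        for key in ('all', bucket):
            counts[key][0] += int(pred == gold)
            counts[key][1] += 1
    return {key: correct / float(total) if total else None
            for key, (correct, total) in counts.items()}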

For the literary and administrative data, we follow two different strategies to divide the available data into training, development, and test sets. For the administrative data (the charter collections CG-ADMIN and CRM-ADELHEID), we follow the approach by Van Halteren and Rem (2013) to make our splitting as comparable as possible to theirs.10 Each charter in these collections is associated with a historic date: we first sort all the available charters in each corpus according to their date and assign a rank to each charter. We then divide the material by assigning to the development set all charters which have a rank index ending in 9, and to the held-out test set all those with an index ending in 0. The training material consists of the rest of the charters. The advantage of this splitting approach is that charters from various time periods and regions are evenly distributed over the sets.
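A minimal sketch of this date-based splitting procedure, assuming a list of charter objects that each carry a date attribute (the names are hypothetical):

# Sort the charters chronologically, rank them, and route ranks ending in 9
# to the development set and ranks ending in 0 to the held-out test set.
charters.sort(key=lambda charter: charter.date)
train, dev, test = [], [], []
for rank, charter in enumerate(charters, start=1):
    if rank % 10 == 9:
        dev.append(charter)
    elif rank % 10 == 0:
        test.append(charter)
    else:
        train.append(charter)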

For the literary data sets (CG-LIT and RELIG), the situation is more complex, since we only have at our disposal a limited set of texts that greatly vary in length (especially for CG-LIT). To handle this situation, we adopt the approach suggested by Kestemont et al. (2010). We first assign an index to each item in the texts and normalize these to the range 0.00-1.00. In the development set we put all items with an index between 0.60 and 0.70, and in the test set all items with an index between 0.70 and 0.80. The remaining items are appended to the training material. These splits are not ideal, in the sense that the system operates more 'in domain' than would be the case with truly free text. Unfortunately, the enormous divergence between the lengths of the literary texts renders other splitting approaches (e.g. leave-one-text-out validation) even less interesting. A clear advantage of the evaluation strategy is nevertheless that it allows us to assess how well the system is able to generalize to other portions in a text, given annotated material from the same text. This is a valuable property of the evaluation procedure, because medieval texts also display a significant amount of intra-text variation (i.e. the same text often uses many different spellings for the same word). Moreover, this strategy is insightful when it comes to incremental or active learning, whereby scholars would first tag a part of the text and retrain the system before tackling the rest of a text.
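Under the same caveat, the index-based strategy for the literary corpora can be sketched as follows, where items stands for the ordered sequence of annotated tokens of a corpus (again, the names are illustrative):

# Normalize each item's position to the range [0.0, 1.0] and slice off the
# 0.60-0.70 band for development and the 0.70-0.80 band for testing.
train, dev, test = [], [], []
for i, item in enumerate(items):
    position = i / float(len(items))
    if 0.60 <= position < 0.70:
        dev.append(item)
    elif 0.70 <= position < 0.80:
        test.append(item)
    else:
        train.append(item)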

7 Results

In Tables 5-8, we report the results of variants of our full architecture on all four data sets. As a baseline, we also report scores for the 'Memory-Based Tagger' (MBT), a popular sequence tagger (especially for modern Dutch), which is based on a nearest-neighbour reasoning approach. While this tagger is typically used for PoS tagging, it is equally useful for lemmatization in the classification framework we adopt here. We left most of the parameter settings at their default values, except for the following: we set the global matching algorithm to a brute nearest-neighbour search and, for unknown words, we set the global metric to the Levenshtein distance. Additionally, to account for the spelling variation in the material, we allowed a set of 1,000 high-frequency function words to be included in the
features used for the contextual disambiguation. Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup, which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases. For the test

Fig. 2 Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Table 5 Training and development results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus/Phase    Train       Dev
Subset          All (%)     All (%)   Known (%)   Unknown (%)
RELIG           97.63       91.26     94.67       45.44
CG-LIT          96.88       91.78     94.56       51.45
CRM-ADELHEID    98.38       93.61     96.17       56.59
CG-ADMIN        98.89       95.44     97.89       59.48


results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, RELIG and CG-LIT appear to be the most difficult corpora: the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that for large enough corpora, the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is of course especially for the unknown words that the present, optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong: Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform the performance of a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters, but leave out any contextual features, to be able to zoom in on the convolutional aspects of the

Table 6 Final test results (lemmatization accuracy for all, known, and unknown words) of the full systems for each of the four data sets. We included the baseline offered by the Memory-Based Tagger, which is especially competitive in the case of known tokens.

Corpus/Method   MBT                                  Our system
Subset          All (%)   Known (%)   Unknown (%)    All (%)   Known (%)   Unknown (%)
RELIG           88.50     94.03       19.65          90.97     94.50       47.04
CG-LIT          88.72     93.93       15.64          91.67     94.45       52.65
CRM-ADELHEID    91.87     95.90       29.44          93.95     96.15       59.86
CG-ADMIN        88.35     95.03       22.30          90.91     95.35       46.99


model only. The results are somewhat surprising, in that it appears that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information is able to be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.
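For readers who wish to reproduce this comparison, the second variant can be sketched with the functional API of a recent keras version as two parallel branches over the focus token that are concatenated before the output layer. This is a structural sketch only: the dimensions are illustrative assumptions and the contextual subnets of the full system are left out, as in the experiments above.

from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, Flatten, Dense, concatenate

max_len, alphabet_size, vocab_size, n_lemmas = 20, 40, 10000, 5000  # illustrative sizes

# Branch 1: character-level convolutions over the (one-hot encoded) focus token.
char_input = Input(shape=(max_len, alphabet_size))
conv = Flatten()(Conv1D(filters=5000, kernel_size=3, activation='relu')(char_input))

# Branch 2: a one-hot, index-based embedding of the focus token as a whole.
token_input = Input(shape=(1,), dtype='int32')
token_emb = Flatten()(Embedding(input_dim=vocab_size, output_dim=150)(token_input))

# Concatenate both representations and predict the lemma label.
output = Dense(n_lemmas, activation='softmax')(concatenate([conv, token_emb]))
model = Model(inputs=[char_input, token_input], outputs=output)
model.compile(optimizer='adadelta', loss='categorical_crossentropy')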

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the networks before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we added colours to the word labels according to this cluster analysis as an aid in reading. This visualization strongly suggests that morpho-syntactic and orthographic, as well as semantic, qualities of words appear to be captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light blue (top left), we find a semantic grouping of the synonyms seide and sprac ('[he] said'). Orthographic similarity is also captured, which is remarkable, because the embeddings model does not have access to sub-word level

Table 7 Development results for two variants of the system: both variants have a convolutional subnet (5,000 filters), but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus/Method   Only convolutions (5,000 filters)      Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset          All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG           89.45     92.69       45.97            87.59     91.67       32.83
CG-LIT          88.46     91.20       48.79            82.86     87.42       16.86
CRM-ADELHEID    89.58     92.28       50.57            87.96     91.73       33.38
CG-ADMIN        91.85     94.45       53.76            89.62     93.28       36.03

Table 8 Test results for two variants of the system: both variants have a convolutional subnet (5,000 filters), but no contextual subnets. Importantly, the second system adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant.

Corpus/Method   Only convolutions (5,000 filters)      Convolutions (5,000 filters) and a one-hot encoding of the focus token
Subset          All (%)   Known (%)   Unknown (%)      All (%)   Known (%)   Unknown (%)
RELIG           88.95     92.38       46.38            87.37     91.53       35.55
CG-LIT          88.25     91.03       49.21            82.46     87.11       17.28
CRM-ADELHEID    89.91     92.24       53.88            88.63     92.00       36.40
CG-ADMIN        86.54     90.91       43.39            83.27     88.81       28.46


information: the spelling variants als and alse ('if') are, for instance, grouped. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.
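Such a plot can be reproduced along the following lines, assuming a numpy array emb holding the pretrained embeddings and a parallel list words with the corresponding word types; the clustering and plotting parameters are illustrative:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering

# Project the embeddings to two dimensions and colour the word labels
# according to an agglomerative clustering of the original vectors.
coords = TSNE(n_components=2, random_state=1).fit_transform(emb)
clusters = AgglomerativeClustering(n_clusters=10).fit_predict(emb)

fig, ax = plt.subplots(figsize=(12, 12))
for (x, y), word, cluster in zip(coords, words, clusters):
    ax.text(x, y, word, color=plt.cm.tab10(cluster % 10), fontsize=8)
ax.set_xlim(coords[:, 0].min(), coords[:, 0].max())
ax.set_ylim(coords[:, 1].min(), coords[:, 1].max())
fig.savefig('embeddings_tsne.png')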

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the

convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun ick ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, and semantic qualities of words all appear to be captured by the model.


consonant. Here too, the filters allow these characters, to some extent, to be detected in various slots in the filters, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.
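The inspection reported in Table 9 can be approximated with the sketch below, assuming a numpy array weights of shape (number of filters, filter width, alphabet size) extracted from the trained convolutional layer, and a list alphabet mapping column indices to characters; the exact layout of the weight tensor depends on the backend, so this is an assumption:

import numpy as np

def top_characters(weights, alphabet, filter_id, n_top=3):
    # For each slot (position) of one filter, return the n_top characters
    # with the highest weight, i.e. the strongest activation for that slot.
    rows = []
    for position, slot in enumerate(weights[filter_id], start=1):
        best = np.argsort(slot)[::-1][:n_top]
        rows.append((position, [(alphabet[i], float(slot[i])) for i in best]))
    return rows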

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, through which excellent results have recently been obtained in various tasks across NLP. Of course, it would be well feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered here), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).
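As an illustration of the first suggestion, a character-level LSTM over the focus token could replace the dense layer roughly as follows; this is a hypothetical sketch, not one of the reported experiments, and all sizes are assumptions:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_len, alphabet_size, n_lemmas = 20, 40, 5000  # illustrative sizes

model = Sequential()
# Embed the characters of the (padded) focus token and run an LSTM over them,
# keeping only the final state as a summary of the token's orthography.
model.add(Embedding(input_dim=alphabet_size, output_dim=32, input_length=max_len))
model.add(LSTM(128))
model.add(Dense(n_lemmas, activation='softmax'))
model.compile(optimizer='adadelta', loss='categorical_crossentropy')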

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases could be obtained through more corpus-specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation, and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.

Table 9 Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters, we show for each slot the three characters with the highest activation across the alphabet (activation scores shown between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                            Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                    1     i (1.08)    k (1.06)    c (0.81)
                                     2     k (1.71)    c (1.47)    i (0.67)
                                     3     k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack',              1     a (0.69)    b (0.33)    l (0.30)
  'aak', 'aac', ...)                 2     a (0.61)    c (0.59)    m (0.56)
                                     3     a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)        1     u (0.41)    a (0.29)    k (0.25)
                                     2     s (0.59)    v (0.48)    u (0.47)
                                     3     f (1.40)    v (1.33)    u (0.72)


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673-721.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.

Beaufort, R., Roekhaut, S., Cougnon, L., and Fairon, C. (2010). A Hybrid Rule/Model-based Finite-state Framework for Normalizing SMS Messages. In Hajic, J., Carberry, S., Clark, S., and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770-9.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137-55.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798-1828.

Bollmann, M. (2012). (Semi-)Automatic Normalization of Historical Texts using Distance Measures and the Norma tool. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edicoes Colibri, pp. 3-14.

Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edicoes Colibri, pp. 27-38.

Cahieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J., and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.

Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.

Chrupala, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation 2010. Marrakech, Morocco: European Language Resources Association, pp. 2362-7.

Chrupala, G. (2014). Normalizing Tweets with Edit Scripts and Recurrent Neural Embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, Maryland: Association for Computational Linguistics, pp. 680-6.

Daelemans, W., Groenewald, H. J., and Van Huyssteen, G. B. (2009). Prototype-based Active Learning for Lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R., and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65-70.

De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303-18.

Dieleman, S. and Schrauwen, B. (2014). End-to-end Learning for Music Audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964-8.

Ernst-Gerlach, A. and Fuhr, N. (2006). Generating Search Term Variants for Text Collections with Historic Spellings. In Lalmas, M., MacFarlane, A., Ruger, S., Tombros, A., Tsikrika, T., and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49-60.

Firth, J. R. (1957). A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1-32.

Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W., and Van de Walle, R. (2014). Alleviating Manual Feature Engineering for Part-of-speech Tagging of Twitter Microposts Using Distributed Word Representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, 2015, s.p.

Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.

Gysseling, M. (1980-1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146-62.

Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65-76.


Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8): 1735-80.

Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655-65.

Kestemont, M., Daelemans, W., and De Pauw, G. (2010). Weigh your words: memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287-301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69-81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., and Henderson, D. (1990). Handwritten Digit Recognition with a Back-propagation Network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436-44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707-10.

Levy, O. and Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171-80.

Ljubesic, N., Erjavec, T., and Fiser, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, 8404. Berlin: Springer, pp. 164-75.

Lyras, D. P., Sgarbas, K. N., and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043-64.

Manning, C. D., Prabhaker, P., and Schutze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701-7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111-9.

Mitankin, P., Gerjikov, S., and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29-34.

Nair, V. and Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Furnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807-14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825-30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I., and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edicoes Colibri, pp. 87-98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H., and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58-62.

Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W., and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1-22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Francais. Traitement Automatique des Langues, 50(2): 149-72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929-58.

Toutanova, K., Klein, D., Manning, C., and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173-80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch: author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345-62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579-605.

Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885-1929) Using Pattern Matching. In Kiefer, F., Kiss, G., and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203-10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233-59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicacion Social. Santiago de Cuba: Centro de Linguistica Aplicada, pp. 590-7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes
1 An example of over-eager expansion might for instance create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.


Srivastava N Hinton G Krizhevsky A Sutskever Iand Salakhutdinov R (2014) Dropout a simple wayto prevent neural networks from overfitting Journal ofMachine Learning Research 15 1929ndash58

Toutanova K Klein D Manning C and Singer Y(2003) Feature-rich Part-of-speech Tagging witha Cyclic Dependency Network In Proceedingsof the 2003 Conference of the North AmericanChapter of the Association for ComputationalLinguistics on Human Language Technology Vol 1Stroudsburg Association for ComputationalLinguistics pp 173ndash80

Van Dalen-Oskam K and Van Zundert J (2007) Deltafor Middle Dutch ndash author and copyist distinction inWalewein Literary and Linguistic Computing 22(3)345ndash62

Van der Maaten L and Hinton G (2008) Visualizinghigh-dimensional data using t-SNE Journal of MachineLearning Research 92579ndash605

Van der Voort van der Kleij J (2005) ReverseLemmatization of the Dictionary of Middle Dutch(1885-1929) Using Pattern Matching In Kiefer FKiss G and Pajzs J (eds) Papers in ComputationalLexicography Budapest Hungarian Academy ofSciences pp 203ndash10

Van Halteren H and Rem M (2013) Dealing withorthographic variation in a tagger-lemmatizer for four-teenth century Dutch charters Language Resources andEvaluation 47(4) 1233ndash59

Van Kerckvoorde C (1992) An Introduction to MiddleDutch The Hague Mouton de Gruyter

Zavrel J and Daelemans W (1999) Recent Advances inMemory-Based Part-of-Speech Tagging In Actas del VI

Simposio Internacional de Comunicacion Social Centrode linguıstica Aplicada Santiago de Cuba pp 590ndash7

Zeiler MD (2012) Adadelta an adaptive learning ratemethod ArXiv 12125701v1

Zhang X Zhao J and LeCun Y (2015) Character-levelconvolutional networks for text classification NeuralInformation Processing Systems 28 sp

Notes1 An example of over-eager expansion might for instance

create confusion in the case of minimal pairs hoer andhaer are both allowed under the lemma lsquohaarrsquo (lsquoherrsquo)but applying the same vowel transition (lsquooersquo gt lsquoaersquo) tothe token maer under the lemma lsquomaarrsquo (lsquobutrsquo) wouldyield the token moer which is primarily attested underlemmas such as lsquomuurrsquo (lsquowallrsquo) lsquomoederrsquo (lsquomotherrsquo) orlsquomodderrsquo (lsquomudrsquo)

2 In computer vision convolutional layers are tradition-ally alternated with max-pooling layers although thesurplus value of max-pooling is increasingly ques-tioned In preliminary experiments max-pooling didnot yield a clear beneficial effect nor did the introduc-tion of stacked convolutions

3 httpscodegooglecomarchivepword2vec4 httpsradimrehurekcomgensim5 httpsgithubcommikekestemonttag6 httpkerasio7 The most comprehensive discussion of the annotation

guidelines has been provided in the Adelheid projectand can be consulted online httpadelheidruhostingnl (last accessed 6 November 2014)

8 Online at httpgtbinlnl (last accessed 6 November2014)

9 See httpswwwnederlabnlhome (last accessed 6November 2014)

10 Based on personal communication with the authors

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 19 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from


features used for the contextual disambiguation. Because of the experimental setup, using a separate train, development, and test split of the available data, our results are not directly comparable to previous studies on this topic, which typically used 10-fold cross-validation. This is a computationally intensive setup which is not realistic for our neural approach, given the considerable time required for training. Finally, our training data only covers ca. 90% of the training data that other systems have available in a 10-fold cross-validation setup without a separate development set. This means that our system can be expected to have a slightly worse lexical coverage, so that a minor drop in overall accuracy can be expected here.

The main results are reported in Tables 5 and 6, where we list the results for the train, development, and test splits for all four corpora. In Figure 2, we have added a visualization of the typical training progress of the neural network (for the training on the CG-LIT corpus), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.

Fig. 2 Visualization of the typical training progress of the neural network (CG-LIT), showing how the prediction loss steadily decreases over epochs, whereas the lemmatization accuracy gradually increases.
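Curves of this kind can be obtained directly from the training history. The snippet below is a minimal sketch, not the authors' code: the variables model, X_train, y_train, X_dev, and y_dev are assumed to be a compiled Keras classifier and prepared input/label arrays, and the history keys follow the modern tf.keras naming ("val_acc" in older Keras versions).

import matplotlib.pyplot as plt

# Hypothetical training run with a held-out development set.
history = model.fit(X_train, y_train,
                    validation_data=(X_dev, y_dev),
                    epochs=50, batch_size=50)

# Plot per-epoch prediction loss and development accuracy.
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_accuracy"], label="development accuracy")
plt.xlabel("epoch")
plt.legend()
plt.savefig("training_progress.png")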

Table 5. Training and development results (lemmatization accuracy in %, for all, known, and unknown words) of the full systems for each of the four data sets. Note that for the training data we cannot differentiate between known and unknown tokens.

Corpus           Train: All    Dev: All    Dev: Known    Dev: Unknown
RELIG               97.63        91.26        94.67          45.44
CG-LIT              96.88        91.78        94.56          51.45
CRM-ADELHEID        98.38        93.61        96.17          56.59
CG-ADMIN            98.89        95.44        97.89          59.48
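The known/unknown breakdown reported in Tables 5-8 can be computed with a few lines of code. The following is a minimal sketch (not the authors' evaluation script), assuming parallel lists of tokens, gold lemmas, and predicted lemmas, plus the set of token forms observed in the training split.

def lemmatization_report(tokens, gold, pred, train_vocab):
    """Overall, known-token, and unknown-token lemmatization accuracy (%)."""
    def acc(rows):
        rows = list(rows)
        return 100.0 * sum(g == p for _, g, p in rows) / max(len(rows), 1)
    data = list(zip(tokens, gold, pred))
    return {
        "all": acc(data),
        "known": acc(r for r in data if r[0] in train_vocab),
        "unknown": acc(r for r in data if r[0] not in train_vocab),
    }

# Toy example: one known token lemmatized correctly, one unknown token missed.
print(lemmatization_report(["ick", "seide"], ["ik", "zeggen"], ["ik", "zien"], {"ick"}))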


For the test results, we include the results for MBT. As can be seen, the algorithm obtains a reasonably good fit of the training data, with overall scores above 96%. Naturally, there is a considerable drop when applying the trained system to the development and test splits. Importantly, there is no large difference between the development and test scores, with the exception of the CG-ADMIN test split, which does seem to be much more difficult than the development split. Overall, the RELIG and CG-LIT appear to be the most difficult corpora – the former arguably because of its limited size, the latter because of the rich diversity in text varieties which it includes. Both charter corpora yield similar scores, with CRM-ADELHEID being slightly more difficult than CG-ADMIN. The scores for known word accuracies are fairly solid and always above 94% (in the case of CG-ADMIN even almost 98%). As can be expected, the unknown word accuracies are much lower, ranging from 45.55% for the smaller RELIG corpus to the encouraging 59.48% which we can report for the CG-ADMIN corpus. All in all, this suggests that for large enough corpora the present system can accurately lemmatize over half of the unknown words in unseen material.

As can be gleaned from Table 6, MBT yields a very strong baseline for the known words in the evaluation splits, and for this category the new system only marginally outperforms the baseline. It is of course especially for the unknown words that the present optimized system yields superior results, where the improvement spans dozens of percentage points. In comparison to other studies, the results appear to be strong. Kestemont et al. (2010) report a lemmatization accuracy of 45.52% for the unknown words in the CG-LIT corpus, which the present approach clearly improves upon (52.65%), notwithstanding the fact that it has access to less training data. Van Halteren and Rem (2013) use 10-fold cross-validation on the CRM-ADELHEID and report an overall lemmatization accuracy of 94.97%; the present system is just below that score (93.95%). Van Halteren and Rem (2013) do not report scores for known and unknown words separately, which makes it difficult to speculate about this issue, but apart from the validation procedure itself, the fact that the present system has less training data available might be responsible for this drop. Additionally, their tagger is in fact based on a meta-learner which combines the output of several independent taggers, an approach which will almost always outperform the performance of a single system.

The present article placed a lot of emphasis on the use of character-level convolutions as a replacement for the traditional one-hot encoding. For unknown tokens such an approach seems advantageous, but the danger of course exists that such a modelling strategy will lack the capacity to effectively model known words, for which a one-hot encoding does seem a viable approach. In Tables 7 and 8, we present the results of two versions of the system: the first version only includes a convolutional model of the focus tokens, whereas the second version adds a one-hot embedding of the focus tokens. Both versions use 5,000 filters but leave out any contextual features, to be able to zoom in on the convolutional aspects of the model only.

Table 6. Final test results (lemmatization accuracy in %, for all, known, and unknown words) of the full systems for each of the four data sets. We included the baseline offered by the Memory-Based Tagger (MBT), which is especially competitive in the case of known tokens.

                         MBT                           Our system
Corpus            All     Known    Unknown       All     Known    Unknown
RELIG            88.50    94.03     19.65       90.97    94.50     47.04
CG-LIT           88.72    93.93     15.64       91.67    94.45     52.65
CRM-ADELHEID     91.87    95.90     29.44       93.95    96.15     59.86
CG-ADMIN         88.35    95.03     22.30       90.91    95.35     46.99


The results are somewhat surprising, in that the first, convolutional-only version consistently outperforms the system that includes a one-hot embedding. Interestingly, the drop in performance across all corpora comes from the unknown words: it seems that the presence of the one-hot embeddings dominates the optimization process, so that much less information can be backpropagated to the convolutional subnet, leading to a much looser fit of the unknown vocabulary in the test data.
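To make the comparison in Tables 7 and 8 concrete, a minimal Keras-style sketch of the two focus-token subnets might look as follows. This is not the authors' released code: the layer sizes, layer names, and the use of the modern tf.keras API are illustrative assumptions (the experiments reported here used 5,000 filters and additional contextual inputs, which are omitted for brevity).

import tensorflow as tf
from tensorflow.keras import layers

MAX_LEN, N_CHARS, VOCAB, N_LEMMAS = 20, 60, 30000, 10000  # illustrative sizes

# Variant 1: character-level convolutions over the one-hot encoded focus token.
char_in = layers.Input(shape=(MAX_LEN, N_CHARS), name="focus_chars")
conv = layers.Conv1D(filters=500, kernel_size=3, activation="relu",
                     name="char_conv")(char_in)
conv = layers.Flatten()(conv)

# Variant 2 additionally embeds the focus token as an atomic (one-hot) symbol.
tok_in = layers.Input(shape=(1,), name="focus_token")
tok_emb = layers.Flatten()(layers.Embedding(VOCAB, 150)(tok_in))

joint = layers.concatenate([conv, tok_emb])   # drop tok_emb to obtain variant 1
hidden = layers.Dropout(0.5)(layers.Dense(1024, activation="relu")(joint))
lemma_out = layers.Dense(N_LEMMAS, activation="softmax")(hidden)

model = tf.keras.Model(inputs=[char_in, tok_in], outputs=lemma_out)
model.compile(optimizer="adadelta", loss="categorical_crossentropy",
              metrics=["accuracy"])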

8 Discussion

What does the presented architecture learn? To address this interpretative question, we offer two intuitive visualizations. In Figure 3, we plot the embeddings which are learned by the initial word2vec model (Mikolov et al., 2013), which is fed into the embedding layers of the networks before training commences (here for the CG-LIT corpus). We use the t-SNE algorithm (Van der Maaten and Hinton, 2008), which is commonly used to visualize such embeddings in two-dimensional scatterplots. We have performed a conventional agglomerative cluster analysis on top of the scatterplot, and we added colours to the word labels according to this cluster analysis as an aid in reading. This visualization strongly suggests that morpho-syntactic, orthographic, as well as semantic qualities of words are captured by the initial embeddings model. In purple, to the right, we find the names of the biblical evangelists (mattheus, marcus, ...), which clearly cluster together on the basis of semantic qualities. In light blue (top left), we find a semantic grouping of the synonyms seide and sprac ('[he] said').
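A visualization of this kind can be reproduced along the following lines with gensim and scikit-learn (cf. notes 3 and 4, and Pedregosa et al., 2011). The corpus variable, vector size, and number of clusters below are illustrative assumptions rather than the authors' exact settings.

from gensim.models import Word2Vec
from sklearn.manifold import TSNE
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# `sentences` is assumed: a list of tokenized (Middle Dutch) sentences.
w2v = Word2Vec(sentences, vector_size=150, window=5, min_count=5, sg=1)

words = w2v.wv.index_to_key[:300]            # a few hundred frequent words
X = w2v.wv[words]                            # their embedding vectors

coords = TSNE(n_components=2, perplexity=30).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=8).fit_predict(X)

plt.figure(figsize=(12, 12))
plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5)
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y), fontsize=7)      # label each point with its word
plt.savefig("embeddings_tsne.png")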

Table 7. Development results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second variant adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant. All scores are lemmatization accuracies in %.

                     Only convolutions              Convolutions + one-hot encoding
Corpus            All     Known    Unknown          All     Known    Unknown
RELIG            89.45    92.69     45.97          87.59    91.67     32.83
CG-LIT           88.46    91.20     48.79          82.86    87.42     16.86
CRM-ADELHEID     89.58    92.28     50.57          87.96    91.73     33.38
CG-ADMIN         91.85    94.45     53.76          89.62    93.28     36.03

Table 8. Test results for two variants of the system; both variants have a convolutional subnet (5,000 filters) but no contextual subnets. Importantly, the second variant adds an embeddings layer, which might allow a tighter fit of the known words than the convolutional-only variant. All scores are lemmatization accuracies in %.

                     Only convolutions              Convolutions + one-hot encoding
Corpus            All     Known    Unknown          All     Known    Unknown
RELIG            88.95    92.38     46.38          87.37    91.53     35.55
CG-LIT           88.25    91.03     49.21          82.46    87.11     17.28
CRM-ADELHEID     89.91    92.24     53.88          88.63    92.00     36.40
CG-ADMIN         86.54    90.91     43.39          83.27    88.81     28.46


Orthographic similarity is also captured, which is remarkable because the embeddings model does not have access to sub-word-level information: the spelling variants als and alse ('if') are, for instance, grouped. At the morpho-syntactic level, we see that the subordinating conjunctions want and maer cluster (light blue). Even collocational patterns are present (e.g. ict, wi, and las, which cluster together in green), a clustering which seems to stem from the collocation 'as I/we read'.

At the subword level, it is equally interesting to attempt to inspect the filters which have been learned at the character level. In Table 9, we offer a representation of a number of interesting filters which have been learned. For each position in the convolutional slots, we record the three characters which had the highest activation with respect to this specific filter. While this discussion is necessarily anecdotal, these results do suggest that the convolutional filters have indeed become sensitive to both morphemic and orthographic information. Filter 17, for instance, is clearly sensitive to the combination of the vowel i and a velar stop c, k, or ck. This filter would be useful to detect the frequent pronoun 'ick' ('I'), but might be equally useful for similar morphemes inside words (e.g. sticke, 'piece'). Filter 40, on the other hand, would be sensitive to the presence of these velar stops in combination with the front vowel a (potentially long aa). Here too, we clearly notice a certain flexibility in the filter as to the exact spelling of the velar stop or the length of the vowel. Filter 8, finally, does not seem sensitive to a particular consonant-vowel combination, but rather seems to pick up the mere presence of the labio-dental fricative f or v in the final slot of the filter, including the allographic spelling u for the latter consonant.

Fig. 3 Two-dimensional scatterplot created by the t-SNE algorithm, representing the embeddings learned by the word2vec pretrainer (for the CG-LIT corpus). These embeddings are eventually fed into the embedding layers of the network before training commences (so that they can be further optimized in the light of a given task). Morpho-syntactic, orthographic, and semantic qualities of words all appear to be captured by the model.


Here too, the filters allow these characters, to some extent, to be detected in various slots in the filters, thereby illustrating how the filters have obtained a certain translational invariance, as is typically achieved in image recognition.
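An inspection of this kind could be scripted roughly as follows. The sketch assumes the trained model from the earlier sketch (with a Conv1D layer named "char_conv") and a list alphabet mapping one-hot indices to characters; it ranks characters by their raw kernel weights per slot, which is a simpler proxy for the per-character activations reported in Table 9.

import numpy as np

# `model` and `alphabet` are assumed to exist (see the earlier sketch).
kernel, _bias = model.get_layer("char_conv").get_weights()  # kernel: (width, n_chars, n_filters)

def top_chars_per_slot(filter_idx, k=3):
    """For one filter, return the k highest-weighted characters in each slot."""
    w = kernel[:, :, filter_idx]
    report = []
    for pos in range(w.shape[0]):
        best = np.argsort(w[pos])[::-1][:k]
        report.append([(alphabet[i], round(float(w[pos, i]), 2)) for i in best])
    return report

for row in top_chars_per_slot(17):   # cf. the 'ick'-like pattern in Table 9
    print(row)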

9 Future Research

While our system yields excellent results across all data sets, a number of interesting issues remain to be addressed in future research. First of all, our architecture does not consider any form of recurrent neural networks, through which excellent results have recently been obtained in various tasks across NLP. Of course, it would be perfectly feasible to run e.g. a Long Short-Term Memory (LSTM) layer (cf. Hochreiter and Schmidhuber, 1997) over our convolutions or embeddings, instead of a traditional dense layer. Preliminary experiments on these specific data sets did not suggest that dramatic performance increases can be gained here (perhaps because of the limited length of the sequences considered), but further research is needed in this respect. Additionally, it might be worthwhile to combine LSTMs with an approach which does not only slide convolutions over the focus token, but also over the lexical context, thus making the system more robust towards unknown tokens in the focus token's context as well. Bidirectional layers are another interesting extension in this respect, since the present model only considers a straightforward left-to-right strategy; a sketch of such an extension is given below. Another possible extension would be to pre-train certain components of the model (e.g. embeddings or character convolutions) on larger, unannotated data sets (provided these are available). Further research should also overcome the limitation of the present system that it cannot predict new, unseen lemmas at test time. Here too, recurrent network architectures that are able to generate new strings on the basis of an input string will prove a valuable approach (LeCun et al., 2015).

One strong component of the system is that a single hyperparameterization yields strong results across a variety of corpora. This suggests that further performance increases can be obtained through more specific fine-tuning. Additionally, our system reaches high scores although it is only trained on 80% of the available data; other papers in this field have used 10-fold cross-validation and thus average over systems which have 90% of the training data available in each run. As previously mentioned, however, the different evaluation procedures make it difficult to compare these results directly. The good results obtained are also remarkable in light of the fact that our system does not have access to many of the conventional 'bells and whistles' which other taggers offer (such as removing highly uncommon tags for certain tokens). Generally speaking, one serious drawback is that training this kind of architecture is extremely time-consuming and is in reality only feasible on GPUs, which still come with serious memory limitations.
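The recurrent extension mentioned above might, under the same illustrative assumptions as the earlier sketches, replace the dense layer with a bidirectional LSTM over the convolutional features:

from tensorflow.keras import Model, layers

MAX_LEN, N_CHARS, N_LEMMAS = 20, 60, 10000        # illustrative sizes

char_in = layers.Input(shape=(MAX_LEN, N_CHARS))
conv = layers.Conv1D(256, 3, padding="same", activation="relu")(char_in)
# A bidirectional LSTM reads the convolved character sequence in both directions.
seq = layers.Bidirectional(layers.LSTM(128))(conv)
lemma_out = layers.Dense(N_LEMMAS, activation="softmax")(seq)

bilstm_model = Model(char_in, lemma_out)
bilstm_model.compile(optimizer="adadelta", loss="categorical_crossentropy")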

Table 9. Anecdotal representation of a number of interesting convolutional filters learned by the network after training on the CG-LIT corpus. For three example filters we show, for each slot, the three characters with the highest activation across the alphabet (activation scores between brackets). The patterns suggest that the network has become sensitive to both morphemic and orthographic features, but is flexible as to where these patterns should occur in the filter's slots.

Filter ID                                Pos   1st char    2nd char    3rd char
Filter 17 ('ick')                         1    i (1.08)    k (1.06)    c (0.81)
                                          2    k (1.71)    c (1.47)    i (0.67)
                                          3    k (3.46)    c (2.48)    i (1.17)
Filter 40 ('ak', 'ack', 'aak', 'aac')     1    a (0.69)    b (0.33)    l (0.30)
                                          2    a (0.61)    c (0.59)    m (0.56)
                                          3    a (1.18)    k (1.17)    c (0.90)
Filter 8 (u-v-f as consonant)             1    u (0.41)    a (0.29)    k (0.25)
                                          2    s (0.59)    v (0.48)    u (0.47)
                                          3    f (1.40)    v (1.33)    u (0.72)

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X used for this research. Additionally, the authors would like to thank Dr Sander Dieleman for his valuable feedback on earlier drafts of this article.


References

Baroni, M. and Lenci, A. (2010). Distributional memory: a general framework for corpus-based semantics. Computational Linguistics, 36(4): 673–721.

Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I. J., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. In Deep Learning and Unsupervised Feature Learning, NIPS 2012 Workshop, s.p.

Beaufort, R., Roekhaut, S., Cougnon, L., and Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Hajic, J., Carberry, S., Clark, S., and Nivre, J. (eds), Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Uppsala: Association for Computational Linguistics, pp. 770–9.

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(March): 1137–55.

Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8): 1798–1828.

Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edições Colibri, pp. 3–14.

Bouma, G. and Hermans, B. (2012). Syllabification of Middle Dutch. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edições Colibri, pp. 27–38.

Cadieu, C. F., Hong, H., Yamins, D. L. K., Pinto, N., Ardila, D., Solomon, E. A., Majaj, N. J., and DiCarlo, J. J. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12): e1003963.

Crystal, D. (2001). Language and the Internet. New York: Cambridge University Press.

Chrupala, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of the International Conference on Language Resources and Evaluation. Marrakech: European Language Resources Association, pp. 2362–7.

Chrupala, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Toutanova, K. and Wu, H. (eds), Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Baltimore, MD: Association for Computational Linguistics, pp. 680–6.

Daelemans, W., Groenewald, H. J., and Van Huyssteen, G. B. (2009). Prototype-based active learning for lemmatization. In Angelova, G., Bontcheva, K., Mitkov, R., and Nicolov, N. (eds), Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP 2009). Borovets: Association for Computational Linguistics, pp. 65–70.

De Pauw, G. and De Schryver, G. (2008). Improving the computational morphological analysis of a Swahili corpus for lexicographic purposes. Lexikos, 18: 303–18.

Dieleman, S. and Schrauwen, B. (2014). End-to-end learning for music audio. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). Florence: IEEE, pp. 6964–8.

Ernst-Gerlach, A. and Fuhr, N. (2006). Generating search term variants for text collections with historic spellings. In Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., and Yavlinsky, A. (eds), Advances in Information Retrieval, Vol. 3936. Berlin: Springer, pp. 49–60.

Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society, pp. 1–32.

Godin, F., Vandersmissen, B., Jalalvand, A., De Neve, W., and Van de Walle, R. (2014). Alleviating manual feature engineering for part-of-speech tagging of Twitter microposts using distributed word representations. In Proceedings of the Workshop on Modern Machine Learning and Natural Language Processing. Montreal, s.p.

Gysseling, M. (1977). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks I: Ambtelijke bescheiden. 9 vols. The Hague: Nijhoff.

Gysseling, M. (1980–1987). Corpus van Middelnederlandse teksten (tot en met het jaar 1300). Reeks II: Literaire teksten. 6 vols. The Hague: Nijhoff.

Harris, Z. (1954). Distributional structure. Word, 10(23): 146–62.

Hendrickx, I. and Marquilhas, R. (2011). From old texts to modern spellings: an experiment in automatic normalisation. Journal for Language Technology and Computational Linguistics, 26(2): 65–76.


Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8): 1735–80.

Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.

Kestemont, M., Daelemans, W., and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a 'lemma': headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D., and Henderson, D. (1990). Handwritten digit recognition with a back-propagation network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.

Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.

Ljubešić, N., Erjavec, T., and Fišer, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.

Lyras, D. P., Sgarbas, K. N., and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.

Mitankin, P., Gerjikov, S., and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.

Nair, V. and Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I., and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M., and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon: Edições Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H., and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W., and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: lemmatisation des mots en moyen français. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse lemmatization of the Dictionary of Middle Dutch (1885–1929) using pattern matching. In Kiefer, F., Kiss, G., and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent advances in memory-based part-of-speech tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes
1 An example of over-eager expansion might, for instance, create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').
2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions.
3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 19 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Page 13: Lemmatization for variation-rich languages using deep learningpdfs.semanticscholar.org/060c/e6591fc3b95cb7a89643f0d9ae88a9a5ad43.pdftask has long been considered solved for modern

results we include the results for MBT As can beseen the algorithm obtains a reasonably good fit ofthe training data with overall scores above 96Naturally there is a considerable drop when apply-ing the trained system to the development and testsplits Importantly there is no large difference be-tween the development and test scores with the ex-ception of the CG-ADMIN test split which does seemto be much more difficult than the developmentsplit Overall the RELIG and CG-LIT appear to be themost difficult corporamdashthe former arguably be-cause of its limited size the latter because of therich diversity in text varieties which it includesBoth charter corpora yield similar scores withCRM-ADELHEID being slightly more difficult than CG-ADMIN The scores for known word accuracies arefairly solid and always above 94 (in the case ofCG-ADMIN even almost 98) As can be expectedthe unknown word accuracies are much lower ran-ging from 4555 for the smaller RELIG corpus tothe encouraging 5948 which we can report for theCG-ADMIN corpus All in all this suggests that forlarge enough corpora the present system can accur-ately lemmatize over half of the unknown words inunseen material

As can be gleaned from Table 6 MBT yields avery strong baseline for the known words in theevaluation splits and for this category the newsystem only marginally outperforms the baselineIt is of course especially for the unknown wordsthat the present optimized system yields superiorresults where the improvement spans dozens ofpercentage points In comparison to other studiesthe results appear to be strong Kestemont et al

(2010) report a lemmatization of 4552 for theunknown words in the CG-LIT corpus which the pre-sent approach clearly improves upon (5265) not-withstanding the fact that it has access to lesstraining data Van Halteren and Rem (2013) use10-fold cross-validation on the CRM-ADELHEID andreport an overall lemmatization accuracy of9497 the present system is just below that score(9395) Van Halteren and Rem (2013) do notreport scores for known and unknown words sep-arately which makes it difficult to speculate aboutthis issue but apart from the validation procedureitself the fact that the present system has less train-ing data available might be responsible for this dropAdditionally their tagger is in fact based on a meta-learner which combines the output of several inde-pendent taggers an approach which will almostalways outperform the performance of a singlesystem

The present article placed a lot of emphasis onthe use of character-level convolutions as a replace-ment for the traditional one-hot encoding For un-known tokens such an approach seemsadvantageous but the danger of course exists thatsuch a modelling strategy will lack the capacity toeffectively model known words for which a one-hotencoding does seem a viable approach In Tables 7and 8 we present the results of two version of thesystem the first version only includes a convolu-tional model of the focus tokens whereas thesecond version adds a one-hot embedding of thefocus tokens Both versions use 5000 filters butleave out any contextual features to be able tozoom in on the convolutional aspects of the

Table 6 Final test results (lemmatization accuracy for all known and unknown words) of the full systems for each of

the four data sets We included the baseline offered by the Memory-Based Tagger which is especially competitive in the

case of known tokens

Corpusmethod MBT Our system

Subset All Known Unknown All Known Unknown

RELIG 8850 9403 1965 9097 9450 4704

CG-LIT 8872 9393 1564 9167 9445 5265

CRM-ADELHEID 9187 9590 2944 9395 9615 5986

CG-ADMIN 8835 9503 2230 9091 9535 4699

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 13 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

model only The results are somewhat surprising inthat it appears that the first convolutional-only ver-sion consistently outperforms the system that in-cludes a one-hot embedding Interestingly thedrop in performance across all corpora comesfrom the unknown words it seems that the presenceof the one-hot embeddings dominates the optimiza-tion process so that much less information is ableto be backpropagated to the convolutional subnetleading to a much looser fit of the unknown vo-cabulary in the test data

8 Discussion

What does the presented architecture learn To ad-dress this interpretative question we offer two in-tuitive visualizations In Figure 3 we plot theembeddings which are learned by the initial word2-vec model (Mikolov et al 2013) which is fed into

the embedding layers of the networks before train-ing commences (here for the CG-LIT corpus) We usethe t-SNE algorithm (Van der Maaten and Hinton2008) which is commonly used to visualize suchembeddings in two-dimensional scatterplots Wehave performed a conventional agglomerative clus-ter analysis on top of the scatterplot and we addedcolours to the word labels according to this clusteranalysis as an aid in reading This visualizationstrongly suggests that morpho-syntactic ortho-graphic as well as semantic qualities of wordsappear to be captured by the initial embeddingsmodel In purple to the right we find the namesof the biblical evangelists (mattheus marcus )which clearly cluster together on the basis of seman-tic qualities In light-blue (top-left) we find a se-mantic grouping of the synonyms seide and sprac(lsquo[he] saidrsquo) Orthographic similarity is also cap-tured which is remarkable because the embeddingsmodel does not have access to sub-word level

Table 7 Developments results for two variants of the system both variants have a convolutional subnet (5000 filters)

but no contextual subnets Importantly the second system adds an embeddings layer which might allow a tighter fit of

the known words than the convolutional-only variant

Corpusmethod Only convolutions (5000 filters) Convolutions (5000 filters) and a

one-hot encoding of the focus token

Subset All () Known () Unknown () All () Known () Unknown ()

RELIG 8945 9269 4597 8759 9167 3283

CG-LIT 8846 9120 4879 8286 8742 1686

CRM-ADELHEID 8958 9228 5057 8796 9173 3338

CG-ADMIN 9185 9445 5376 8962 9328 3603

Table 8 Test results for two variants of the system both variants have a convolutional subnet (5000 filters) but no

contextual subnets Importantly the second system adds an embeddings layer which might allow a tighter fit of the

known words than the convolutional-only variant

CorpusMethod Only convolutions (5000 filters) Convolutions (5000 filters) and a

one-hot encoding of the focus token

Subset All () Known () Unknown () All () Known () Unknown ()

RELIG 8895 9238 4638 8737 9153 3555

CG-LIT 8825 9103 4921 8246 8711 1728

CRM-ADELHEID 8991 9224 5388 8863 9200 3640

CG-ADMIN 8654 9091 4339 8327 8881 2846

M Kestemont et al

14 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

information the spelling variants als and alse (lsquoifrsquo)are for instance grouped At the morpho-syntacticlevel we see that the subordinating conjunctionswant and maer cluster (light-blue) Even colloca-tional patterns are present eg ict wi and laswhich cluster together in green) a clusteringwhich seems to stem from the collocation lsquoas Iwereadrsquo

At the subword level it is equally interesting toattempt to inspect the filters which have beenlearned at the character level In Table 9 we offera representation of a number of interesting filterswhich have been learned For each position in theconvolutional slots we record the three characterswhich had the highest activation with respect to thisspecific filter While this discussion is necessarilyanecdotal these results do suggest that the

convolutional filters have indeed become sensitiveto both morphemic and orthographic informationFilter 17 for instance is clearly sensitive to the com-bination of the vowel i and a velar stop c k or ckThis filter would be useful to detect the frequentpronoun lsquoickrsquo (lsquoIrsquo) but might be equally useful forsimilar morphemes inside words (eg sticke lsquopiecersquo)Filter 40 on the other hand would be sensitive tothe presence of these velar stops in combinationswith the front vowel a (potentially long aa) Heretoo we clearly notice a certain flexibility in the filteras to the exact spelling of the velar stop or the lengthof the vowel Filter 8 finally does not seem sensitiveto a particular consonant-vowel combination butrather seems to pick up the mere presence of thelabio-dental fricative f or v in the final slot of the fil-ter including the allographic spelling u for the latter

Fig 3 Two-dimensional scatterplot created by the t-SNE algorithm representing the embeddings learned by theword2vec pretrainer (for the CG-LIT corpus) These embeddings are eventually fed into the embedding layers of thenetwork before training commences (so that they can be further optimized in the light of a given task) Both morpho-syntactic orthographic and semantic qualities of words appear to captured by the model

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 15 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

consonant Here too the filters allow these charac-ters to some extent to be detected in various slots inthe filters thereby illustrating how the filters haveobtained a certain translational invariance as is typ-ically achieved in image recognition

9 Future Research

While our system yields excellent results across alldata sets a number of interesting issues remain tobe addressed in future research First of all ourarchitecture does not consider any form of recurrentneural networks through which recently excellentresults have been obtained in various tasks acrossNLP Of course it would be well feasible to runeg an Long Short-Term Memory (LSTM) layer(cf Hochreiter and Schmidhuber 1997) over ourconvolutions or embeddings instead of a traditionaldense layer Preliminary experiments on these spe-cific data sets did not suggest that dramatic per-formance increases can be gained here (perhapsbecause of the limited length of the sequences con-sidered here) but further research is needed in thisrespect Additionally it might be worthwhile tocombine LSTMs with an approach which does not

only slide convolutions over the focus token butalso over the lexical context thus making thesystem more robust towards unknown tokens inthe focus tokenrsquos context as well Bidirectionallayers are another interesting extension in this re-spect since the present model only considers astraightforward left-to-right strategy Another pos-sible extension would be to pre-train certain com-ponents of the model (eg embeddings or characterconvolutions) on larger unannotated data sets (pro-vided these are available) Further research shouldalso overcome the limitation of the present systemthat it cannot predict new unseen lemmas at testtime Here too recurrent network architectures thatare able to generate new strings on the basis of aninput string will prove a valuable approach (LeCunet al 2015)

One strong component of the system is that asingle hyperparameterization yields strong resultsacross a variety of corpora This suggests that fur-ther performance increases can be obtained throughmore specific fine-tuning Additionally our systemreaches high scores although it is only trained on80 of the available data other papers on this fieldhave used 10-fold cross-validation and thus averageover systems which have available 90 of the train-ing data in each run As previously mentioned how-ever the different evaluation procedures make itdifficult to compare these results directly Thegood results obtained are also remarkable in lightof the fact that our system does not have access tomany of the conventional lsquobells and whistlesrsquo whichother taggers offer (such as removing highly uncom-mon tags for certain tokens) Generally speakingone serious drawback is that training this kind ofarchitecture is extremely time-consuming and is inreality only feasible on GPUs which still come withserious memory limitations

AcknowledgementsWe gratefully acknowledge the support of NVIDIACorporation with the donation of the TITAN Xused for this research Additionally the authorswould like to thank Dr Sander Dieleman for hisvaluable feedback on earlier drafts of this article

Table 9 Anecdotal representation of a number of inter-

esting convolutional filters learned by the network after

training on the CG-LIT corpus For three example filters we

show for each slot the three characters with the highest

activation across the alphabet (activation scores shown

between brackets) The patterns suggest that the network

has become sensitive to both morphemic and ortho-

graphic features but is flexible as to where these patterns

should occur in the filterrsquos slots

Filter ID Pos 1st char 2nd char 3rd char

Filter 17 (lsquoickrsquo) 1 i (108) k (106) c (081)

2 k (171) c (147) i (067)

3 k (346) c (248) i (i17)

Filter 40

(lsquoakrsquo lsquoackrsquo

lsquoaakrsquo lsquoaacrsquo )

1 a (069) b (033) l (030)

2 a (061) c (059) m (056)

3 a (118) k (117) c (090)

Filter 8 (u-v-f as

consonant)

1 u (041) a (029) k (025)

2 s (059) v (048) u (047)

3 f (140) v (133) u (072)

M Kestemont et al

16 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

ReferencesBaroni M and Lenci A (2010) Distributional memory

a general framework for corpus-based semanticsComputational Linguistics 36(4) 673ndash721

Bastien F Lamblin P Pascanu R Bergstra JGoodfellow IJ Bergeron A Bouchard N andBengio Y (2012) Theano new features and speed im-provements In Deep Learning and Unsupervised FeatureLearning NIPS 2012 Workshop sp

Beaufort R Roekhaut S Cougnon L and Fairon C(2010) A Hybrid RuleModel-based Finite-stateFramework for Normalizing SMS Messages In HajicJ Sandra Carberry S Clark S and Nivre J (eds) InProceedings of the 48th Annual Meeting of theAssociation for Computational Linguistics UppsalaAssociation for Computational Linguistics pp 770ndash9

Bengio Y Ducharme R Vincent P and Janvin C(2003) A neural probabilistic language model Journalof Machine Learning Research 3(March) 1137ndash55

Bengio Y Courville A and Vincent P (2013)Representation learning a review and new perspectivesIEEE Transactions on Pattern Analysis and MachineIntelligence 35(8) 1798ndash1828

Bollmann M (2012) (Semi-)Automatic Normalizationof Historical Texts using Distance Measures and theNorma tool In Mambrini F Passarotti M andSporleder C (eds) In Proceedings of the SecondWorkshop on Annotation of Corpora for Research inthe Humanities Lisbon Portugal Edicoes Colibripp 3ndash14

Bouma G and Hermans B (2012) Syllabification ofMiddle Dutch In Mambrini F Passarotti M andSporleder C (eds) In Proceedings of the SecondWorkshop on Annotation of Corpora for Researchin the Humanities Lisbon Portugal Edicoes Colibripp 27ndash38

Cahieu C F Hong H Yamins D L K Pinto NArdila D Solomon E A Majaj N J and DiCarloJJ (2014) Deep neural networks rival the representa-tion of primate IT cortex for core visual objectrecognition PloS Computational Biology 10(12)e1003963

Crystal D (2001) Language and the Internet New YorkCambridge University Press

Chrupala G Dinu G and van Genabith J (2008)Learning morphology with Morfette In Proceedings ofthe International Conference on Language Resources andEvaluation 2010 Marrakech Morocco EuropeanLanguage Resources Association pp 2362ndash7

Chrupala G (2014) Normalizing Tweets with Edit

Scripts and Recurrent Neural Embeddings In

Toutanova K and Wu H (eds) In Proceedings of the

52nd Annual Meeting of the Association for

Computational Linguistics (Volume 2 Short Papers)

Baltimore Maryland Association for Computational

Linguistics pp 680ndash6 sp

Daelemans W Groenewald H J and Van Huyssteen

G B (2009) Prototype-based Active Learning for

Lemmatization In Angelova G Bontcheva K

Mitkov R and Nicolov N (eds) In Proceedings of

the International Conference Recent Advances in

Natural Language Processing (RANLP 2009) Borovets

Association for Computational Linguistics pp 65ndash70

De Pauw G and De Schryver G (2008) Improving the

computational morphological analysis of a Swahili

corpus for lexicographic purposes Lexikos 18 303ndash18

Dieleman S and Schrauwen B (2014) End-to-end

Learning for Music Audio In International Conference

on Acoustics Speech and Signal Processing (ICASSP)

Florence IEEE pp 6964ndash8

Ernst-Gerlach A and Fuhr N (2006) Generating

Search Term Variants for Text Collections with

Historic Spellings In Lalmas M MacFarlane A

Ruger S Tombros A Tsikrika T Yavlinsky A

(eds) Advances in Information Retrieval Vol 3936

Berlin Springer pp 49ndash60

Firth J R (1957) A Synopsis of Linguistic Theory 1930-

1955 In Studies in Linguistic Analysis Oxford

Philological Society pp 1ndash32

Godin F Vandersmissen B Jalalvand A De Neve W

and Van de Walle R (2014) Alleviating ManualFeature Engineering for Part-of-speech Tagging of

Twitter Microposts Using Distributed Word

Representations In Proceedings of the Workshop on

Modern Machine Learning and Natural Language

Processing Montreal 2015 sp

Gysseling M (1977) Corpus van Middelnederlandse tek-

sten (tot en met het jaar 1300) Reeks I Ambtelijke

bescheiden 9 vols Nijhoff The Hague

Gysseling M (1980-1987) Corpus van Middelnederlandse

teksten (tot en met het jaar 1300) Reeks II Literaire

teksten 6 vols Nijhoff The Hague

Harris Z (1954) Distributional structure Word 10(23)

146ndash62

Hendrickx I and Marquilhas R (2011) From old texts

to modern spellings an experiment in automatic nor-

malisation Journal for Language Technology and

Computational Linguistics 26(2) 65ndash76

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 17 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Hochreiter S and Schmidhuber J (1997) LongShort-Term Memory Neural Computation 9(8)1735ndash80

Institute for Dutch Lexicography (1998) Cd-romMiddelnederlands In Woordenboek en Teksten SduThe Hague

Jurafsky D and Martin J H (2009) An Introduction toNatural Language Processing Computational Linguisticsand Speech Recognition 2nd edn New Jersey PearsonPrentice Hall

Kalchbrenner N Grefenstette E and Blunsom P(2014) A Convolutional Neural Network forModelling Sentences In Proceedings of the 52ndAnnual Meeting of the Association for ComputationalLinguistics Baltimore MD ACL pp 655ndash5

Kestemont M Daelemans W and De Pauw G(2010) Weigh your wordsmdashmemory-based lemmatiza-tion for middle Dutch Literary and LinguisticComputing 25(3) 287ndash301

Knowles G and Mohd Don Z (2004) The notionof a lsquolsquolemmarsquorsquo Headwords roots and lexicalsets International Journal of Corpus Linguistics 9(1)69ndash81

LeCun Y Boser B Denker J S Howard R EHabbard W Jackel L D and Henderson D(1990) Handwritten Digit Recognition with a Back-propagation Network In Touretzky D S (ed)Advances in Neural Information Processing SystemsSan Francisco Morgan Kaufmann

LeCun Y Bengio Y and Hinton G (2015) Deeplearning Nature 521(7553) 436ndash44

Levenshtein V (1966) Binary codes capable of correct-ing deletions insertions and reversals Soviet PhysicsDoklady 10(9) 707ndash10

Levy O and Golberg Y (2014) Linguistic Regularitiesin Sparse and Explicit Word Representations InProceedings of the Eighteenth Conference onComputational Natural Language Learning BaltimoreACL pp 171ndash80

Ljubesic N Erjavec T and Fiser D (2014)Standardizing tweets with character-level machinetranslation In Gelbukh A (ed) ComputationalLinguistics and Intelligent Text Processing 8404 BerlinSpringer pp 164ndash75

Lyras D P Sgarbas K N and Fakotakis N D (2008)Applying similarity measures for automatic lemmatiza-tion a case-study for modern Greek and EnglishInternational Journal on Artificial Intelligence Tools17 1043ndash64

Manning C D Prabhaker P and Schutze H (2008)

Introduction to Information Retrieval New York

Cambridge University Press

Manning C D (2015) Computational linguistics and

deep learning Computational Linguistics 41(4) 701ndash7

Mikolov T Sutskever I Chen K Corrado G S and

Dean J (2013) Distributed representations of words

and phrases and their compositionality Neural

Information Processing Systems 26 3111ndash9

Mitankin P Gerjikov S and Mihov S (2014) An ap-

proach to unsupervised historical text normalisation In

Antonacopoulos A and Schulz K (eds) In Proceedings

of the First International Conference on Digital Access to

Textual Cultural Heritage New York Association for

Computing Machinery pp 29ndash34

Nair V and Hinton G (2010) Rectified Linear Units

Improve Restricted Boltzmann Machines In Furnkranz

J and Joachims T (eds) In Proceedings of the 27th

International Conference on Machine Learning

Madison WI Omnipress pp 807ndash14

Pedregosa F Varoquaux G Gramfort A Michel V

Thirion B Grisel O Blondel M Prettenhofer P

Weiss R Dubourg V Vanderplas J Passos A

Cournapeau D Brucher M Perrot M and

Duchesnay E (2011) Scikit-learn machine learning

in python Journal of Machine Learning Research 12

2825ndash30

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: a comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clerq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885–1929) Using Pattern Matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes
1 An example of over-eager expansion might, for instance, create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud').

2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions (see the illustrative sketch following these notes).

3 https://code.google.com/archive/p/word2vec
4 https://radimrehurek.com/gensim
5 https://github.com/mikekestemont/tag
6 http://keras.io
7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).
8 Online at http://gtb.inl.nl (last accessed 6 November 2014).
9 See https://www.nederlab.nl/home (last accessed 6 November 2014).
10 Based on personal communication with the authors.
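
Purely as an illustration of the convolution-only set-up touched on in note 2, the sketch below assembles a character-level convolutional subnet without max-pooling in Keras (the library referenced in note 6). All names and sizes used here (MAX_LEN, N_CHARS, N_LEMMAS, the number of filters and the dense-layer width) are assumptions made for the sake of the example, not the configuration reported in this article.

# Minimal illustrative sketch: a character-level convolutional subnet without
# max-pooling (cf. note 2). All dimensions below are assumed example values.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN = 15      # assumed maximum token length in characters
N_CHARS = 40      # assumed size of the character vocabulary
N_LEMMAS = 1000   # assumed number of lemma classes

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN, N_CHARS)),                 # one-hot encoded characters
    layers.Conv1D(100, kernel_size=3, activation="relu"),  # temporal convolution over characters
    layers.Flatten(),                                       # no max-pooling layer (cf. note 2)
    layers.Dense(256, activation="relu"),
    layers.Dense(N_LEMMAS, activation="softmax"),           # one output class per known lemma
])
model.compile(optimizer="adadelta", loss="categorical_crossentropy")

# Dummy forward pass on a single one-hot encoded token, to show the shapes only.
x = np.zeros((1, MAX_LEN, N_CHARS), dtype="float32")
print(model.predict(x).shape)  # (1, N_LEMMAS)

Training such a subnet would simply pair one-hot character matrices with lemma labels in model.fit.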



Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 19 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Page 15: Lemmatization for variation-rich languages using deep learningpdfs.semanticscholar.org/060c/e6591fc3b95cb7a89643f0d9ae88a9a5ad43.pdftask has long been considered solved for modern

information the spelling variants als and alse (lsquoifrsquo)are for instance grouped At the morpho-syntacticlevel we see that the subordinating conjunctionswant and maer cluster (light-blue) Even colloca-tional patterns are present eg ict wi and laswhich cluster together in green) a clusteringwhich seems to stem from the collocation lsquoas Iwereadrsquo

At the subword level it is equally interesting toattempt to inspect the filters which have beenlearned at the character level In Table 9 we offera representation of a number of interesting filterswhich have been learned For each position in theconvolutional slots we record the three characterswhich had the highest activation with respect to thisspecific filter While this discussion is necessarilyanecdotal these results do suggest that the

convolutional filters have indeed become sensitiveto both morphemic and orthographic informationFilter 17 for instance is clearly sensitive to the com-bination of the vowel i and a velar stop c k or ckThis filter would be useful to detect the frequentpronoun lsquoickrsquo (lsquoIrsquo) but might be equally useful forsimilar morphemes inside words (eg sticke lsquopiecersquo)Filter 40 on the other hand would be sensitive tothe presence of these velar stops in combinationswith the front vowel a (potentially long aa) Heretoo we clearly notice a certain flexibility in the filteras to the exact spelling of the velar stop or the lengthof the vowel Filter 8 finally does not seem sensitiveto a particular consonant-vowel combination butrather seems to pick up the mere presence of thelabio-dental fricative f or v in the final slot of the fil-ter including the allographic spelling u for the latter

Fig 3 Two-dimensional scatterplot created by the t-SNE algorithm representing the embeddings learned by theword2vec pretrainer (for the CG-LIT corpus) These embeddings are eventually fed into the embedding layers of thenetwork before training commences (so that they can be further optimized in the light of a given task) Both morpho-syntactic orthographic and semantic qualities of words appear to captured by the model

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 15 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

consonant Here too the filters allow these charac-ters to some extent to be detected in various slots inthe filters thereby illustrating how the filters haveobtained a certain translational invariance as is typ-ically achieved in image recognition

9 Future Research

While our system yields excellent results across alldata sets a number of interesting issues remain tobe addressed in future research First of all ourarchitecture does not consider any form of recurrentneural networks through which recently excellentresults have been obtained in various tasks acrossNLP Of course it would be well feasible to runeg an Long Short-Term Memory (LSTM) layer(cf Hochreiter and Schmidhuber 1997) over ourconvolutions or embeddings instead of a traditionaldense layer Preliminary experiments on these spe-cific data sets did not suggest that dramatic per-formance increases can be gained here (perhapsbecause of the limited length of the sequences con-sidered here) but further research is needed in thisrespect Additionally it might be worthwhile tocombine LSTMs with an approach which does not

only slide convolutions over the focus token butalso over the lexical context thus making thesystem more robust towards unknown tokens inthe focus tokenrsquos context as well Bidirectionallayers are another interesting extension in this re-spect since the present model only considers astraightforward left-to-right strategy Another pos-sible extension would be to pre-train certain com-ponents of the model (eg embeddings or characterconvolutions) on larger unannotated data sets (pro-vided these are available) Further research shouldalso overcome the limitation of the present systemthat it cannot predict new unseen lemmas at testtime Here too recurrent network architectures thatare able to generate new strings on the basis of aninput string will prove a valuable approach (LeCunet al 2015)

One strong component of the system is that asingle hyperparameterization yields strong resultsacross a variety of corpora This suggests that fur-ther performance increases can be obtained throughmore specific fine-tuning Additionally our systemreaches high scores although it is only trained on80 of the available data other papers on this fieldhave used 10-fold cross-validation and thus averageover systems which have available 90 of the train-ing data in each run As previously mentioned how-ever the different evaluation procedures make itdifficult to compare these results directly Thegood results obtained are also remarkable in lightof the fact that our system does not have access tomany of the conventional lsquobells and whistlesrsquo whichother taggers offer (such as removing highly uncom-mon tags for certain tokens) Generally speakingone serious drawback is that training this kind ofarchitecture is extremely time-consuming and is inreality only feasible on GPUs which still come withserious memory limitations

AcknowledgementsWe gratefully acknowledge the support of NVIDIACorporation with the donation of the TITAN Xused for this research Additionally the authorswould like to thank Dr Sander Dieleman for hisvaluable feedback on earlier drafts of this article

Table 9 Anecdotal representation of a number of inter-

esting convolutional filters learned by the network after

training on the CG-LIT corpus For three example filters we

show for each slot the three characters with the highest

activation across the alphabet (activation scores shown

between brackets) The patterns suggest that the network

has become sensitive to both morphemic and ortho-

graphic features but is flexible as to where these patterns

should occur in the filterrsquos slots

Filter ID Pos 1st char 2nd char 3rd char

Filter 17 (lsquoickrsquo) 1 i (108) k (106) c (081)

2 k (171) c (147) i (067)

3 k (346) c (248) i (i17)

Filter 40

(lsquoakrsquo lsquoackrsquo

lsquoaakrsquo lsquoaacrsquo )

1 a (069) b (033) l (030)

2 a (061) c (059) m (056)

3 a (118) k (117) c (090)

Filter 8 (u-v-f as

consonant)

1 u (041) a (029) k (025)

2 s (059) v (048) u (047)

3 f (140) v (133) u (072)

M Kestemont et al

16 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

ReferencesBaroni M and Lenci A (2010) Distributional memory

a general framework for corpus-based semanticsComputational Linguistics 36(4) 673ndash721

Bastien F Lamblin P Pascanu R Bergstra JGoodfellow IJ Bergeron A Bouchard N andBengio Y (2012) Theano new features and speed im-provements In Deep Learning and Unsupervised FeatureLearning NIPS 2012 Workshop sp

Beaufort R Roekhaut S Cougnon L and Fairon C(2010) A Hybrid RuleModel-based Finite-stateFramework for Normalizing SMS Messages In HajicJ Sandra Carberry S Clark S and Nivre J (eds) InProceedings of the 48th Annual Meeting of theAssociation for Computational Linguistics UppsalaAssociation for Computational Linguistics pp 770ndash9

Bengio Y Ducharme R Vincent P and Janvin C(2003) A neural probabilistic language model Journalof Machine Learning Research 3(March) 1137ndash55

Bengio Y Courville A and Vincent P (2013)Representation learning a review and new perspectivesIEEE Transactions on Pattern Analysis and MachineIntelligence 35(8) 1798ndash1828

Bollmann M (2012) (Semi-)Automatic Normalizationof Historical Texts using Distance Measures and theNorma tool In Mambrini F Passarotti M andSporleder C (eds) In Proceedings of the SecondWorkshop on Annotation of Corpora for Research inthe Humanities Lisbon Portugal Edicoes Colibripp 3ndash14

Bouma G and Hermans B (2012) Syllabification ofMiddle Dutch In Mambrini F Passarotti M andSporleder C (eds) In Proceedings of the SecondWorkshop on Annotation of Corpora for Researchin the Humanities Lisbon Portugal Edicoes Colibripp 27ndash38

Cahieu C F Hong H Yamins D L K Pinto NArdila D Solomon E A Majaj N J and DiCarloJJ (2014) Deep neural networks rival the representa-tion of primate IT cortex for core visual objectrecognition PloS Computational Biology 10(12)e1003963

Crystal D (2001) Language and the Internet New YorkCambridge University Press

Chrupala G Dinu G and van Genabith J (2008)Learning morphology with Morfette In Proceedings ofthe International Conference on Language Resources andEvaluation 2010 Marrakech Morocco EuropeanLanguage Resources Association pp 2362ndash7

Chrupala G (2014) Normalizing Tweets with Edit

Scripts and Recurrent Neural Embeddings In

Toutanova K and Wu H (eds) In Proceedings of the

52nd Annual Meeting of the Association for

Computational Linguistics (Volume 2 Short Papers)

Baltimore Maryland Association for Computational

Linguistics pp 680ndash6 sp

Daelemans W Groenewald H J and Van Huyssteen

G B (2009) Prototype-based Active Learning for

Lemmatization In Angelova G Bontcheva K

Mitkov R and Nicolov N (eds) In Proceedings of

the International Conference Recent Advances in

Natural Language Processing (RANLP 2009) Borovets

Association for Computational Linguistics pp 65ndash70

De Pauw G and De Schryver G (2008) Improving the

computational morphological analysis of a Swahili

corpus for lexicographic purposes Lexikos 18 303ndash18

Dieleman S and Schrauwen B (2014) End-to-end

Learning for Music Audio In International Conference

on Acoustics Speech and Signal Processing (ICASSP)

Florence IEEE pp 6964ndash8

Ernst-Gerlach A and Fuhr N (2006) Generating

Search Term Variants for Text Collections with

Historic Spellings In Lalmas M MacFarlane A

Ruger S Tombros A Tsikrika T Yavlinsky A

(eds) Advances in Information Retrieval Vol 3936

Berlin Springer pp 49ndash60

Firth J R (1957) A Synopsis of Linguistic Theory 1930-

1955 In Studies in Linguistic Analysis Oxford

Philological Society pp 1ndash32

Godin F Vandersmissen B Jalalvand A De Neve W

and Van de Walle R (2014) Alleviating ManualFeature Engineering for Part-of-speech Tagging of

Twitter Microposts Using Distributed Word

Representations In Proceedings of the Workshop on

Modern Machine Learning and Natural Language

Processing Montreal 2015 sp

Gysseling M (1977) Corpus van Middelnederlandse tek-

sten (tot en met het jaar 1300) Reeks I Ambtelijke

bescheiden 9 vols Nijhoff The Hague

Gysseling M (1980-1987) Corpus van Middelnederlandse

teksten (tot en met het jaar 1300) Reeks II Literaire

teksten 6 vols Nijhoff The Hague

Harris Z (1954) Distributional structure Word 10(23)

146ndash62

Hendrickx I and Marquilhas R (2011) From old texts

to modern spellings an experiment in automatic nor-

malisation Journal for Language Technology and

Computational Linguistics 26(2) 65ndash76

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 17 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Hochreiter S and Schmidhuber J (1997) LongShort-Term Memory Neural Computation 9(8)1735ndash80

Institute for Dutch Lexicography (1998) Cd-romMiddelnederlands In Woordenboek en Teksten SduThe Hague

Jurafsky D and Martin J H (2009) An Introduction toNatural Language Processing Computational Linguisticsand Speech Recognition 2nd edn New Jersey PearsonPrentice Hall

Kalchbrenner N Grefenstette E and Blunsom P(2014) A Convolutional Neural Network forModelling Sentences In Proceedings of the 52ndAnnual Meeting of the Association for ComputationalLinguistics Baltimore MD ACL pp 655ndash5

Kestemont M Daelemans W and De Pauw G(2010) Weigh your wordsmdashmemory-based lemmatiza-tion for middle Dutch Literary and LinguisticComputing 25(3) 287ndash301

Knowles G and Mohd Don Z (2004) The notionof a lsquolsquolemmarsquorsquo Headwords roots and lexicalsets International Journal of Corpus Linguistics 9(1)69ndash81

LeCun Y Boser B Denker J S Howard R EHabbard W Jackel L D and Henderson D(1990) Handwritten Digit Recognition with a Back-propagation Network In Touretzky D S (ed)Advances in Neural Information Processing SystemsSan Francisco Morgan Kaufmann

LeCun Y Bengio Y and Hinton G (2015) Deeplearning Nature 521(7553) 436ndash44

Levenshtein V (1966) Binary codes capable of correct-ing deletions insertions and reversals Soviet PhysicsDoklady 10(9) 707ndash10

Levy O and Golberg Y (2014) Linguistic Regularitiesin Sparse and Explicit Word Representations InProceedings of the Eighteenth Conference onComputational Natural Language Learning BaltimoreACL pp 171ndash80

Ljubesic N Erjavec T and Fiser D (2014)Standardizing tweets with character-level machinetranslation In Gelbukh A (ed) ComputationalLinguistics and Intelligent Text Processing 8404 BerlinSpringer pp 164ndash75

Lyras D P Sgarbas K N and Fakotakis N D (2008)Applying similarity measures for automatic lemmatiza-tion a case-study for modern Greek and EnglishInternational Journal on Artificial Intelligence Tools17 1043ndash64

Manning C D Prabhaker P and Schutze H (2008)

Introduction to Information Retrieval New York

Cambridge University Press

Manning C D (2015) Computational linguistics and

deep learning Computational Linguistics 41(4) 701ndash7

Mikolov T Sutskever I Chen K Corrado G S and

Dean J (2013) Distributed representations of words

and phrases and their compositionality Neural

Information Processing Systems 26 3111ndash9

Mitankin P Gerjikov S and Mihov S (2014) An ap-

proach to unsupervised historical text normalisation In

Antonacopoulos A and Schulz K (eds) In Proceedings

of the First International Conference on Digital Access to

Textual Cultural Heritage New York Association for

Computing Machinery pp 29ndash34

Nair V and Hinton G (2010) Rectified Linear Units

Improve Restricted Boltzmann Machines In Furnkranz

J and Joachims T (eds) In Proceedings of the 27th

International Conference on Machine Learning

Madison WI Omnipress pp 807ndash14

Pedregosa F Varoquaux G Gramfort A Michel V

Thirion B Grisel O Blondel M Prettenhofer P

Weiss R Dubourg V Vanderplas J Passos A

Cournapeau D Brucher M Perrot M and

Duchesnay E (2011) Scikit-learn machine learning

in python Journal of Machine Learning Research 12

2825ndash30

Piotrowski M (2012) Natural Language Processing for

Historical Texts California Morgan amp Claypool

Publishers

Reynaert M Hendrickx I and Marquilhas R (2012)

Historical spelling normalization A comparison of two

statistical methods TICCL and VARD2 In Mambrini

F Passarotti M and Sporleder C (eds) In Proceedings

of the Second Workshop on Annotation of Corpora for

Research in the Humanities Lisbon Portugal Edicoes

Colibri pp 87ndash98

Scherrer Y and Erjavec T (2013) Modernizing histor-

ical Slovene words with character-based SMT In

Piskorski J Pivovarova L Tanev H and Yangarber

R (eds) In Proceedings of the 4th Biennial Workshop on

Balto-Slavic Natural Language Processing Sofia

Bulgaria Association for Computational Linguistics

pp 58ndash62

Schulz S De Pauw G De Clerq O Desmet B Hoste V

Daelemans W and Macken L (2016) Multi-modular

text normalization of dutch user-generated content

ACM Transactions on Intelligent Systems and Technology

forthcoming NewYork NY USA ACM 7(4)1ndash22

M Kestemont et al

18 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Souvay G and Pierrel J-M (2009) LGeRMLemmatisation des Mots en Moyen FrancaisTraitement Automatique des Langues 50(2) 149ndash72

Srivastava N Hinton G Krizhevsky A Sutskever Iand Salakhutdinov R (2014) Dropout a simple wayto prevent neural networks from overfitting Journal ofMachine Learning Research 15 1929ndash58

Toutanova K Klein D Manning C and Singer Y(2003) Feature-rich Part-of-speech Tagging witha Cyclic Dependency Network In Proceedingsof the 2003 Conference of the North AmericanChapter of the Association for ComputationalLinguistics on Human Language Technology Vol 1Stroudsburg Association for ComputationalLinguistics pp 173ndash80

Van Dalen-Oskam K and Van Zundert J (2007) Deltafor Middle Dutch ndash author and copyist distinction inWalewein Literary and Linguistic Computing 22(3)345ndash62

Van der Maaten L and Hinton G (2008) Visualizinghigh-dimensional data using t-SNE Journal of MachineLearning Research 92579ndash605

Van der Voort van der Kleij J (2005) ReverseLemmatization of the Dictionary of Middle Dutch(1885-1929) Using Pattern Matching In Kiefer FKiss G and Pajzs J (eds) Papers in ComputationalLexicography Budapest Hungarian Academy ofSciences pp 203ndash10

Van Halteren H and Rem M (2013) Dealing withorthographic variation in a tagger-lemmatizer for four-teenth century Dutch charters Language Resources andEvaluation 47(4) 1233ndash59

Van Kerckvoorde C (1992) An Introduction to MiddleDutch The Hague Mouton de Gruyter

Zavrel J and Daelemans W (1999) Recent Advances inMemory-Based Part-of-Speech Tagging In Actas del VI

Simposio Internacional de Comunicacion Social Centrode linguıstica Aplicada Santiago de Cuba pp 590ndash7

Zeiler MD (2012) Adadelta an adaptive learning ratemethod ArXiv 12125701v1

Zhang X Zhao J and LeCun Y (2015) Character-levelconvolutional networks for text classification NeuralInformation Processing Systems 28 sp

Notes1 An example of over-eager expansion might for instance

create confusion in the case of minimal pairs hoer andhaer are both allowed under the lemma lsquohaarrsquo (lsquoherrsquo)but applying the same vowel transition (lsquooersquo gt lsquoaersquo) tothe token maer under the lemma lsquomaarrsquo (lsquobutrsquo) wouldyield the token moer which is primarily attested underlemmas such as lsquomuurrsquo (lsquowallrsquo) lsquomoederrsquo (lsquomotherrsquo) orlsquomodderrsquo (lsquomudrsquo)

2 In computer vision convolutional layers are tradition-ally alternated with max-pooling layers although thesurplus value of max-pooling is increasingly ques-tioned In preliminary experiments max-pooling didnot yield a clear beneficial effect nor did the introduc-tion of stacked convolutions

3 httpscodegooglecomarchivepword2vec4 httpsradimrehurekcomgensim5 httpsgithubcommikekestemonttag6 httpkerasio7 The most comprehensive discussion of the annotation

guidelines has been provided in the Adelheid projectand can be consulted online httpadelheidruhostingnl (last accessed 6 November 2014)

8 Online at httpgtbinlnl (last accessed 6 November2014)

9 See httpswwwnederlabnlhome (last accessed 6November 2014)

10 Based on personal communication with the authors

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 19 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Page 16: Lemmatization for variation-rich languages using deep learningpdfs.semanticscholar.org/060c/e6591fc3b95cb7a89643f0d9ae88a9a5ad43.pdftask has long been considered solved for modern

consonant Here too the filters allow these charac-ters to some extent to be detected in various slots inthe filters thereby illustrating how the filters haveobtained a certain translational invariance as is typ-ically achieved in image recognition

9 Future Research

While our system yields excellent results across alldata sets a number of interesting issues remain tobe addressed in future research First of all ourarchitecture does not consider any form of recurrentneural networks through which recently excellentresults have been obtained in various tasks acrossNLP Of course it would be well feasible to runeg an Long Short-Term Memory (LSTM) layer(cf Hochreiter and Schmidhuber 1997) over ourconvolutions or embeddings instead of a traditionaldense layer Preliminary experiments on these spe-cific data sets did not suggest that dramatic per-formance increases can be gained here (perhapsbecause of the limited length of the sequences con-sidered here) but further research is needed in thisrespect Additionally it might be worthwhile tocombine LSTMs with an approach which does not

only slide convolutions over the focus token butalso over the lexical context thus making thesystem more robust towards unknown tokens inthe focus tokenrsquos context as well Bidirectionallayers are another interesting extension in this re-spect since the present model only considers astraightforward left-to-right strategy Another pos-sible extension would be to pre-train certain com-ponents of the model (eg embeddings or characterconvolutions) on larger unannotated data sets (pro-vided these are available) Further research shouldalso overcome the limitation of the present systemthat it cannot predict new unseen lemmas at testtime Here too recurrent network architectures thatare able to generate new strings on the basis of aninput string will prove a valuable approach (LeCunet al 2015)

One strong component of the system is that asingle hyperparameterization yields strong resultsacross a variety of corpora This suggests that fur-ther performance increases can be obtained throughmore specific fine-tuning Additionally our systemreaches high scores although it is only trained on80 of the available data other papers on this fieldhave used 10-fold cross-validation and thus averageover systems which have available 90 of the train-ing data in each run As previously mentioned how-ever the different evaluation procedures make itdifficult to compare these results directly Thegood results obtained are also remarkable in lightof the fact that our system does not have access tomany of the conventional lsquobells and whistlesrsquo whichother taggers offer (such as removing highly uncom-mon tags for certain tokens) Generally speakingone serious drawback is that training this kind ofarchitecture is extremely time-consuming and is inreality only feasible on GPUs which still come withserious memory limitations

AcknowledgementsWe gratefully acknowledge the support of NVIDIACorporation with the donation of the TITAN Xused for this research Additionally the authorswould like to thank Dr Sander Dieleman for hisvaluable feedback on earlier drafts of this article

Table 9 Anecdotal representation of a number of inter-

esting convolutional filters learned by the network after

training on the CG-LIT corpus For three example filters we

show for each slot the three characters with the highest

activation across the alphabet (activation scores shown

between brackets) The patterns suggest that the network

has become sensitive to both morphemic and ortho-

graphic features but is flexible as to where these patterns

should occur in the filterrsquos slots

Filter ID Pos 1st char 2nd char 3rd char

Filter 17 (lsquoickrsquo) 1 i (108) k (106) c (081)

2 k (171) c (147) i (067)

3 k (346) c (248) i (i17)

Filter 40

(lsquoakrsquo lsquoackrsquo

lsquoaakrsquo lsquoaacrsquo )

1 a (069) b (033) l (030)

2 a (061) c (059) m (056)

3 a (118) k (117) c (090)

Filter 8 (u-v-f as

consonant)

1 u (041) a (029) k (025)

2 s (059) v (048) u (047)

3 f (140) v (133) u (072)

M Kestemont et al

16 of 19 Digital Scholarship in the Humanities 2016

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

ReferencesBaroni M and Lenci A (2010) Distributional memory

a general framework for corpus-based semanticsComputational Linguistics 36(4) 673ndash721

Bastien F Lamblin P Pascanu R Bergstra JGoodfellow IJ Bergeron A Bouchard N andBengio Y (2012) Theano new features and speed im-provements In Deep Learning and Unsupervised FeatureLearning NIPS 2012 Workshop sp

Beaufort R Roekhaut S Cougnon L and Fairon C(2010) A Hybrid RuleModel-based Finite-stateFramework for Normalizing SMS Messages In HajicJ Sandra Carberry S Clark S and Nivre J (eds) InProceedings of the 48th Annual Meeting of theAssociation for Computational Linguistics UppsalaAssociation for Computational Linguistics pp 770ndash9

Bengio Y Ducharme R Vincent P and Janvin C(2003) A neural probabilistic language model Journalof Machine Learning Research 3(March) 1137ndash55

Bengio Y Courville A and Vincent P (2013)Representation learning a review and new perspectivesIEEE Transactions on Pattern Analysis and MachineIntelligence 35(8) 1798ndash1828

Bollmann M (2012) (Semi-)Automatic Normalizationof Historical Texts using Distance Measures and theNorma tool In Mambrini F Passarotti M andSporleder C (eds) In Proceedings of the SecondWorkshop on Annotation of Corpora for Research inthe Humanities Lisbon Portugal Edicoes Colibripp 3ndash14

Bouma G and Hermans B (2012) Syllabification ofMiddle Dutch In Mambrini F Passarotti M andSporleder C (eds) In Proceedings of the SecondWorkshop on Annotation of Corpora for Researchin the Humanities Lisbon Portugal Edicoes Colibripp 27ndash38

Cahieu C F Hong H Yamins D L K Pinto NArdila D Solomon E A Majaj N J and DiCarloJJ (2014) Deep neural networks rival the representa-tion of primate IT cortex for core visual objectrecognition PloS Computational Biology 10(12)e1003963

Crystal D (2001) Language and the Internet New YorkCambridge University Press

Chrupala G Dinu G and van Genabith J (2008)Learning morphology with Morfette In Proceedings ofthe International Conference on Language Resources andEvaluation 2010 Marrakech Morocco EuropeanLanguage Resources Association pp 2362ndash7

Chrupala G (2014) Normalizing Tweets with Edit

Scripts and Recurrent Neural Embeddings In

Toutanova K and Wu H (eds) In Proceedings of the

52nd Annual Meeting of the Association for

Computational Linguistics (Volume 2 Short Papers)

Baltimore Maryland Association for Computational

Linguistics pp 680ndash6 sp

Daelemans W Groenewald H J and Van Huyssteen

G B (2009) Prototype-based Active Learning for

Lemmatization In Angelova G Bontcheva K

Mitkov R and Nicolov N (eds) In Proceedings of

the International Conference Recent Advances in

Natural Language Processing (RANLP 2009) Borovets

Association for Computational Linguistics pp 65ndash70

De Pauw G and De Schryver G (2008) Improving the

computational morphological analysis of a Swahili

corpus for lexicographic purposes Lexikos 18 303ndash18

Dieleman S and Schrauwen B (2014) End-to-end

Learning for Music Audio In International Conference

on Acoustics Speech and Signal Processing (ICASSP)

Florence IEEE pp 6964ndash8

Ernst-Gerlach A and Fuhr N (2006) Generating

Search Term Variants for Text Collections with

Historic Spellings In Lalmas M MacFarlane A

Ruger S Tombros A Tsikrika T Yavlinsky A

(eds) Advances in Information Retrieval Vol 3936

Berlin Springer pp 49ndash60

Firth J R (1957) A Synopsis of Linguistic Theory 1930-

1955 In Studies in Linguistic Analysis Oxford

Philological Society pp 1ndash32

Godin F Vandersmissen B Jalalvand A De Neve W

and Van de Walle R (2014) Alleviating ManualFeature Engineering for Part-of-speech Tagging of

Twitter Microposts Using Distributed Word

Representations In Proceedings of the Workshop on

Modern Machine Learning and Natural Language

Processing Montreal 2015 sp

Gysseling M (1977) Corpus van Middelnederlandse tek-

sten (tot en met het jaar 1300) Reeks I Ambtelijke

bescheiden 9 vols Nijhoff The Hague

Gysseling M (1980-1987) Corpus van Middelnederlandse

teksten (tot en met het jaar 1300) Reeks II Literaire

teksten 6 vols Nijhoff The Hague

Harris Z (1954) Distributional structure Word 10(23)

146ndash62

Hendrickx I and Marquilhas R (2011) From old texts

to modern spellings an experiment in automatic nor-

malisation Journal for Language Technology and

Computational Linguistics 26(2) 65ndash76

Lemmatization for variation-rich languages

Digital Scholarship in the Humanities 2016 17 of 19

by guest on January 3 2017httpdshoxfordjournalsorg

Dow

nloaded from

Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8): 1735–80.

Institute for Dutch Lexicography (1998). Cd-rom Middelnederlands: Woordenboek en Teksten. The Hague: Sdu.

Jurafsky, D. and Martin, J. H. (2009). An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, 2nd edn. New Jersey: Pearson Prentice Hall.

Kalchbrenner, N., Grefenstette, E. and Blunsom, P. (2014). A Convolutional Neural Network for Modelling Sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD: ACL, pp. 655–65.

Kestemont, M., Daelemans, W. and De Pauw, G. (2010). Weigh your words – memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3): 287–301.

Knowles, G. and Mohd Don, Z. (2004). The notion of a "lemma": Headwords, roots and lexical sets. International Journal of Corpus Linguistics, 9(1): 69–81.

LeCun, Y., Boser, B., Denker, J. S., Howard, R. E., Habbard, W., Jackel, L. D. and Henderson, D. (1990). Handwritten Digit Recognition with a Back-propagation Network. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems. San Francisco: Morgan Kaufmann.

LeCun, Y., Bengio, Y. and Hinton, G. (2015). Deep learning. Nature, 521(7553): 436–44.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(9): 707–10.

Levy, O. and Goldberg, Y. (2014). Linguistic Regularities in Sparse and Explicit Word Representations. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. Baltimore: ACL, pp. 171–80.

Ljubešić, N., Erjavec, T. and Fišer, D. (2014). Standardizing tweets with character-level machine translation. In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 8404. Berlin: Springer, pp. 164–75.

Lyras, D. P., Sgarbas, K. N. and Fakotakis, N. D. (2008). Applying similarity measures for automatic lemmatization: a case-study for modern Greek and English. International Journal on Artificial Intelligence Tools, 17: 1043–64.

Manning, C. D., Raghavan, P. and Schütze, H. (2008). Introduction to Information Retrieval. New York: Cambridge University Press.

Manning, C. D. (2015). Computational linguistics and deep learning. Computational Linguistics, 41(4): 701–7.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Neural Information Processing Systems, 26: 3111–9.

Mitankin, P., Gerjikov, S. and Mihov, S. (2014). An approach to unsupervised historical text normalisation. In Antonacopoulos, A. and Schulz, K. (eds), Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage. New York: Association for Computing Machinery, pp. 29–34.

Nair, V. and Hinton, G. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Fürnkranz, J. and Joachims, T. (eds), Proceedings of the 27th International Conference on Machine Learning. Madison, WI: Omnipress, pp. 807–14.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. (2011). Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12: 2825–30.

Piotrowski, M. (2012). Natural Language Processing for Historical Texts. California: Morgan & Claypool Publishers.

Reynaert, M., Hendrickx, I. and Marquilhas, R. (2012). Historical spelling normalization: A comparison of two statistical methods, TICCL and VARD2. In Mambrini, F., Passarotti, M. and Sporleder, C. (eds), Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities. Lisbon, Portugal: Edições Colibri, pp. 87–98.

Scherrer, Y. and Erjavec, T. (2013). Modernizing historical Slovene words with character-based SMT. In Piskorski, J., Pivovarova, L., Tanev, H. and Yangarber, R. (eds), Proceedings of the 4th Biennial Workshop on Balto-Slavic Natural Language Processing. Sofia, Bulgaria: Association for Computational Linguistics, pp. 58–62.

Schulz, S., De Pauw, G., De Clercq, O., Desmet, B., Hoste, V., Daelemans, W. and Macken, L. (2016). Multi-modular text normalization of Dutch user-generated content. ACM Transactions on Intelligent Systems and Technology, 7(4): 1–22.


Souvay, G. and Pierrel, J.-M. (2009). LGeRM: Lemmatisation des Mots en Moyen Français. Traitement Automatique des Langues, 50(2): 149–72.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15: 1929–58.

Toutanova, K., Klein, D., Manning, C. and Singer, Y. (2003). Feature-rich Part-of-speech Tagging with a Cyclic Dependency Network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1. Stroudsburg: Association for Computational Linguistics, pp. 173–80.

Van Dalen-Oskam, K. and Van Zundert, J. (2007). Delta for Middle Dutch – author and copyist distinction in Walewein. Literary and Linguistic Computing, 22(3): 345–62.

Van der Maaten, L. and Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9: 2579–605.

Van der Voort van der Kleij, J. (2005). Reverse Lemmatization of the Dictionary of Middle Dutch (1885–1929) Using Pattern Matching. In Kiefer, F., Kiss, G. and Pajzs, J. (eds), Papers in Computational Lexicography. Budapest: Hungarian Academy of Sciences, pp. 203–10.

Van Halteren, H. and Rem, M. (2013). Dealing with orthographic variation in a tagger-lemmatizer for fourteenth century Dutch charters. Language Resources and Evaluation, 47(4): 1233–59.

Van Kerckvoorde, C. (1992). An Introduction to Middle Dutch. The Hague: Mouton de Gruyter.

Zavrel, J. and Daelemans, W. (1999). Recent Advances in Memory-Based Part-of-Speech Tagging. In Actas del VI Simposio Internacional de Comunicación Social. Santiago de Cuba: Centro de Lingüística Aplicada, pp. 590–7.

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. ArXiv, 1212.5701v1.

Zhang, X., Zhao, J. and LeCun, Y. (2015). Character-level convolutional networks for text classification. Neural Information Processing Systems, 28, s.p.

Notes

1 An example of over-eager expansion might, for instance, create confusion in the case of minimal pairs: hoer and haer are both allowed under the lemma 'haar' ('her'), but applying the same vowel transition ('oe' > 'ae') to the token maer under the lemma 'maar' ('but') would yield the token moer, which is primarily attested under lemmas such as 'muur' ('wall'), 'moeder' ('mother') or 'modder' ('mud'). A small illustrative sketch of this risk follows these notes.

2 In computer vision, convolutional layers are traditionally alternated with max-pooling layers, although the surplus value of max-pooling is increasingly questioned. In preliminary experiments, max-pooling did not yield a clear beneficial effect, nor did the introduction of stacked convolutions. A second sketch after these notes illustrates the resulting layer ordering.

3 https://code.google.com/archive/p/word2vec

4 https://radimrehurek.com/gensim

5 https://github.com/mikekestemont/tag

6 http://keras.io

7 The most comprehensive discussion of the annotation guidelines has been provided in the Adelheid project and can be consulted online: http://adelheid.ruhosting.nl (last accessed 6 November 2014).

8 Online at http://gtb.inl.nl (last accessed 6 November 2014).

9 See https://www.nederlab.nl/home (last accessed 6 November 2014).

10 Based on personal communication with the authors.
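
The following short Python sketch illustrates the risk described in note 1; it is a toy example, not part of the original system. It harvests the character substitution suggested by the attested pair hoer/haer and reapplies it, in both directions, to another token, producing the problematic form moer from maer.

# Illustrative sketch only: naive, over-eager expansion of spelling variants.

def expand(token, old, new):
    # Apply a single character-sequence substitution to a token.
    return token.replace(old, new)

# hoer and haer are attested variants under the lemma 'haar' ('her'),
# which suggests the alternation 'oe' <-> 'ae'.
assert expand("hoer", "oe", "ae") == "haer"

# Reusing the same alternation on maer (lemma 'maar', 'but') yields moer,
# a form primarily attested under unrelated lemmas ('muur', 'moeder', 'modder').
print(expand("maer", "ae", "oe"))  # prints: moer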
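As a rough illustration of the layer ordering discussed in note 2, the sketch below defines a character-level temporal convolution in present-day Keras (cf. note 6) in which the convolutional layer feeds a dense layer directly, without max-pooling and without stacked convolutions. All hyperparameter values here (a token length of 15 characters, a 27-symbol alphabet, 50-dimensional character embeddings, 100 filters of width 3, a 500-unit hidden layer, and the size of the lemma vocabulary) are assumptions chosen for the example, not the settings reported in the article.

from tensorflow import keras
from tensorflow.keras import layers

n_lemmas = 10000  # hypothetical size of the lemma vocabulary

# A token is encoded as a fixed-length sequence of character indices.
inputs = keras.Input(shape=(15,), dtype="int32")
x = layers.Embedding(input_dim=27, output_dim=50)(inputs)            # one vector per character
x = layers.Conv1D(filters=100, kernel_size=3, activation="relu")(x)  # temporal convolution over characters
x = layers.Flatten()(x)   # no max-pooling between convolution and dense layer (cf. note 2)
x = layers.Dense(500, activation="relu")(x)
outputs = layers.Dense(n_lemmas, activation="softmax")(x)            # lemma prediction as classification
model = keras.Model(inputs, outputs)
model.compile(optimizer="adadelta", loss="categorical_crossentropy")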
