
Ancient-Modern Chinese Translation with a Large Training Dataset

Dayiheng Liu, Jiancheng Lv, Kexin Yang, Qian Qu
Data Intelligence Laboratory, Sichuan University, Chengdu, China

losinuris@gmail.com  lvjiancheng@scu.edu.cn

Abstract

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. In this paper, we propose an Ancient-Modern Chinese clause alignment approach and apply it to create a large-scale Ancient-Modern Chinese parallel corpus which contains ∼1.24M bilingual pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset.1 Furthermore, we train SMT and various NMT-based models on this dataset and provide a strong baseline for this task.

1 Introduction

Ancient Chinese is the written language of ancient China. It is a treasure of Chinese culture which brings together the wisdom and ideas of the Chinese nation and chronicles the ancient cultural heritage of China. Learning ancient Chinese not only helps people to understand and inherit the wisdom of the ancients, but also helps them to absorb and develop Chinese culture.2

However, it is difficult for modern people to read ancient Chinese. Firstly, compared with modern Chinese, ancient Chinese is concise and short, and its word order differs greatly from that of modern Chinese. Secondly, most modern Chinese words are disyllabic, while most ancient Chinese words are monosyllabic. Thirdly, polysemy is common in ancient Chinese. In addition, manual translation has a high cost. Therefore, it is meaningful and useful to study automatic translation from ancient Chinese to modern Chinese.

1 We will release the dataset and source code upon acceptance.

2 The concept of Ancient Chinese in this paper refers roughly to the cultural/historical notion of literary Chinese (called WenYanWen in Chinese).

Through Ancient-Modern Chinese translation, the wisdom, talent and accumulated experience of our predecessors can be passed on to more people.

Neural machine translation (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014; Wu et al., 2016) has achieved remarkable performance on many bilingual translation tasks. It is an end-to-end learning approach for machine translation, with the potential to show great advantages over statistical machine translation (SMT) systems. However, the NMT approach has not been widely applied to the Ancient-Modern Chinese translation task. One of the main reasons is the limited high-quality parallel data resource.

The most popular method of acquiring translation examples is bilingual text alignment (Kaji et al., 1992). Such methods can be classified into two types: lexical-based and statistical-based. The lexical-based approaches (Wang and Ren, 2005; Kit et al., 2004) focus on lexical information, utilizing bilingual dictionaries or lexical features, while the statistical-based approaches (Brown et al., 1991; Gale and Church, 1993) rely on statistical information, such as the sentence length ratio between the two languages and the alignment mode probability.

However, these methods are designed for other bilingual language pairs. The Ancient-Modern Chinese pair has some characteristics that are quite different from other language pairs. For example, ancient and modern Chinese are both written in Chinese characters, but ancient Chinese is highly concise and its syntactic structure differs from that of modern Chinese. The traditional methods do not take these characteristics into account. In this paper, we propose an effective Ancient-Modern Chinese text alignment method at the level of clauses3 based on the characteristics of these two languages.

3 The clause alignment is more fine-grained than sentence alignment. In the experiments, a sentence is split into clauses when we meet a comma, semicolon, period or exclamation mark.


The proposed method combines both lexical-based and statistical-based information and achieves a 94.2 F1-score on our manually annotated test set. Recently, Zhang et al. (2018) proposed a simple longest-common-subsequence-based approach for Ancient-Modern Chinese sentence alignment. Our experiments show that our proposed alignment approach performs much better than their method.

We apply the proposed method to create a large translation parallel corpus which contains ∼1.24M bilingual sentence pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset.4 Furthermore, we test SMT models and various NMT-based models on the created dataset and provide a strong baseline for this task.

2 Creating Large Training Dataset

2.1 Overview

There are four steps to build the Ancient-Modern Chinese translation dataset: (i) parallel corpus crawling and cleaning; (ii) paragraph alignment; (iii) clause alignment based on the aligned paragraphs; (iv) augmenting the data by merging aligned adjacent clauses. The most critical step is the third one.

2.2 Clause Alignment

In the clause alignment step, we combine both statistical-based and lexical-based information to measure the score of each possible clause alignment between ancient and modern Chinese strings. Dynamic programming is then employed to find the overall optimal alignment paragraph by paragraph. According to the characteristics of the ancient and modern Chinese languages, we consider the following factors to measure the alignment score d(s, t) between a bilingual clause pair:

Lexical Matching. The lexical matching score is used to calculate the matching coverage of the ancient clause s. It contains two parts: exact matching and dictionary matching. An ancient Chinese character usually corresponds to one or more modern Chinese words.


4 The dataset in Zhang et al. (2018) contains only 57,391 sentence pairs, the dataset in Lin and Wang (2007) involves only 205 Ancient-Modern Chinese paragraph pairs, and the dataset in Liu and Wang (2012) involves only one history book.

In the first part, we apply Chinese word segmentation to the modern Chinese clause t, and then match the ancient characters and modern words in order from left to right.5 In further matching, words that have already been matched are deleted from the original clauses. However, some ancient characters do not appear in their corresponding modern words. An ancient Chinese dictionary is employed to address this issue. We preprocess the ancient Chinese dictionary and remove the stop words. In the dictionary matching part, we retrieve the dictionary definition of each unmatched ancient character and use it to match the remaining modern Chinese words. To reduce the impact of universal word matching, we use the Inverse Document Frequency (IDF) to weight the matching words. The lexical matching score is calculated as:

L(s, t) = \frac{1}{|s|} \sum_{c \in s} \mathbb{1}_t(c) + \frac{1}{|s|} \sum_{c \in \bar{s}} \max\Big(1, \sum_k \frac{idf_k}{10} \cdot \mathbb{1}_{\bar{t}}(d_k^c)\Big).   (1)

The first term of equation (1) is the co-occurrence matching score: |s| denotes the length of s, c denotes an ancient character in s, and the indicator function \mathbb{1}_t(c) indicates whether the character c matches any word in the clause t. The second term is the dictionary matching score, where \bar{s} and \bar{t} represent the remaining unmatched strings of s and t, respectively, d_k^c denotes the k-th character in the dictionary definition of c, and its normalized IDF is denoted as idf_k/10.
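To make the computation concrete, the following is a minimal Python sketch of equation (1). The segmented modern clause, the ancient-character dictionary, and the per-character IDF table are assumed inputs, and the simple left-to-right matching loop is our reading of the procedure described above, not the authors' code.

```python
# Hedged sketch of the lexical matching score L(s, t) of equation (1).
# `t_words` is the segmented modern clause; `dictionary` maps an ancient
# character to its (stop-word-free) definition string; `idf` holds the
# normalized IDF (idf_k / 10) of each character. All three are assumptions.
def lexical_matching_score(s, t_words, dictionary, idf):
    remaining_words = list(t_words)
    unmatched_chars = []
    matched = 0
    for c in s:  # exact matching, left to right
        hit = next((w for w in remaining_words if c in w), None)
        if hit is not None:
            matched += 1
            remaining_words.remove(hit)  # matched words are deleted
        else:
            unmatched_chars.append(c)
    exact_score = matched / len(s)

    # Dictionary matching on the remaining unmatched characters and words.
    remaining_text = "".join(remaining_words)
    dict_score = 0.0
    for c in unmatched_chars:
        definition = dictionary.get(c, "")
        weight = sum(idf.get(d, 0.0) for d in definition if d in remaining_text)
        dict_score += max(1.0, weight)  # max(1, ...) as printed in equation (1)
    return exact_score + dict_score / len(s)
```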

Statistical Information. Similar to Gale and Church (1993) and Wang and Ren (2005), the statistical information contains alignment mode and length information. There are many alignment modes between the ancient and modern Chinese languages. If one ancient Chinese clause aligns to two adjacent modern Chinese clauses, we call this the 1-2 alignment mode. In this paper, we only consider the 1-0, 0-1, 1-1, 1-2, 2-1 and 2-2 alignment modes, which account for 99.4% of the validation set. We estimate the probability Pr(n-m) of each alignment mode n-m on the validation set. To utilize length information, we investigate the length correlation between these two languages.

5 When an ancient character appears in a modern word, we define the character as exactly matching the word.


Based on the assumption of Gale and Church (1993) that each character in one language gives rise to a random number of characters in the other language, and that those random variables δ are independent and identically distributed with a normal distribution, we estimate the mean µ and standard deviation σ from the paragraph-aligned parallel corpus. Given a clause pair (s, t), the statistical information score can be calculated by:

S(s, t) = \phi\Big(\frac{|s|/|t| - \mu}{\sigma}\Big) \cdot \Pr(n\text{-}m),   (2)

where ϕ(·) denotes the normal distribution probability density function.

Edit Distance. Edit distance is a way of quantifying the dissimilarity between two strings by counting the minimum number of operations (insertion, deletion, and substitution) required to transform one string into the other. Here we define the edit distance score as:

E(s, t) = 1 - \frac{\mathrm{EditDis}(s, t)}{\max(|s|, |t|)}.   (3)

Dynamic Programming. The overall alignment score for each possible clause alignment is as follows:

d(s, t) = L(s, t) + γS(s, t) + λE(s, t). (4)

Here γ and λ are pre-defined interpolation factors. We use dynamic programming to find the overall optimal alignment paragraph by paragraph. Let D(i, j) be the total alignment score of aligning the first i ancient Chinese clauses with the first j modern Chinese clauses; the recurrence can then be described as follows:

D(i, j) = \max_{0 \le n, m \le 2} \{ D(i-n, j-m) + d(s_{i-n:i}, t_{j-m:j}) \},   (5)

where n and m denote the alignment mode n-m, and s_{i-n:i} denotes the i-th ancient Chinese clause together with its previous n−1 ancient Chinese clauses.
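The following is a minimal Python sketch of this search, assuming a `score` function that implements d(s, t) of equation (4) for a candidate alignment mode; clause lists are 0-indexed here, and the 0-0 mode is excluded.

```python
# Hedged sketch of the dynamic-programming alignment of equation (5).
# `score(src_span, tgt_span, mode)` stands for d(s, t) of equation (4) and
# must also handle the empty spans produced by the 1-0 and 0-1 modes.
MODES = [(1, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)]

def align_paragraph(src, tgt, score):
    I, J = len(src), len(tgt)
    D = [[float("-inf")] * (J + 1) for _ in range(I + 1)]
    back = [[None] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            for n, m in MODES:
                if i < n or j < m or D[i - n][j - m] == float("-inf"):
                    continue
                cand = D[i - n][j - m] + score(src[i - n:i], tgt[j - m:j], (n, m))
                if cand > D[i][j]:
                    D[i][j], back[i][j] = cand, (n, m)
    # Backtrace the optimal clause alignment.
    pairs, i, j = [], I, J
    while (i, j) != (0, 0):
        n, m = back[i][j]
        pairs.append((src[i - n:i], tgt[j - m:j]))
        i, j = i - n, j - m
    return list(reversed(pairs))
```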

2.3 Ancient-Modern Chinese Dataset

Data Collection. To build the large Ancient-Modern Chinese dataset, we collected 1.7K bilingual ancient-modern Chinese articles from the internet. More specifically, a large part of the ancient Chinese data we used comes from ancient Chinese history records of several dynasties (from about 1000 BC to 200 BC) and articles written by celebrities of that era.

They used plain and accurate words to express what happened at that time, thus ensuring the generality of the translated materials. The paragraph alignment is completed manually. After data cleaning and manual paragraph alignment, we obtained 35K aligned bilingual paragraphs.

We applied our clause alignment algorithm to the 35K aligned bilingual paragraphs and obtained 517K aligned bilingual clauses. Furthermore, we augmented the data in the following way: given an aligned clause pair, we merge its adjacent clause pairs to form additional training pairs. After the data augmentation, we filtered out the sentences longer than 50. Our experiments show that this augmentation technique can greatly improve the performance of the NMT model. Finally, we split the dataset into three sets: training (Train), development (Dev) and testing (Test); their statistics are shown in Table 1. Note that the sentences in different sets come from different articles. We show some examples of the data in the Appendix.
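A minimal sketch of this augmentation step is given below; the merge window `max_merge` and the interpretation of the length limit as a character count are assumptions, since the paper does not state them.

```python
# Hedged sketch of the data augmentation: each aligned clause pair is merged
# with its following adjacent pairs (within one paragraph) to form extra
# training pairs; overly long results are filtered out.
def augment(aligned_pairs, max_len=50, max_merge=3):
    augmented = []
    for i in range(len(aligned_pairs)):
        for k in range(1, max_merge + 1):  # merge k adjacent clause pairs
            if i + k > len(aligned_pairs):
                break
            src = "".join(s for s, _ in aligned_pairs[i:i + k])
            tgt = "".join(t for _, t in aligned_pairs[i:i + k])
            if len(src) <= max_len and len(tgt) <= max_len:
                augmented.append((src, tgt))
    return augmented
```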

Set     Pairs     Src Tokens   Tgt Tokens
Train   1.02M     12.56M       18.35M
Dev     125.71K   1.51M        2.38M
Test    100.63K   1.19M        1.87M

Table 1: Statistics of the Ancient-Modern Chinese dataset. Src Tokens: the number of tokens in the source language; Tgt Tokens: the number of tokens in the target language.

3 Experiments

3.1 Clause Alignment Results

In order to evaluate our clause alignment algorithm, we manually aligned bilingual clauses from 37 bilingual ancient-modern Chinese articles, and finally obtained 4K aligned bilingual clauses as the test set and 2K clauses as the validation set.

We evaluated the clause alignment algorithm with various settings on the test set. In addition, we compared our method with the longest common subsequence (LCS) based approach proposed by Zhang et al. (2018). We estimated µ and σ on all aligned paragraphs. The probability Pr(n-m) of each alignment mode n-m was estimated on the validation set. Grid search was applied to search for the hyper-parameters γ and λ on the validation set. The Jieba Chinese text segmentation tool6 is employed for modern Chinese word segmentation.
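As an illustration of the tuning step, here is a minimal grid-search sketch; the candidate grids and the `evaluate_f1` callback (which runs the aligner with given γ and λ and returns F1 on the validation set) are hypothetical.

```python
# Hedged sketch of the grid search for the interpolation factors of eq. (4).
import itertools

def grid_search(evaluate_f1, gammas, lambdas):
    # `evaluate_f1(gamma, lam)` is an assumed callback that aligns the
    # validation set with the given factors and returns the F1-score.
    best_gamma, best_lambda, best_f1 = None, None, -1.0
    for g, l in itertools.product(gammas, lambdas):
        f1 = evaluate_f1(g, l)
        if f1 > best_f1:
            best_gamma, best_lambda, best_f1 = g, l, f1
    return best_gamma, best_lambda, best_f1

# Example grids (assumptions):
# grid_search(evaluate_f1, [0.1 * k for k in range(1, 11)],
#             [0.1 * k for k in range(1, 11)])
```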

6 A Python-based Chinese word segmentation module: https://github.com/fxsjy/jieba.



We used F1-score and precision as the evaluation metrics. The results are shown in Table 2, where the abbreviation w/o means removing a particular part from the setting. We find that the lexical matching score is the most important among the three factors, and the statistical information score is more important than the edit distance score. Moreover, the dictionary term in the lexical matching score greatly improves the performance. From these results, we obtain the best setting, which involves all three factors, and we used this setting for clause alignment. Furthermore, the proposed method performs much better than the LCS approach of Zhang et al. (2018).

Setting                    F1     Precision
all                        94.2   94.8
w/o lexical score          84.3   86.5
w/o statistical score      92.8   93.9
w/o edit distance          93.9   94.4
w/o dictionary             93.1   93.9
LCS (Zhang et al., 2018)   91.3   92.2

Table 2: Evaluation results of different settings on the test set. w/o: without. w/o dictionary: without using the dictionary term in the lexical matching score.

3.2 Translation Results

We train the SMT and various NMT-based models on the dataset.

SMT. The state-of-the-art Moses toolkit (Koehn et al., 2007) was used to train the SMT model. We used KenLM (Heafield, 2011) to train a 5-gram language model, and the GIZA++ toolkit to align the data.

NMT. The basic NMT model is based on (Bahdanau et al., 2014). Furthermore, we tested the basic NMT model with several techniques, such as target language reversal, residual connections (He et al., 2016) and pre-trained word2vec embeddings (Mnih and Kavukcuoglu, 2013).

Transformer. We also trained the Transformer model (Vaswani et al., 2017), which is a strong baseline.

The hyper-parameters and generated samples of the above models are shown in the Appendix. To verify the effectiveness of our data augmentation method, we tested the NMT and SMT models on both the unaugmented dataset (containing 0.46M training pairs) and the augmented dataset.


Model         1-gram   2-gram   3-gram   4-gram
SMT           50.24    38.05    29.93    24.04
+ Augment     51.70    39.09    30.73    24.72
basic NMT     23.43    17.24    13.16    10.26
+ Reversal    29.84    21.55    16.13    12.39
+ Residual    33.66    23.52    17.25    13.03
+ Word2vec    34.29    25.11    19.02    14.74
+ Augment     53.35    40.10    31.51    25.42
Transformer   51.23    37.99    29.49    23.49

Table 3: 1- to 4-gram BLEU results of the various models.


For the evaluation, we used the average of the 1- to 4-gram BLEU scores (Papineni et al., 2002), computed by multi-bleu.perl in Moses, as the metric. The results are reported in Table 3. For NMT, we can see that target language reversal, residual connections, and word2vec further improve the performance of the basic NMT model. Moreover, the results show that training the NMT model on the augmented data greatly improves performance. For SMT, it performs better than the NMT models when both are trained on the unaugmented dataset; however, when trained on the augmented dataset, the NMT model outperforms the SMT model. This indicates that a large amount of data is necessary for the NMT model. In addition, training the Transformer is very fast, but its performance is slightly worse than the SMT and full NMT models.
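For reference, the following sketch reproduces the metric with NLTK in place of multi-bleu.perl; small numerical differences between the two implementations are possible.

```python
# Hedged sketch of the metric: cumulative 1- to 4-gram BLEU and their average.
from nltk.translate.bleu_score import corpus_bleu

def one_to_four_gram_bleu(references, hypotheses):
    # references: list of lists of reference token lists; hypotheses: list of
    # token lists (here, modern Chinese sentences tokenized into words).
    scores = []
    for n in range(1, 5):
        weights = tuple([1.0 / n] * n)  # uniform weights up to n-grams
        scores.append(100 * corpus_bleu(references, hypotheses, weights=weights))
    return scores, sum(scores) / 4.0
```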

4 Conclusion

We propose an effective Ancient-Modern Chinese clause alignment method which achieves a 94.2 F1-score on the test set. Based on it, we build a large-scale parallel corpus which contains ∼1.24M bilingual sentence pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. In addition, we provide a strong NMT baseline for this task which achieves a BLEU score of 25.42 (4-gram).


5 Appendix

5.1 NMT Configurations

The basic NMT model is based on (Bahdanau et al., 2014). Both the encoder and decoder used 2-layer RNNs with 1024 LSTM cells, and the encoder is a bi-directional RNN. The batch size, the threshold of element-wise gradient clipping, and the initial learning rate of the Adam optimizer (Kingma and Ba, 2014) were set to 128, 5.0 and 0.001, respectively. When training the model on the augmented dataset, we used 4-layer RNNs. Several techniques were investigated to train the model, including layer normalization (Ba et al., 2016), RNN dropout (Gal and Ghahramani, 2016), and learning rate decay (Wu et al., 2016). The hyper-parameters were chosen empirically and adjusted on the validation set. For word embedding pre-training, we collected an external ancient Chinese corpus which contains ∼134M tokens.
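As a rough illustration of this configuration (not the authors' code), a PyTorch sketch might look as follows; the vocabulary size and embedding dimension are assumptions.

```python
# Hedged PyTorch sketch of the basic NMT configuration described above.
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HIDDEN, LAYERS = 30000, 512, 1024, 2  # sizes partly assumed

embedding = nn.Embedding(VOCAB_SIZE, EMB_DIM)
encoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS,
                  bidirectional=True, batch_first=True)  # bi-directional encoder
decoder = nn.LSTM(EMB_DIM, HIDDEN, num_layers=LAYERS, batch_first=True)

params = (list(embedding.parameters()) + list(encoder.parameters())
          + list(decoder.parameters()))
optimizer = torch.optim.Adam(params, lr=0.001)  # initial learning rate 0.001

# Per the paper: batch size 128, and element-wise gradient clipping at 5.0,
# applied after loss.backward() and before optimizer.step():
# torch.nn.utils.clip_grad_value_(params, 5.0)
```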

5.2 Transformer Configurations

The training configuration of the Transformer model7 (Vaswani et al., 2017) is shown in Table 4.

7 The implementation of the Transformer model is based on https://github.com/Kyubyong/transformer.

Setting               Value
Batch Size            64
Inner Hidden Size     1024
Word Embedding Size   512
Dropout Rate          0.1
Num Heads h           16
Num Layers N          6
d_k                   64
d_v                   64
d_model               512

Table 4: The training configuration of the Transformer.

5.3 Samples

Some data samples of the Ancient-Modern Chinese parallel corpus are shown in Figure 1. The generated samples of the various models are shown in Figure 2.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Peter F. Brown, Jennifer C. Lai, and Robert L. Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 169–176. Association for Computational Linguistics.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027.

William A. Gale and Kenneth W. Church. 1993. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1):75–102.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197. Association for Computational Linguistics.

Hiroyuki Kaji, Yuuko Kida, and Yasutsugu Morimoto. 1992. Learning translation templates from bilingual text. In Proceedings of the 14th Conference on Computational Linguistics, Volume 2, pages 672–678. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Chunyu Kit, Jonathan J. Webster, King-Kui Sin, Haihua Pan, and Heng Li. 2004. Clause alignment for Hong Kong legal texts: A lexical-based approach. International Journal of Corpus Linguistics, 9(1):29–51.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180. Association for Computational Linguistics.

Zhun Lin and Xiaojie Wang. 2007. Chinese ancient-modern sentence alignment. In International Conference on Computational Science, pages 1178–1185. Springer.

Ying Liu and Nan Wang. 2012. Sentence alignment for ancient and modern Chinese parallel corpus. In Emerging Research in Artificial Intelligence and Computational Intelligence, pages 408–415. Springer.


Andriy Mnih and Koray Kavukcuoglu. 2013. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Xiaojie Wang and Fuji Ren. 2005. Chinese-Japanese clause alignment. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 400–412. Springer.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Zhiyuan Zhang, Wei Li, and Xu Sun. 2018. Automatic transferring between ancient Chinese and contemporary Chinese. arXiv preprint arXiv:1803.01557.


Figure 1: Some data samples of the Ancient-Modern Chinese parallel corpus.


Figure 2: Some generated samples of various models.