
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 133–143, Baltimore, Maryland, USA, June 23-25 2014. ©2014 Association for Computational Linguistics

Learning Topic Representation for SMT with Neural Networks∗

Lei Cui1, Dongdong Zhang2, Shujie Liu2, Qiming Chen3, Mu Li2, Ming Zhou2, and Muyun Yang1

1School of Computer Science and Technology, Harbin Institute of Technology, Harbin, P.R. China
[email protected], [email protected]

2Microsoft Research, Beijing, P.R. China
{dozhang,shujliu,muli,mingzhou}@microsoft.com

3Shanghai Jiao Tong University, Shanghai, P.R. China
[email protected]

Abstract

Statistical Machine Translation (SMT) usually utilizes contextual information to disambiguate translation candidates. However, it is often limited to contexts within sentence boundaries, hence broader topical information cannot be leveraged. In this paper, we propose a novel approach to learning topic representation for parallel data using a neural network architecture, where abundant topical contexts are embedded via topic relevant monolingual data. By associating each translation rule with the topic representation, topic relevant rules are selected according to the distributional similarity with the source text during SMT decoding. Experimental results show that our method significantly improves translation accuracy in the NIST Chinese-to-English translation task compared to a state-of-the-art baseline.

1 Introduction

Making translation decisions is a difficult task in many Statistical Machine Translation (SMT) systems. Current translation modeling approaches usually use context dependent information to disambiguate translation candidates. For example, translation sense disambiguation approaches (Carpuat and Wu, 2005; Carpuat and Wu, 2007) are proposed for phrase-based SMT systems. Meanwhile, for hierarchical phrase-based or syntax-based SMT systems, there is also much work involving rich contexts to guide rule selection (He et al., 2008; Liu et al., 2008; Marton and Resnik, 2008; Xiong et al., 2009). Although these methods are effective and proven successful in many SMT systems, they only leverage within-sentence contexts, which are insufficient for exploring broader information.

∗This work was done while the first and fourth authors were visiting Microsoft Research.

For example, the word driver often means "the operator of a motor vehicle" in common texts. But in the sentence "Finally, we write the user response to the buffer, i.e., pass it to our driver", we understand that driver means "computer program". In this case, people understand the meaning because of the IT topical context, which goes beyond sentence-level analysis and requires more relevant knowledge. Therefore, it is important to leverage topic information to learn smarter translation models and achieve better translation performance.

Topic modeling is a useful mechanism for discovering and characterizing various semantic concepts embedded in a collection of documents. Attempts on topic-based translation modeling include topic-specific lexicon translation models (Zhao and Xing, 2006; Zhao and Xing, 2007), topic similarity models for synchronous rules (Xiao et al., 2012), and document-level translation with topic coherence (Xiong and Zhang, 2013). In addition, topic-based approaches have been used in domain adaptation for SMT (Tam et al., 2007; Su et al., 2012), where different topics are viewed as different domains. One property these approaches have in common is that they only utilize parallel data where document boundaries are explicitly given. In this way, the topic of a sentence can be inferred with document-level information using off-the-shelf topic modeling toolkits such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) or the Hidden Topic Markov Model (HTMM) (Gruber et al., 2007). Most of them also assume that the input is given at the document level. However, this is not always the case, since a considerable amount of parallel data does not have document boundaries. In addition, contemporary SMT systems often work at the sentence level rather than the document level for efficiency. Although we can easily apply LDA at the sentence level, it is quite difficult to infer the topic accurately with only a few words in a sentence. This makes previous approaches inefficient when applied in real-world commercial SMT systems. Therefore, we need to devise a systematic approach to enriching a sentence and inferring its topic more accurately.

In this paper, we propose a novel approach to learning topic representations for sentences. Since the information within a sentence is insufficient for topic modeling, we first enrich sentence contexts via Information Retrieval (IR) methods, using content words in the sentence as queries, so that topic-related monolingual documents can be collected. These topic-related documents are utilized to learn a specific topic representation for each sentence using a neural network based approach. Neural networks are an effective technique for learning different levels of data representations. The levels inferred by a neural network correspond to distinct levels of concepts, where high-level representations are obtained from the low-level bag-of-words input. The network is able to detect correlations among any subset of input features through non-linear transformations, which helps eliminate the effect of noisy words that are irrelevant to the topic. Our problem fits well into the neural network framework, and we expect it to further improve the inference of topic representations for sentences.

To incorporate topic representations as translation knowledge into SMT, our neural network based approach directly optimizes the similarity between the source language and the target language in a compact topic space. This underlying topic space is learned from sentence-level parallel data in order to share topic information across the source and target languages as much as possible. Additionally, our model can be discriminatively trained with a large number of training instances, without expensive sampling methods such as those in LDA or HTMM, and is thus more practicable and scalable. Finally, we associate the learned representation with each bilingual translation rule. Topic-related rules are selected according to their distributional similarity with the source text, which helps hypothesis generation in SMT decoding. We integrate topic similarity features into the log-linear model and evaluate the performance on the NIST Chinese-to-English translation task. Experimental results demonstrate that our model significantly improves translation accuracy over a state-of-the-art baseline.

2 Background: Deep Learning

Deep learning is an active topic in recent years which has triumphed in many machine learning research areas. This technique began raising public awareness in the mid-2000s after researchers showed how a multi-layer feed-forward neural network can be effectively trained. The training procedure often involves two phases: a layer-wise unsupervised pre-training phase and a supervised fine-tuning phase. For pre-training, the Restricted Boltzmann Machine (RBM) (Hinton et al., 2006), auto-encoding (Bengio et al., 2006) and sparse coding (Lee et al., 2006) are most frequently used. Unsupervised pre-training trains the network one layer at a time and helps guide the parameters of the layer towards better regions in parameter space (Bengio, 2009). Followed by fine-tuning in this parameter region, deep learning is able to achieve state-of-the-art performance in various research areas, including breakthrough results on the ImageNet dataset for object recognition (Krizhevsky et al., 2012), significant error reduction in speech recognition (Dahl et al., 2012), etc.

Deep learning has also been successfully applied to a variety of NLP tasks such as part-of-speech tagging, chunking, named entity recognition, semantic role labeling (Collobert et al., 2011), parsing (Socher et al., 2011a), sentiment analysis (Socher et al., 2011b), etc. Most NLP research converts a high-dimensional and sparse binary representation into a low-dimensional and real-valued representation. This low-dimensional representation is usually learned from a huge amount of monolingual text in the pre-training phase, and then fine-tuned towards a task-specific criterion. Inspired by previous successful research, we first learn sentence representations using topic-related monolingual texts in the pre-training phase, and then optimize the bilingual similarity by leveraging sentence-level parallel data in the fine-tuning phase.

3 Topic Similarity Model with Neural Network

In this section, we explain our neural network based topic similarity model in detail, as well as how to incorporate the topic similarity features into the SMT decoding procedure. Figure 1 sketches the high-level overview, which illustrates how to learn topic representations using sentence-level parallel data.

Figure 1: Overview of the neural network based topic similarity model. A parallel sentence pair 〈f, e〉 is used as queries to retrieve relevant documents df and de from the Chinese and English document collections via IR; the retrieved documents are converted to bag-of-words inputs f and e, which the neural networks map to zf = g(f) and ze = g(e), from which sim(f, e) = cos(zf, ze) is computed.

Given a parallel sentence pair 〈f, e〉, the first step is to treat f and e as queries, and use IR methods to retrieve relevant documents to enrich contextual information for them. Specifically, the ranking model we use is a Vector Space Model (VSM), where the query and document are converted into tf-idf weighted vectors. The most relevant N documents df and de are retrieved and converted to high-dimensional, bag-of-words inputs f and e for the representation learning1.
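The retrieval step itself is standard tf-idf ranking. As a minimal sketch (not the authors' implementation, which relies on a Lucene inverted index), the following Python snippet shows how the top-N documents for one sentence query could be collected with scikit-learn; the variable names and the assumption that documents and queries are already tokenized are ours.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def retrieve_top_n(sentence, documents, n=10):
    """Rank pre-tokenized monolingual documents against a sentence query with a tf-idf VSM."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)        # |D| x |V| tf-idf matrix, rows L2-normalized
    query_vec = vectorizer.transform([sentence])            # 1 x |V| query vector
    scores = (doc_matrix @ query_vec.T).toarray().ravel()   # cosine similarity via dot product
    top = np.argsort(-scores)[:n]
    return [documents[i] for i in top]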

There are two phases in our neural network training process: pre-training and fine-tuning. In the pre-training phase (Section 3.1), we build two neural networks with the same structure but different parameters to learn a low-dimensional representation for sentences in the two different languages. Then, in the fine-tuning phase (Section 3.2), our model directly optimizes the similarity of the two low-dimensional representations, so that it correlates well with SMT decoding. Finally, the learned representation is used to calculate similarities which are integrated as features in the SMT decoding procedure (Section 3.3).

3.1 Pre-training using denoising auto-encoder

In the pre-training phase, we leverage neural network structures to transform high-dimensional sparse vectors into low-dimensional dense vectors. The topic similarity is calculated on top of the learned dense vectors. This dense representation should preserve the information from the bag-of-words input while alleviating the data sparseness problem. Therefore, we use a specially designed mechanism called the auto-encoder to solve this problem. The auto-encoder (Bengio et al., 2006) is one of the basic building blocks of deep learning. Assuming that the input is an n-of-V binary vector x representing the bag-of-words (V is the vocabulary size), an auto-encoder consists of an encoding process g(x) and a decoding process h(g(x)). The objective of the auto-encoder is to minimize the reconstruction error L(h(g(x)), x). Our goal is to learn a low-dimensional vector which can preserve information from the original n-of-V vector.

1We use f and e to denote the n-of-V vectors converted from the retrieved documents.

One problem with the auto-encoder is that it treats all words in the same way, making no distinction between function words and content words. The representation learned by auto-encoders tends to be influenced by the function words, and is therefore not robust. To alleviate this problem, Vincent et al. (2008) proposed the Denoising Auto-Encoder (DAE), which aims to reconstruct a clean, "repaired" input from a corrupted, partially destroyed vector. This is done by corrupting the initial input x to get a partially destroyed version x̃. The DAE is capable of capturing the global structure of the input while ignoring the noise. In our task, for each sentence, we treat the retrieved N relevant documents as a single large document and convert it to a bag-of-words vector x in Figure 2. With the DAE, the input x is corrupted by applying masking noise (randomly masking 1 to 0) to obtain x̃. Denoising training is considered as "filling in the blanks" (Vincent et al., 2010), which means the masked components can be recovered from the non-corrupted components. For example, in IT related texts, if the word driver is masked, it should be predicted through hidden units in the neural network by active signals such as "buffer", "user response", etc.

In our case, the encoding process transforms the corrupted input x̃ into g(x̃) with two layers: a linear layer connected with a non-linear layer. Assuming that the dimension of g(x̃) is L, the linear layer forms an L × V matrix W which projects the n-of-V vector to an L-dimensional hidden layer. After the bag-of-words input has been transformed, it is fed into a subsequent layer to model the highly non-linear relations among words:

z = f(W x̃ + b)    (1)

where z is the output of the non-linear layer and b is an L-length bias vector.

Figure 2: Denoising auto-encoder with a bag-of-words input. The clean input x is corrupted to x̃, encoded as g(x̃), decoded as h(g(x̃)), and trained to minimize L(h(g(x̃)), x).

f(·) is a non-linear function, where common choices include the sigmoid function, hyperbolic function, "hard" hyperbolic function, rectifier function, etc. In this work, we use the rectifier function as our non-linear function due to its efficiency and better performance (Glorot et al., 2011):

rec(x) = x if x > 0, and 0 otherwise    (2)

The decoding process consists of a linear layer and a non-linear layer with similar network structures but different parameters. It transforms the L-dimensional vector g(x̃) to a V-dimensional vector h(g(x̃)). To minimize the reconstruction error with respect to x, we define the loss function as the L2-norm of the difference between the uncorrupted input and the reconstructed input:

L(h(g(x̃)), x) = ‖h(g(x̃)) − x‖2    (3)

Multi-layer neural networks are trained with the standard back-propagation algorithm (Rumelhart et al., 1988). The gradient of the loss function is calculated and back-propagated to the previous layer to update its parameters. Training neural networks involves many factors such as the learning rate and the length of the hidden layers. We will discuss the optimization of these parameters in Section 4.
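To make Equations (1)-(3) concrete, here is a small numpy sketch of one denoising auto-encoder forward pass under our own illustrative assumptions (random initialization, a toy vocabulary size, a 30% masking rate); it is not the authors' implementation, only an instance of the masking-corrupt / rectifier-encode / decode / L2-reconstruction recipe described above.

import numpy as np

rng = np.random.default_rng(0)
V, L = 5000, 100                       # toy sizes; the paper uses V = 100,000 and L in {100, ..., 1000}
W_enc, b_enc = rng.normal(0.0, 0.01, (L, V)), np.zeros(L)
W_dec, b_dec = rng.normal(0.0, 0.01, (V, L)), np.zeros(V)

def rec(x):
    return np.maximum(x, 0.0)          # rectifier non-linearity, Equation (2)

def dae_forward_loss(x, mask_prob=0.3):
    # Masking noise: randomly turn 1s in the n-of-V input to 0s (denoising criterion).
    x_tilde = x * (rng.random(x.shape) > mask_prob)
    z = rec(W_enc @ x_tilde + b_enc)   # encoding g(x~), Equation (1)
    x_hat = rec(W_dec @ z + b_dec)     # decoding h(g(x~)): linear layer plus non-linear layer
    return np.linalg.norm(x_hat - x)   # L2 reconstruction error against the clean x, Equation (3)

# Example: a sparse bag-of-words vector with a few active word indices.
x = np.zeros(V)
x[[3, 17, 256, 1024]] = 1.0
print(dae_forward_loss(x))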

3.2 Fine-tuning with parallel data

In the fine-tuning phase, we stack another layer on top of the two low-dimensional vectors to maximize the similarity between the source and target languages. The similarity scores are integrated into the standard log-linear model for making translation decisions. Since the vectors from the DAEs are trained using information from the monolingual training data independently, these vectors may be inadequate to measure bilingual topic similarity due to their different topic spaces. Therefore, in this stage, parallel sentence pairs are used to help connect the vectors from the different languages, because they express the same topic. In fact, the objective of fine-tuning is to discover a latent topic space which is shared by both languages as much as possible. This shared topic space is particularly useful when the SMT decoder tries to match the source texts and translation candidates in the target language.

Given a parallel sentence pair 〈f, e〉, the DAE learns representations for f and e respectively, as zf = g(f) and ze = g(e) in Figure 1. We then take the two vectors as input to calculate their similarity. Consequently, the whole neural network can be fine-tuned towards the supervised criterion with the help of parallel data. The similarity score of the representation pair 〈zf, ze〉 is defined as the cosine similarity of the two vectors:

sim(f, e) = cos(zf, ze) = (zf · ze) / (‖zf‖ ‖ze‖)    (4)

Since a parallel sentence pair should have the same topic, our goal is to maximize the similarity score between the source sentence and the target sentence. Inspired by the contrastive estimation method (Smith and Eisner, 2005), for each parallel sentence pair 〈f, e〉 as a positive instance, we select another sentence pair 〈f′, e′〉 from the training data and treat 〈f, e′〉 as a negative instance. To make the similarity of the positive instance larger than that of the negative instance by some margin η, we utilize the following pairwise ranking loss:

L(f, e) = max{0, η − sim(f, e) + sim(f, e′)}    (5)

where η = 1/2 − sim(f, f′). The rationale behind this criterion is that the smaller sim(f, f′) is, the more we should penalize negative instances.

To effectively train the model in this task, negative instances must be selected carefully. Since different sentences may have very similar topic distributions, we select negative instances that are dissimilar to the positive instances based on the following criteria:

1. For each positive instance 〈f, e〉, we select e′ which contains at least 30% different content words from e.

2. If we cannot find such an e′, we remove 〈f, e〉 from the training instances for network learning.

The model minimizes the pairwise ranking loss across all training instances:

L = Σ_{〈f,e〉} L(f, e)    (6)

We use the standard back-propagation algorithm to further fine-tune the neural network parameters W and b in Equation (1). The learned neural networks are used to obtain sentence topic representations, which are further leveraged to infer topic representations of bilingual translation rules.
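A minimal sketch of the fine-tuning criterion in Equations (4)-(6), written under our own assumption that the encoded vectors zf, ze and those of the sampled negative pair are already available as numpy arrays; it only illustrates how the cosine similarity, the per-instance margin η, and the summed hinge loss fit together.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))   # Equation (4)

def pair_loss(z_f, z_e, z_e_neg, z_f_neg):
    eta = 0.5 - cosine(z_f, z_f_neg)                       # margin defined below Equation (5)
    # Equation (5): the positive pair <f, e> must beat the negative pair <f, e'> by eta.
    return max(0.0, eta - cosine(z_f, z_e) + cosine(z_f, z_e_neg))

def total_loss(instances):
    # instances: iterable of (z_f, z_e, z_e_neg, z_f_neg) tuples; Equation (6) sums over all of them.
    return sum(pair_loss(*inst) for inst in instances)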

3.3 Integration into SMT decoding

We incorporate the learned topic similarity scores into the standard log-linear framework for SMT. When a synchronous rule 〈α, γ〉 is extracted from a sentence pair 〈f, e〉, a triple instance I = (〈α, γ〉, 〈f, e〉, c) is collected for inferring the topic representation of 〈α, γ〉, where c is the count of the rule occurrence. Following (Chiang, 2007), we give a count of one for each phrase pair occurrence and a fractional count for each hierarchical phrase pair. The topic representation of 〈α, γ〉 is then calculated as the weighted average:

zα = ( Σ_{(〈α,γ〉,〈f,e〉,c)∈T} c × zf ) / ( Σ_{(〈α,γ〉,〈f,e〉,c)∈T} c )    (7)

zγ = ( Σ_{(〈α,γ〉,〈f,e〉,c)∈T} c × ze ) / ( Σ_{(〈α,γ〉,〈f,e〉,c)∈T} c )    (8)

where T denotes all instances for the rule 〈α, γ〉, and zα and zγ are the source-side and target-side topic vectors, respectively.
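Equations (7) and (8) are count-weighted averages of the sentence-level topic vectors; the following short sketch (with an illustrative data structure of (count, sentence vector) pairs collected for one rule) shows the computation for one side.

import numpy as np

def rule_topic_vector(instances):
    """instances: list of (c, z) pairs for one rule <alpha, gamma>, where c is the
    (possibly fractional) occurrence count and z is the topic vector of the sentence
    the rule was extracted from."""
    weighted_sum = sum(c * z for c, z in instances)   # numerator of Equations (7)/(8)
    total_count = sum(c for c, _ in instances)        # denominator
    return weighted_sum / total_count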

By measuring the similarity between the source texts and the bilingual translation rules, the SMT decoder is able to encourage topic relevant translation candidates and penalize topic irrelevant candidates. Therefore, it helps to train a smarter translation model with the embedded topic information. Given a source sentence s to be translated, we define the similarity as follows:

Sim(zs, zα) = cos(zs, zα)    (9)

Sim(zs, zγ) = cos(zs, zγ)    (10)

where zs is the topic representation of s. The similarity calculated against zα or zγ denotes the source-to-source or the source-to-target similarity.

We also consider topic sensitivity estimation, since general rules have flatter distributions while topic-specific rules have sharper distributions. A standard entropy metric is used to measure the sensitivity of the source side of 〈α, γ〉 as:

Sen(α) = − Σ_{i=1}^{|zα|} zαi × log zαi    (11)

where zαi is a component of the vector zα. The target-side sensitivity Sen(γ) can be calculated in a similar way. The larger the sensitivity, the more topic-specific the rule.
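Equation (11) is the standard entropy of the rule topic vector; a sketch, under our own assumption that the vector is normalized into a probability distribution before the computation:

import numpy as np

def sensitivity(z_alpha, eps=1e-12):
    p = z_alpha / (z_alpha.sum() + eps)            # treat the topic vector as a distribution
    return float(-np.sum(p * np.log(p + eps)))     # Equation (11)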

In addition to the traditional SMT features, we add new topic-related features into the standard log-linear framework. For the SMT system, the best translation candidate ê is given by:

ê = arg max_e P(e|f)    (12)

where the translation probability is given by:

P(e|f) ∝ Σ_i wi · log φi(f, e) = Σ_j wj · log φj(f, e) (standard) + Σ_k wk · log φk(f, e) (topic related)    (13)

where φj(f, e) is a standard feature function and wj is the corresponding feature weight, while φk(f, e) is a topic-related feature function and wk is its feature weight. The detailed feature description is as follows:

Standard features: translation model, including translation probabilities and lexical weights for both directions (4 features), 5-gram language model (1 feature), word count (1 feature), phrase count (1 feature), NULL penalty (1 feature), number of hierarchical rules used (1 feature).

Topic-related features: rule similarity scores (2 features), rule sensitivity scores (2 features).
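Equation (13) simply extends the standard log-linear score with these four topic-related features. The following schematic sketch (feature names, values and weights are placeholders of our own, not the system's actual feature set) shows how a decoder would score one candidate:

import math

def loglinear_score(standard_feats, topic_feats, w_standard, w_topic):
    """standard_feats / topic_feats: dicts mapping feature names to values phi(f, e) > 0."""
    score = sum(w_standard[name] * math.log(value) for name, value in standard_feats.items())
    score += sum(w_topic[name] * math.log(value) for name, value in topic_feats.items())
    return score   # the decoder keeps the candidate maximizing this score, Equation (12)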

4 Experiments

4.1 Setup

We evaluate the performance of our neural network based topic similarity model on a Chinese-to-English machine translation task. For neural network training, a large number of monolingual documents are collected in both the source and target languages. The documents are mainly from two domains: news and weblog. We use the Chinese and English Gigaword corpus (Version 5), which is mainly from the news domain. In addition, we also collect weblog documents with a variety of topics from the web. The total data statistics are presented in Table 1. These documents are built into an inverted index using Lucene2, which can be efficiently queried with the parallel sentence pairs. The most relevant N documents are collected, where we experiment with N = {1, 5, 10, 20, 50}.

Domain    Chinese Docs   Chinese Words   English Docs   English Words
News      5.7M           5.4B            9.9M           25.6B
Weblog    2.1M           8B              1.2M           2.9B
Total     7.8M           13.4B           11.1M          28.5B

Table 1: Statistics of monolingual data, in numbers of documents and words (main content). "M" refers to million and "B" refers to billion.

We implement a distributed framework to speed up the training process of the neural networks. The network is learned with mini-batch asynchronous gradient descent with the adaptive learning rate procedure called AdaGrad (Duchi et al., 2011). We use 32 model replicas in each iteration during training. The model parameters are averaged after each iteration and sent to each replica for the next iteration. The vocabulary size for the input layer is 100,000, and we choose different lengths for the hidden layer, L = {100, 300, 600, 1000}, in the experiments. In the pre-training phase, all parallel data is fed into the two neural networks respectively for DAE training, where the network parameters W and b are randomly initialized. In the fine-tuning phase, for each parallel sentence pair, we randomly select ten other sentence pairs which satisfy the criterion as negative instances. These training instances are leveraged to optimize the similarity of the two vectors.
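For reference, the per-coordinate AdaGrad update of Duchi et al. (2011) used by each replica can be sketched as follows; the base learning rate and parameter names here are illustrative, not the values used in the paper.

import numpy as np

def adagrad_update(param, grad, grad_sq_sum, base_lr=0.1, eps=1e-8):
    """One AdaGrad step: each coordinate's step size shrinks with its accumulated squared gradient."""
    grad_sq_sum += grad ** 2
    param -= base_lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return param, grad_sq_sum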

In SMT training, an in-house hierarchical phrase-based SMT decoder is implemented for our experiments. The CKY decoding algorithm is used and cube pruning is performed with the same default parameter settings as in Chiang (2007). The parallel data we use is released by LDC3. In total, the datasets contain nearly 1.1 million sentence pairs. Translation models are trained over the parallel data, which is automatically word-aligned using GIZA++ in both directions, and the grow-diag-final heuristic is used to refine the symmetric word alignment. An in-house language modeling toolkit is used to train the 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995). The English monolingual data used for language modeling is the same as in Table 1. The NIST 2003 dataset is the development data. The testing data consists of the NIST 2004, 2005, 2006 and 2008 datasets. The evaluation metric for the overall translation quality is case-insensitive BLEU4 (Papineni et al., 2002). The reported BLEU scores are averaged over 5 runs of MERT (Och, 2003). A statistical significance test is performed using the bootstrap resampling method (Koehn, 2004).

2http://lucene.apache.org/
3LDC2003E14, LDC2002E18, LDC2003E07, LDC2005T06, LDC2005T10, LDC2005E83, LDC2006E34, LDC2006E85, LDC2006E92, LDC2006E26, LDC2007T09

4.2 Baseline

The baseline is a re-implementation of the Hiero system (Chiang, 2007). Phrase pairs that appear only once in the parallel data are discarded because most of them are noisy. We also use the fix-discount method of Foster et al. (2006) for phrase table smoothing. This implementation makes the system perform much better and the translation model size much smaller.

We compare our method with the LDA-based approach proposed by Xiao et al. (2012). In (Xiao et al., 2012), the topic of each sentence pair is exactly the same as that of the document it belongs to. Since some of our parallel data does not have document-level information, we rely on the IR method to retrieve the most relevant document and simulate this approach. The PLDA toolkit (Liu et al., 2011) is used to infer topic distributions, which takes 34.5 hours to finish.

4.3 Effect of retrieved documents and length of hidden layers

We illustrate the relationship among translation accuracy (BLEU), the number of retrieved documents (N) and the length of the hidden layers (L) on different testing datasets. The results are shown in Figure 3. The best translation accuracy is achieved with N=10 for most settings. This confirms that enriching the source text with topic-related documents is very useful in determining topic representations, thereby helping to guide synchronous rule selection. However, we find that as N becomes larger in the experiments, e.g. N=50, the translation accuracy drops drastically.

Figure 3: End-to-end translation results (BLEU%) using all standard and topic-related features, with different settings of the number of retrieved documents N and the length of hidden layers L. The four panels plot BLEU against N ∈ {0, 5, 10, 20, 50} for L ∈ {100, 300, 600, 1000} on the NIST 2004, 2005, 2006 and 2008 datasets.

As more documents are retrieved, less relevant information is also used to train the neural networks. Irrelevant documents bring in many unrelated topic words and hence degrade neural network learning performance.

Another important factor is the length of the hidden layers L in the network. In deep learning, this parameter is often tuned empirically with human effort. As shown in Figure 3, the translation accuracy is better when L is relatively small. Actually, there is no obvious difference in performance when L is less than 600. However, when L equals 1,000, the translation accuracy is inferior to the other settings. The main reason is that the neural networks then have too many parameters to be effectively trained. As we know, when L=1000, there are a total of 100,000 × 1,000 parameters between the linear and non-linear layers in the network. Limited training data prevents the model from getting close to the global optimum. Therefore, the model is likely to fall into local optima and lead to unacceptable representations.

4.4 Effect of topic related features

We evaluate the performance of adding the new topic-related features to the log-linear model and compare the translation accuracy with the method in (Xiao et al., 2012). To make the different methods comparable, we set the dimension of the topic representation to 100 for all settings. This takes 10 hours in the pre-training phase and 22 hours in the fine-tuning phase. Table 2 shows how the accuracy is improved as more features are added. The results confirm that topic information is indispensable for SMT, since both (Xiao et al., 2012) and our neural network based method significantly outperform the baseline system. Our method improves by 0.86 BLEU points at most and 0.76 BLEU points on average over the baseline. We observe that source-side similarity is more effective than target-side similarity, but their contributions are cumulative. This proves that the bilingually induced topic representation with neural networks helps the SMT system disambiguate translation candidates. Furthermore, the rule sensitivity features improve SMT performance compared with using only the similarity features. Because topic-specific rules usually have a larger sensitivity score, they can beat general rules when they obtain the same similarity score against the input sentence. Finally, when all new features are integrated, the performance is the best, performing substantially better than (Xiao et al., 2012) by 0.39 BLEU points on average.

It is worth mentioning that the performance of (Xiao et al., 2012) is similar to the settings with N=1 and L=100 in Figure 3. This is not simply a coincidence, since we can interpret their approach as a special case of our neural network method.

Settings                       NIST 2004   NIST 2005   NIST 2006   NIST 2008   Average
Baseline                       42.25       41.21       38.05       31.16       38.17
(Xiao et al., 2012)            42.58       41.61       38.39       31.58       38.54
Sim(Src)                       42.51       41.55       38.53       31.57       38.54
Sim(Trg)                       42.43       41.48       38.4        31.49       38.45
Sim(Src+Trg)                   42.7        41.66       38.66       31.66       38.67
Sim(Src+Trg)+Sen(Src)          42.77       41.81       38.85       31.73       38.79
Sim(Src+Trg)+Sen(Trg)          42.85       41.79       38.76       31.7        38.78
Sim(Src+Trg)+Sen(Src+Trg)      42.95       41.97       38.91       31.88       38.93

Table 2: Effectiveness of different features in BLEU% (p < 0.05), with N=10 and L=100. "Sim" denotes the rule similarity feature and "Sen" denotes the rule sensitivity feature. "Src" and "Trg" mean utilizing source-side/target-side rule topic vectors to calculate similarity or sensitivity, respectively. The "Average" column is the averaged result over the four datasets.

When a parallel sentence pair has document-level information, that document will be retrieved for training; otherwise, the most relevant document will be retrieved from the monolingual data. Therefore, our method can be viewed as a more general framework than previous LDA-based approaches.

4.5 Discussion

In this section, we give a case study to explain why our method works. An example of translation rule disambiguation for a sentence from the NIST 2005 dataset is shown in Figure 4. We find that the topic of this sentence is about "rescue after a natural disaster". Under this topic, the Chinese rule "发送 X" should be translated to "deliver X" or "distribute X". However, the baseline system prefers "send X" over those two candidates. Although the translation probability of "send X" is much higher, it is inappropriate in this context, since it is usually used in IT texts, for example, 〈发送邮件, send emails〉, 〈发送信息, send messages〉 and 〈发送数据, send data〉. In contrast, with our neural network based approach, the learned topic distributions of "deliver X" and "distribute X" are more similar to that of the input sentence than "send X", as shown in Figure 4. The similarity scores indicate that "deliver X" and "distribute X" are more appropriate for translating this sentence. Therefore, adding topic-related features is able to keep the topic consistent and substantially improves the translation accuracy.

In previous LDA-based methods, if a document Doc contains M sentences, all M sentences share the same topic distribution of Doc. Although different sentences may express slightly different implications and the topic may shift, the conventional LDA-based approach does not take such topic transitions into consideration. In contrast, our approach directly learns the topic representation from an abundance of related documents. In addition to the original document from which a sentence is extracted, the IR method also retrieves other relevant documents which provide complementary topic information. Therefore, the learned topic representations are more fine-grained and thus more accurate.

5 Related Work

Topic modeling was first leveraged to improve SMT performance in (Zhao and Xing, 2006; Zhao and Xing, 2007). They proposed a bilingual topical admixture approach for word alignment and assumed that each word pair follows a topic-specific model. They reported extensive empirical analysis and improved word alignment accuracy as well as translation quality. Following this work, (Xiao et al., 2012) extended topic-specific lexicon translation models to hierarchical phrase-based translation models, where the topic information of synchronous rules was directly inferred with the help of document-level information. Experiments show that their approach not only achieved better translation performance but also provided a faster decoding speed compared with previous lexicon-based LDA methods.

Another direction of work leveraged topic modeling techniques for domain adaptation. Tam et al. (2007) used bilingual LSA to learn latent topic distributions across different languages and enforce a one-to-one topic correspondence during model training. They incorporated the bilingual topic information into language model adaptation and lexicon translation model adaptation, achieving significant improvements in a large-scale evaluation. (Su et al., 2012) investigated the relationship between out-of-domain bilingual data and in-domain monolingual data via topic mapping using HTMM methods. They estimated phrase-topic distributions in translation model adaptation and obtained better translation quality. Recently, Chen et al. (2013) proposed using a vector space model for adaptation, where genre resemblance is leveraged to improve translation accuracy. We also investigated multi-domain adaptation, where explicit topic information is used to train domain-specific models (Cui et al., 2013).

Generally, most previous research has leveraged conventional topic modeling techniques such as LDA or HTMM. In our work, a novel neural network based approach is proposed to infer topic representations for parallel data.

Src       联合国儿童基金会也开始发送基本医药包
Ref       (1) the united nations children’s fund has also begun delivering basic medical kits
          (2) the unicef has also started to distribute basic medical kits
          (3) the united nations children’s fund has also begun distributing basic medical kits
          (4) the united nations children’s fund has begun delivering basic medical kits
Baseline  the united nations children’s fund began to send basic medical kits
Ours      the united nations children’s fund has begun to distribute basic medical kits

Table 4: Translation results of the baseline system and our method for the example sentence from the NIST 2005 dataset.


Figure 4: An example from the NIST 2005 dataset. We illustrate the normalized topic representations of the source sentence 联合国儿童基金会也开始发送基本医药包 and of the three ambiguous synchronous rules 〈发送 X, deliver X〉, 〈发送 X, distribute X〉 and 〈发送 X, send X〉. Details are explained in Section 4.5.


Rules                     P(γ|α)    Sim(zs, zα)
〈发送 X, deliver X〉       0.0237    0.8469
〈发送 X, distribute X〉    0.0546    0.8268
〈发送 X, send X〉          0.2464    0.6119

Table 3: Translation probabilities and topic similarity scores of the three ambiguous rules for the example sentence.


The advantage of our method is that it is applicable to both sentence-level and document-level SMT, since we do not place any restrictions on the input. In addition, our method directly maximizes the similarity between parallel sentence pairs, which is ideal for SMT decoding. Compared to document-level topic modeling, which uses the topic of a document for all sentences within the document (Xiao et al., 2012), our contributions are:

• We proposed a more general approach to leveraging topic information for SMT by using IR methods to get a collection of related documents, regardless of whether or not document boundaries are explicitly given.

• We used neural networks to learn topic representations more accurately, with more practicable and scalable modeling techniques.

• We directly optimized the bilingual topic similarity in the deep learning framework with the help of sentence-level parallel data, so that the learned representation could be easily used in the SMT decoding procedure.

6 Conclusion and Future Work

In this paper, we propose a neural network based approach to learning bilingual topic representations for SMT. We enrich the contexts of parallel sentence pairs with topic-related monolingual data and obtain a set of documents to represent each sentence. These documents are converted to a bag-of-words input and fed into neural networks. The learned low-dimensional vector is used to obtain the topic representations of synchronous rules. In SMT decoding, appropriate rules are selected to best match the source texts according to their similarity in the topic space. Experimental results show that our approach is promising for SMT systems to learn a better translation model. It is a significant improvement over the state-of-the-art Hiero system, as well as over a conventional LDA-based method.

In future research, we will extend our neural network methods to address document-level translation, where topic transition between sentences is a crucial problem to be solved. Since the translation of the current sentence is usually influenced by the topics of previous sentences, we plan to leverage recurrent neural networks to model this phenomenon, where the translation history is naturally incorporated in the model.

Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments. We also thank Fei Huang (BBN), Nan Yang, Yajuan Duan, Hong Sun and Duyu Tang for the helpful discussions. This work is supported by the National Natural Science Foundation of China (Grant No. 61272384).

References

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2006. Greedy layer-wise training of deep networks. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge, MA.

Yoshua Bengio. 2009. Learning deep architectures for AI. Found. Trends Mach. Learn., 2(1):1–127, January.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March.

Marine Carpuat and Dekai Wu. 2005. Word sense disambiguation vs. statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 387–394, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Marine Carpuat and Dekai Wu. 2007. Context-dependent phrasal translation lexicons for statistical machine translation. Proceedings of Machine Translation Summit XI, pages 73–80.

Boxing Chen, Roland Kuhn, and George Foster. 2013. Vector space model for adaptation in statistical machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1285–1293, Sofia, Bulgaria, August. Association for Computational Linguistics.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November.

Lei Cui, Xilun Chen, Dongdong Zhang, Shujie Liu, Mu Li, and Ming Zhou. 2013. Multi-domain adaptation for SMT using multi-task learning. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1055–1065, Seattle, Washington, USA, October. Association for Computational Linguistics.

George E. Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 20(1):30–42, January.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July.

George Foster, Roland Kuhn, and Howard Johnson. 2006. Phrasetable smoothing for statistical machine translation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 53–61, Sydney, Australia, July. Association for Computational Linguistics.

Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, JMLR W&CP Volume 15, pages 315–323.

Amit Gruber, Michal Rosen-zvi, and Yair Weiss. 2007. Hidden topic Markov models. In Proceedings of Artificial Intelligence and Statistics.

Zhongjun He, Qun Liu, and Shouxun Lin. 2008. Improving statistical machine translation using lexicalized rule selection. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 321–328, Manchester, UK, August. Coling 2008 Organizing Committee.

Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554, July.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, volume 1, pages 181–184. IEEE.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Dekang Lin and Dekai Wu, editors, Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July. Association for Computational Linguistics.

Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. 2012. Imagenet classification with deep convolutional neural networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106–1114.

Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. 2006. Efficient sparse coding algorithms. In B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 801–808. MIT Press, Cambridge, MA.

Qun Liu, Zhongjun He, Yang Liu, and Shouxun Lin. 2008. Maximum entropy based rule selection model for syntax-based statistical machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 89–97, Honolulu, Hawaii, October. Association for Computational Linguistics.

Zhiyuan Liu, Yuzhou Zhang, Edward Y. Chang, and Maosong Sun. 2011. PLDA+: Parallel latent Dirichlet allocation with data placement and pipeline processing. ACM Transactions on Intelligent Systems and Technology, special issue on Large Scale Machine Learning. Software available at http://code.google.com/p/plda.

Yuval Marton and Philip Resnik. 2008. Soft syntactic constraints for hierarchical phrased-based translation. In Proceedings of ACL-08: HLT, pages 1003–1011, Columbus, Ohio, June. Association for Computational Linguistics.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July. Association for Computational Linguistics.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1988. Neurocomputing: Foundations of research. Chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA.

Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 354–362, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. 2011a. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 26th International Conference on Machine Learning (ICML).

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011b. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 151–161, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Jinsong Su, Hua Wu, Haifeng Wang, Yidong Chen, Xiaodong Shi, Huailin Dong, and Qun Liu. 2012. Translation model adaptation for statistical machine translation with monolingual topic information. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 459–468, Jeju Island, Korea, July. Association for Computational Linguistics.

Yik-Cheung Tam, Ian Lane, and Tanja Schultz. 2007. Bilingual LSA-based adaptation for statistical machine translation. Machine Translation, 21(4):187–207, December.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1096–1103, New York, NY, USA. ACM.

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res., 11:3371–3408, December.

Xinyan Xiao, Deyi Xiong, Min Zhang, Qun Liu, and Shouxun Lin. 2012. A topic similarity model for hierarchical phrase-based translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 750–758, Jeju Island, Korea, July. Association for Computational Linguistics.

Deyi Xiong and Min Zhang. 2013. A topic-based coherence model for statistical machine translation. In AAAI.

Deyi Xiong, Min Zhang, Aiti Aw, and Haizhou Li. 2009. A syntax-driven bracketing model for phrase-based translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 315–323, Suntec, Singapore, August. Association for Computational Linguistics.

Bing Zhao and Eric P. Xing. 2006. BiTAM: Bilingual topic admixture models for word alignment. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 969–976, Sydney, Australia, July. Association for Computational Linguistics.

Bing Zhao and Eric P. Xing. 2007. HM-BiTAM: Bilingual topic exploration, word alignment, and translation. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1689–1696. MIT Press, Cambridge, MA.