[IEEE 2012 International Conference on Asian Language Processing (IALP) - Hanoi, Vietnam (2012.11.13-2012.11.15)] 2012 International Conference on Asian Language Processing - An Enhanced

An Enhanced Model for Lexical Gap Processing in English-Vietnamese Machine Translation

Tuoi Phan Thi
Department of Computer Science and Computer Engineering
Ho Chi Minh City University of Technology
Ho Chi Minh City, Vietnam
[email protected]

Hai Le Manh
Department of Information Technology
HUTECH University
Ho Chi Minh City, Vietnam
[email protected]

Abstract: Lexical gap is a non-trivial problem in English-Vietnamese machine translation. The gap occurs when a word in the source language has no equivalent word in the target language. Recent studies have found that a suitable processing method, such as a word-to-phrase translation model, can improve the quality of machine translation. However, there are still cases in which the word-to-phrase model cannot pick a correct phrase to build the target sentence. This study establishes a model, based on the word-to-phrase translation model, that enhances the quality of machine translation by using a Vietnamese Lexical Functional Grammar (VLFG) rule set for Vietnamese noun phrases in the reconstruction algorithm.

Keywords: word-to-phrase translation, lexical gap, lexical functional grammar, English-Vietnamese machine translation

I. INTRODUCTION

A lexical gap is the lack of a word in the target language that is equivalent to a source word. For example, the English word “abeyant” has, in an English-Vietnamese dictionary, the equivalent Vietnamese phrase “tạm thời đình chỉ” (“tạm thời”: temporary; “đình chỉ”: suspend). In dictionary-based machine translation, lexical gaps cause structure mismatches and generate ungrammatical sentences. Hai et al. [1] developed a word-to-phrase model for lexical gap processing. The model works in three basic steps: (1) a bilingual dictionary uses Vietnamese Lexical Functional Grammar (VLFG) to keep target phrases in special structures; (2) the model replaces a lexical gap with an equivalent phrase structure from the dictionary; and (3) the model refines the target sentence structure using VLFG. However, the model successfully transferred only 82 of the 200 tested sentences [1]. Many transferred sentences were rejected by the target grammar rule set. This paper enhances the word-to-phrase model proposed in [1], limits the Vietnamese phrase structures in the bilingual dictionary, and improves the processing algorithms.
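As an illustration of how a dictionary-based system might detect a lexical gap, the following minimal sketch treats a source word as a gap when its dictionary entry is a multi-word phrase. The toy dictionary, the list-of-words entry format, and the function name `is_lexical_gap` are assumptions made for illustration; the dictionary in [1] stores VLFG structures rather than plain word lists.

```python
# Hypothetical toy dictionary: each English word maps to a list of
# Vietnamese word units. The "abeyant" entry follows the example in the
# text; the "buffalo" entry is an illustrative single-word equivalent.
TOY_DICTIONARY = {
    "abeyant": ["tạm thời", "đình chỉ"],
    "buffalo": ["trâu"],
}


def is_lexical_gap(word, dictionary=TOY_DICTIONARY):
    """Return True when the entry for `word` is a phrase (more than one
    word unit), i.e. the target language has no single equivalent word."""
    entry = dictionary.get(word.lower())
    return entry is not None and len(entry) > 1


gap_words = [w for w in ("abeyant", "buffalo") if is_lexical_gap(w)]
```

Words that are missing from the dictionary are not reported as gaps here; a real system would need a separate policy for unknown words.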

In this paper, Section 2 describes the word-to-phrase transfer model and its limitations. Section 3 presents new rules for VLFG, and Section 4 improves the algorithms to enhance the word-to-phrase transfer model. Experimental results are presented in Section 5, and conclusions are discussed in Section 6.

II. A WORD-TO-PHRASE TRANSFER MODEL AND ITS LIMITATION

The model proposed in [1] has three parts: a parser, a transfer module with a dictionary, and a generation module (Fig. 1).

Figure 1. A word-to-phrase transfer model ([1]).

In Fig. 1, the source text is analyzed by a parser. In this study, the parser is the open-source OpenNLP [3]. The transfer module maps source structures to target structures and corrects them using VLFG. In the generation module, the corrected target structures and the lexicon produce the target sentence.

Figure 2. Rejected sentence with grammar violation.

The model rejects most incorrect structures because of grammar constraints. Fig. 2 shows why a Vietnamese sentence is rejected by the model [2]. In Fig. 2, a violation occurs because a part of the c-structure (left part) does not match the f-structure (right matrix).

[Figure 2 content: c-structure (S, IP, VP over the tags PR V ADV KT ADJ) of the sentence “anh đi rồi là phải”, shown beside its f-structure with TOPIC, COMMENT, and RELATION attributes, including RELATION “là”.]

2012 International Conference on Asian Language Processing

978-0-7695-4886-9/12 $26.00 © 2012 IEEE

DOI 10.1109/IALP.2012.22


The enhanced model addresses this problem with two new approaches: (1) a new rule set for VLFG and (2) new algorithms for the generation module.

The next section describes the new rule set for VLFG.

III. NEW RULE SET FOR VLFG

Vietnamese Lexical Functional Grammar was developed for the word-to-phrase machine translation model. The English-Vietnamese dictionary builds a complicated matrix for each lexical gap, as in Fig. 3.

Figure 3. An entry of English-Vietnamese dictionary.

The new rule set for VLFG is applied together with a general phrase structure rule. This rule requires every kind of Vietnamese phrase to have a common structure. The most important structure is the noun phrase structure. Hieu et al. [4] propose a common rule for Vietnamese noun phrases, shown in Table 1.

TABLE 1. GENERAL STRUCTURES OF VIETNAMESE NOUN PHRASE COMPARED TO ENGLISH.

In Table 1, the parts of the Vietnamese noun phrase are denoted X, and Y denotes those of the English noun phrase. This common structure is used in the dictionary as a functional component constraint. The rule sets for Vietnamese verb and adjective/adverb phrase structures have been studied but need more time for testing.

With the new rule set, the system accepts some structures from the bilingual dictionary (Fig. 4).

Figure 4. New structure accepted

In Fig. 4, the components “will” and “be” should map to an empty component in Vietnamese.

IV. NEW ALGORITHM FOR WORD-TO-PHRASE MACHINE TRANSLATION MODEL

Four algorithms are published in [1]: “Pruning a component” cuts a component from a structure to create a simpler structure; “Replacing a component” moves components and replaces a simple component with a complicated structure; “Adding a component” inserts a component into a structure; and “Modifying a structure” changes phrase structures into other VLFG structures. Figure 5 describes the algorithm “Replacing a component”. This algorithm uses “Adding a component” and “Modifying a structure” to ensure that the Vietnamese sentence structure is correct after lexical gap processing.
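The pruning and adding operations can be sketched on a simple tree representation. This is a minimal illustration and not the implementation from [1]: the nested-list tree format, the function names `prune` and `add`, and the classifier label `CL` are assumptions. The pruning example removes the modal “will”, in the spirit of mapping it to an empty component.

```python
# A tree is either a leaf word (str) or a list [label, child, child, ...].

def prune(tree, label):
    """Return a copy of `tree` without any subtree labelled `label`."""
    if isinstance(tree, str):
        return tree
    head, *children = tree
    kept = [prune(child, label) for child in children
            if isinstance(child, str) or child[0] != label]
    return [head, *kept]


def add(tree, parent_label, new_child, position=0):
    """Return a copy of `tree` with `new_child` inserted at `position`
    among the children of every node labelled `parent_label`."""
    if isinstance(tree, str):
        return tree
    head, *children = tree
    children = [add(child, parent_label, new_child, position)
                for child in children]
    if head == parent_label:
        children.insert(position, new_child)
    return [head, *children]


# Hypothetical inputs: drop the modal "will", then add the classifier
# "con" in front of the noun "trâu" (buffalo).
pruned = prune(["VP", ["MD", "will"], ["VB", "go"]], "MD")
extended = add(["NP", ["NNS", "trâu"]], "NP", ["CL", "con"], 0)
```

Both functions return new trees and leave their inputs unchanged, which keeps a rejected transformation from corrupting the original parse.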

Figure 5. Diagram of the new algorithm “Component replacement”.

Table 2 gives one implementation of the algorithm “Component replacement”.

TABLE 2. ALGORITHM “COMPONENT REPLACEMENT”

The example below demonstrates lexical gap processing for the sentence “They water the buffaloes”. The word “water” has the appropriate Vietnamese phrase “cho … uống nước”, whose structure is shown in Fig. 6. The syntactic structure of the English sentence “They water the buffaloes” is (S (NP (PRP they)) (VP (VBP water) (NP (DT the) (NNS buffaloes)))). Because “water” is a lexical gap, the algorithm “Replacing a component” moves the noun phrase NP “the buffaloes” into a position in the phrase structure “cho uống nước” (water) and outputs the new structure (S (NP (PRP they)) (VP (VBP cho) (CNP (NP (DT the) (NNS buffaloes)) (VP (VBP uống) (NNS nước))))).

Figure 6. Mapping between component structure and functional structure of phrase “cho uống nước”.

This produces the grammatical Vietnamese sentence “Họ cho các con trâu uống nước” (Fig. 6).
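The replacement step in this example can be sketched as follows. This is an illustrative reading of “Replacing a component”, not the algorithm of Table 2: the nested-list trees, the `SLOT` marker, the template for “water”, and the function names are all assumptions, and the sketch handles only a verb followed by exactly one object.

```python
SLOT = "<OBJ>"  # hypothetical marker for the position the object moves into

# Assumed phrase template for the lexical gap "water": cho <OBJ> uống nước
WATER_TEMPLATE = ["VP", ["VBP", "cho"],
                  ["CNP", SLOT, ["VP", ["VBP", "uống"], ["NNS", "nước"]]]]

SOURCE = ["S", ["NP", ["PRP", "they"]],
          ["VP", ["VBP", "water"], ["NP", ["DT", "the"], ["NNS", "buffaloes"]]]]


def fill_slot(template, filler):
    """Copy `template`, putting `filler` wherever SLOT occurs."""
    if template == SLOT:
        return filler
    if isinstance(template, str):
        return template
    return [fill_slot(part, filler) for part in template]


def replace_gap(tree, gap_word, template):
    """Replace a VP headed by `gap_word` with the phrase template,
    moving the verb's single object into the template slot."""
    if isinstance(tree, str):
        return tree
    head, *children = tree
    if (head == "VP" and len(children) == 2
            and isinstance(children[0], list)
            and children[0][1:] == [gap_word]):
        return fill_slot(template, children[1])
    return [head, *(replace_gap(c, gap_word, template) for c in children)]


def to_bracket(tree):
    """Render a tree in bracketed notation."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(to_bracket(part) for part in tree) + ")"


result = replace_gap(SOURCE, "water", WATER_TEMPLATE)
```

The sketch stops at the structural transfer; translating the remaining English leaves (“they”, “the buffaloes”) and checking the result against VLFG rules are separate steps in the model.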

V. EXPERIMENT RESULTS

A. Data selection

There are two sources of data: the Wordnet database [6] and Oxford English for Computing [7]. One hundred sentences were randomly chosen from each source. The purpose of using two sets of data is to compare normal text with sentences in the information technology domain. The data was preprocessed before the parsing step. For example, any number must be written out in words, such as “two” instead of “2”.
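The number normalization mentioned above could look like the following minimal sketch. The paper does not describe its preprocessing code, so the function names and the restriction to standalone integers from 0 to 99 are assumptions; larger integers are left unchanged and decimals are not handled.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_out_numbers(sentence):
    """Replace standalone integers below 100 with their English words."""
    def replace(match):
        value = int(match.group())
        return number_to_words(value) if value < 100 else match.group()
    return re.sub(r"\b\d+\b", replace, sentence)


normalized = spell_out_numbers("The computer has 2 disks")
```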

B. English sentence parsing

English sentences are parsed by the open-source OpenNLP [3]. To check the validity of the feature structure of an English sentence, we created a unit for syntactic analysis. The mapping unit maps the English sentence structure to an appropriate Vietnamese sentence structure. Every English word is then transferred to a Vietnamese word, if one exists. Lexical gaps are passed to the lexical gap processing algorithm and converted to Vietnamese phrases.

C. Result evaluation

The method of evaluation is simple: the output text is examined by a linguistic expert and marked “pass” or “not”. Table 3 shows the ratio of sentences containing lexical gaps.

TABLE 3. RATIO OF THE SENTENCES CONTAINING LEXICAL GAP

Sentences translated by Google MT: 200

Resource                           Sentences containing lexical gap   Ratio
Wordnet [6]                        120                                60%
Oxford English for Computing [7]   100                                50%

In Table 3, of the 200 input sentences, 33 sentences from [6] and 27 sentences from [7] have wrong meanings in Vietnamese, caused by wrong meaning selection. Among the remaining sentences, 40 and 50 sentences respectively have functional structures that match the VLFG rules. Table 4 shows the results of translation and lexical gap processing.

From Table 4, two measures, recall and precision, have been calculated. Recall is 59.7% (40/67) and 68% (50/73) for the two sets of input data; precision is 40% (40/100) and 50% (50/100) respectively. Of the 90 accepted output sentences, 11 sentences (27.5%) from [6] and 10 sentences (20%) from [7] had to have their component structures rebuilt. These sentences required the structure modifying algorithm. Some complicated sentences use the “Pruning a component” algorithm.
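The reported figures can be reproduced from the counts in Table 4 with the short sketch below. It assumes that the recall denominators 67 and 73 are the sentences remaining after the wrong-meaning cases are excluded (100 − 33 and 100 − 27); this reading matches the numbers but is not stated explicitly in the text. The function name `recall_precision` is chosen for illustration.

```python
def recall_precision(total, wrong_meaning, accepted):
    """Return (recall, precision) under the assumed definitions:
    recall    = accepted / (total - wrong_meaning)
    precision = accepted / total
    """
    remaining = total - wrong_meaning
    return accepted / remaining, accepted / total


# Counts taken from Table 4.
wordnet_recall, wordnet_precision = recall_precision(100, 33, 40)
oxford_recall, oxford_precision = recall_precision(100, 27, 50)

# Share of accepted sentences whose component structures were rebuilt.
wordnet_rebuilt = 11 / 40
oxford_rebuilt = 10 / 50
```

Under this reading, 50/73 is about 68.5%, which the text reports as 68%.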

TABLE 4. RESULTS OF TRANSLATION AND LEXICAL GAP PROCESSING

Resource                           Sentences with lexical gap       Number of sentences   Ratio %
Wordnet [6]                        Sentences with wrong meaning     33                    33
                                   Sentences with wrong structure   27                    27
                                   Sentences which are accepted     40                    40
                                   Total                            100                   100
Oxford English for Computing [7]   Sentences with wrong meaning     27                    27
                                   Sentences with wrong structure   23                    23
                                   Sentences which are accepted     50                    50
                                   Total                            100                   100

VI. CONCLUSIONS

This paper has presented a phrase transfer model, extended from rule-based machine translation, to solve lexical gaps in English-Vietnamese machine translation. A bilingual dictionary was built for English-Vietnamese lexical gap entries. Although the test data is small and the evaluation is simple, some improvements have been observed.

Acknowledgements

This article is part of our research on EVMT at Ho Chi Minh City University of Technology, sponsored by Vietnam National University of Ho Chi Minh City in key project B2010-20-03TĐ “A model processing lexical gap for English-Vietnamese translation”. We deeply thank Dr. Nguyen Chanh Thanh for his advice and code corrections.

References

[1] Le Manh Hai, Phan Thi Tuoi, “Lexical gap in English-Vietnamese machine translation: What to do?”, IALP 2010, 2010 International Conference on Asian Language Processing, Harbin, Heilongjiang, China, ISBN-13: 978-0-7695-4288-1, pp. 265-269.
[2] Le Manh Hai, “A solution for lexical gap in English-Vietnamese machine translation”, PhD thesis, Vietnam National University of Ho Chi Minh City, 2011 (in Vietnamese).
[3] OpenNLP (http://incubator.apache.org/opennlp/)
[4] Nguyen Chi Hieu, “A model exploiting characteristics of target language to define appropriate English-Vietnamese base noun phrases”, PhD thesis, HCM City University of Technology, Vietnam, 2008 (in Vietnamese).
[5] Le Manh Hai, Phan Thi Tuoi, Nguyen Chi Hieu, “Dictionaries for English-Vietnamese Machine Translation”, ICCPOL 2006, Singapore, 2006.
[6] George A. Miller, “Wordnet - a lexical database for the English language”, Communications of the ACM 38, November 1995.
[7] Keith Boeckner, P. Charles Brown, “Oxford English for Computing”, http://www.vinabook.com/oxford-english-for computing-tieng-anh-cho-vi-tinh-m11i4259.html
