Chapter 1 Introduction - a reservoir of Indian thesesshodhganga.inflibnet.ac.in/bitstream/10603/4456/9/09_chapter 1.pdf · Chapter 1 Introduction ... legal domain from Punjabi into


    Chapter 1 Introduction

    Machine Translation (MT), also known as automatic translation or mechanical

    translation, is the computerized method that automates all or part of the process

    of translating from one human language to another.

    The importance of MT in the modern global world as an instrument to bridge the digital divide, its multi-disciplinary academic thrusts, and the absence of such a system in the legal domain are justification enough for this research work. This research work is an attempt to build a system to translate simple sentences in the legal domain from Punjabi into English. The need for this system arises from the translation of legal documents transferred from the District Courts of Punjab to the High Court in Chandigarh. The FIRs, which are written in Punjabi, are translated into English before being presented to the High Court. To the best of my knowledge, no Machine Translation System is being developed from Punjabi to English in the legal domain. Similar translation systems from some of the Indian languages to English have been developed, for example Hinglish, a Machine Translation System for pure (standard) Hindi to pure English forms, developed by R. Mahesh K. Sinha and Anil Thakur.[1-3]

    This chapter introduces Machine Translation, its concepts, the various approaches for Machine Translation Systems and the key activities involved in it. It also provides a formal description of the research question undertaken and its objectives, as well as the need and scope of the study. The approach followed, along with the reasons for selecting it to solve this research problem, is explained in brief. The chapter concludes by presenting the major contributions of the research work and an outline of the study. This work is based on the Gurmukhi and Roman scripts. The examples given in this thesis are in Gurmukhi script along with their transliteration, for example (sakar). The transliteration is produced with the GTrans transliteration software, which is developed by Punjabi University, Patiala, Punjab, India.

    1.1 Machine Translation

    The term 'Machine Translation' (MT) refers to the computerized system

    responsible for the production of translations with or without human assistance.[4]

    Machine Translation is an application of computer and language sciences which helps in the development of systems answering practical needs. Computer programs may not produce perfect translations of literary texts, but they do produce useful translations of technical manuals, scientific documents, commercial prospectuses, administrative memoranda and medical reports.

    1.1.1 Some Misconceptions about Machine Translation

    There are some misconceptions about MT. It is believed that the quality of

    translation from an MT system is very poor. Such a notion has some veracity

    because no existing system can produce really perfect translations. However, this

    does not make MT useless. A rough translation would be very helpful if we have


    a document containing very important information, written in a language which we

    do not understand. Moreover a human translator normally does not immediately

    produce a perfect translation. It is a general practice to divide the job of

    translating a document into two stages. The first stage is to produce a draft

    translation, which may not necessarily be perfect. It is then revised, either by the same translator or by another, with a view to improving the previous translation. For the most part, the aim of MT is only to automate the first, draft translation process and to make the overall translation process faster, simpler and cheaper.[5] MT does not threaten translators' jobs. However, MT systems can take over some of the boring, repetitive translation jobs and allow human translators to concentrate on more interesting tasks, where their specialist skills are really required.

    1.1.2 History of Machine Translation

    1.1.2.1 Machine Translation of Non-Indian Languages

    Although the idea of mechanical dictionaries originated in the 20th century, the origins of MT can be traced back to the 17th century. In 1946-1947, there came the first

    tentative idea of using the newly invented computers for translating natural

    languages. Weaver received much credit for bringing the concept of MT to the

    public when he published an influential paper on using computers for translation

    in 1949. The early 1950s were a period of intense research in MT in both the

    United States and Europe. Massachusetts Institute of Technology (MIT) started


    research in this field in 1951. In 1952 the first conference on MT was held, but it

    was not until 1954 that a translation system was demonstrated. This system was

    not that accurate but it attracted the interest of researchers and media all over the

    world. The first journal on Mechanical Translation was published at MIT in 1953.

    In 1959, IBM installed an MT system for the United States Air Force, followed by Georgetown University installing systems at Euratom and the United States Atomic Energy Commission. Despite some success of early MT systems, MT research funding was on the verge of serious reduction. The growing dissatisfaction of research sponsors caused the United States National Academy of Sciences to set up the Automatic Language Processing Advisory Committee (ALPAC) in 1964; its report followed in 1966. ALPAC, whose members were the major sponsors of current MT research

    projects, was to evaluate the effectiveness, costs and potential future progress of

    MT. The level of global MT activity probably reached its highest levels during the mid-1960s, at the time of the ALPAC report. 1976 marked a positive turning point for MT research, when Canada made public its METEO System, which translated weather forecasts. Later that year, the European Commission purchased SYSTRAN, a Russian-English system. MT interest and activity have increased ever since, and MT has been established as a legitimate field of research. The first doctoral thesis in MT, a study for a Russian mechanical dictionary, was written by Anthony G. Oettinger in 1954. The largest growth

    area has been in the marketing and sale of commercial MT systems, many for

    personal computers, and in the provision of MT-based services. The picture in

    MT research also changed in the early 1980s; however, it still concentrated largely in well-established projects at universities (Grenoble, Saarbrücken, Montréal, Texas, Kyoto, and the Eurotra project) and in connection with systems such as Systran, Logos, ALPS and Weidner. It was perhaps these systems, however crude in terms of linguistic quality, which more than anything else alerted the translation profession to the possibilities of exploiting the increasing sophistication of computers in the service of translation. The 1990s showed MT implemented as an online service. The 2000s have shown even more research into MT and many new, efficient hybrid algorithms.[6-19]

    1.1.2.2 Machine Translation in India

    The earliest efforts in Machine Translation in India date from the mid-1980s and early 1990s. Prominent among these efforts are the research and development projects at Indian Institute of Technology Kanpur, University of Hyderabad, National Center for Software Technology, Mumbai, and Centre for Development of Advanced Computing (C-DAC), Pune.[20] Since the mid and late 1990s, a few more projects have been initiated at Indian Institute of Technology Bombay, International Institute of Information Technology Hyderabad, Anna University KB Chandrasekhar Research Center, Chennai, and Jadavpur University, Kolkata. There are also a couple of efforts from the private sector, from Super Infosoft Private Limited and, more recently, the IBM India Research Laboratory. The Department of IT, Ministry of Communications and Information Technology, Government of India, has played an instrumental role by funding the projects, as have the Technology Development for Indian Languages (TDIL) program of the Ministry of Information Technology (MIT) and the UNDP.

    University Grants Commission (UGC) also started supporting minor and major

    research projects involving development of linguistic parsers and Machine

    Translation Systems. Indian Institutes of Technology (IITs), Indian Institutes of

    Information Technology (IIITs), Centre for Development of Advanced Computing

    (C-DAC), Indian Institute of Science (IIS), Indian Statistical Institute (ISI),

    Jawaharlal Nehru University (JNU), Mahatma Gandhi International Hindi

    University (MGIHU) and other institutes have significant contributions in this field.

    The private enterprises like Tata Institute of Fundamental Research (TIFR), Tata

    Consultancy Services (TCS) have also funded Indian language technology

    research and development.[21,22]

    IIT Guwahati, CDAC Kolkata and Jawaharlal Nehru University (JNU), New Delhi, are also involved in developing Machine Translation Systems for different Indian languages.[20] Punjabi University, Patiala, has also entered the field of Machine Translation and has successfully developed Hindi-Punjabi and Punjabi-Hindi Machine Translation Systems. Thapar University, Patiala, is also working on a UNL-based Machine Translation System.[22, 23]

    1.1.3 Approaches used in Machine Translation

    Broadly, the approaches used for translation are rule-based and corpus-based. The rule-based approach is further classified into direct, interlingual and transfer-based approaches. The direct translation approach was typical of the "first generation" of MT systems. The indirect approaches of interlingua and transfer-based systems characterise the "second generation" of MT systems. Interlingua and transfer approaches are essentially based on the specification of rules (for morphology, syntax, lexical selection, semantic analysis, and generation). In the last few years, a "third generation" of MT systems has begun to emerge, based on knowledge-based, corpus-based and hybrid approaches. Corpus-based methods do not rely on external knowledge sources such as machine-readable dictionaries, concept hierarchies, or sense-tagged texts. They do not assign sense tags to words; rather, they rely on monolingual corpora and on methods based on translational equivalence as found in word-aligned parallel corpora.[25, 26]

    1.1.3.1 Rule Based Approaches

    Rule-based MT assumes that translation is a process which requires the analysis and representation of the 'meaning' of source language texts and the generation of equivalent target language texts. Representations should be unambiguous lexically and structurally. There are three major approaches: (a) the 'Direct' approach, in which word-to-word translation is performed without analyzing the text structurally; (b) the 'Transfer' approach, in which the translation process operates in three stages: Analysis into abstract source language representations, Transfer into abstract target language representations, and Generation or Synthesis into target language texts; and (c) the two-stage 'Interlingua' model, where analysis is into some language-neutral representation and generation starts from this interlingua representation.[27]

    1.1.3.1.1 Direct Approach

    Direct translation is the oldest approach to MT. If an MT system uses direct translation, it usually means that the source language text will not be analyzed structurally beyond morphology. The translation is based on large dictionaries and word-by-word translation with some simple grammatical adjustments, e.g. to word order and morphology. The translation unit of the approach is usually a word. The lexicon is normally conceived of as the repository of word-specific information. Traditional lexical resources are machine-readable dictionaries that contain lists of words. These lists might delineate the senses of a word, represent the meaning of a word, or specify the syntactic frames in which a word can appear, but the level of granularity with which they are concerned is the individual word. One of the oldest MT systems still in use today, Systran, is basically a direct translation system. Its first version was released in 1969. Over the years the system has been developed considerably, but its translation capability is still mainly based on very large bilingual dictionaries. No general linguistic theory or parsing principles are necessarily present for direct translation to work; these systems depend instead on well-developed dictionaries, morphological analysis, and text processing software.[28] Figure 1.1 shows the block diagram of a Direct MT System.

    Figure 1.1 Direct MT System

    E.g. Direct translation from Punjabi to English is

    P:

    T: pach ui

    E: Bird flew
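    The word-by-word lookup described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the romanized Punjabi tokens (`panchi`, `uddia`), the `LEXICON` dictionary and the function name `direct_translate` are hypothetical and are not taken from the thesis or from its GTrans transliteration.

```python
# Toy bilingual dictionary: hypothetical romanized Punjabi -> English.
# The entries are illustrative assumptions, not data from the thesis.
LEXICON = {
    "panchi": "bird",
    "uddia": "flew",
}


def direct_translate(sentence, lexicon=LEXICON):
    """Translate word by word, with no structural analysis.

    Words missing from the dictionary are passed through unchanged.
    """
    words = sentence.lower().split()
    translated = [lexicon.get(word, word) for word in words]
    # Only a trivial surface adjustment is made: capitalise the sentence.
    return " ".join(translated).capitalize()


print(direct_translate("panchi uddia"))
```

    Because each word is replaced independently, the sketch keeps the source word order, which is exactly why the approach suits closely related language pairs better than structurally different ones.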

    The disadvantage of the direct method is that it is unidirectional, i.e., if the target is to be translated back into the source language, a different transformation must be used. It needs n(n-1) translation modules for translation among n languages, so the number of modules grows quadratically in a multi-language translation system. Another problem with the direct method arises when the structure of the sentence is complex: correct translation then requires complex grammatical analysis, and the word order in the target language sentence can often be wrong. Additionally, if lexical ambiguity exists, words are translated incorrectly.[28] Analysis of relations between different parts of the sentence is often lacking, which can lead to poor translation. Direct translation is very inaccurate for languages with structural and lexical differences. A direct translation system is therefore used for closely related language pairs, e.g. Punjabi-Hindi translation or vice versa.

    [Figure 1.1 block diagram labels: Language X, Analysis of Language X, Synthesis of Language Y, Language Y]

    Georgetown Automatic Translation (GAT) System developed by Georgetown

    University, used direct approach for translating Russian texts (mainly from

    physics and organic chemistry) to English.[29] The Mark II is also a direct

    translation approach based Russian to English MT System for U.S. Air Force.[30]

    RUSLAN is a direct Machine Translation System between closely related

    languages, Czech and Russian.[31] SYSTRAN, as described by Hutchins and Somers, is a direct Machine Translation System. The system was originally built for the Russian-English language pair.[32] In India, the team of G. S. Lehal, G. S.

    Joshan and Vishal Goyal at Punjabi University Patiala developed online Hindi-

    Punjabi and Punjabi-Hindi Machine Translation Systems using direct translation

    approach.[33]

    1.1.3.1.2 Transfer Approach

    When the major shortcomings of direct translation were realized, researchers started working on the transfer method. It occupies the level above direct translation in the MT pyramid and is also known as indirect or Linguistic Knowledge (LK) translation. With the Transformer architecture, or Direct MT, the translation process relies on some knowledge of the source language and some knowledge about how to transform partly analysed source sentences into strings that look like target language sentences. With the Transfer-based architecture, on the other hand, translation relies on extensive knowledge of both the source and the target languages and of the relationships between analysed sentences in both languages. It requires linguistic knowledge of the source and target languages as well as of the differences between them. The transfer method first parses the source language sentence in the Analysis stage, then applies rules that map the grammatical segments of the source sentence to a representation in the target language in the Transfer stage, and finally generates the target language sentences in the Synthesis stage.[5]

    In this approach, the software attempts to deconstruct the grammar of the

    input language to build a grammatical model of each sentence. The grammatical

    model of the input language is then mapped to the grammatical model of the

    output language. Transfer systems divide translation into steps which clearly

    differentiate source language and target language parts. In the transfer approach

    only the ambiguities inherent in the language in question are tackled. Rather than

    operating in two stages, Transfer approach has three stages. The first stage

    converts texts into intermediate representations in which ambiguities have been

    resolved irrespective of any other language. Differences between languages, in

    vocabulary and structure, are handled in the intermediary transfer stage. In the

    third stage these are converted into equivalent representations of the target

    language. Analysis and generation programs are specific for particular languages

    and independent of each other.

    For Example

    P : (SNP) (ONP) (VP)

    T: (SNP) bacc (ONP) mih (VP) pasand karad han


    E: (SNP)Children (VP)Like (ONP)Sweets

    The above example shows the Punjabi to English translation of a sentence. After syntactically and semantically analyzing the sentence, we can translate it even when the two languages have different structures (SOV in Punjabi, SVO in English). The transfer approach uses n(n-1) transfer modules, n analysis components, and n synthesis components, where n is the number of languages in the translation system. Thus, one of its bottlenecks is the sheer size of the rule set needed for its implementation. Figure 1.2 shows the block diagram of a Transfer MT System.
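    The three stages of the transfer approach can be sketched in Python as follows, using the SNP/ONP/VP labels of the example above. This is a minimal illustration under stated assumptions: the input is taken to be already chunked in SOV order, and the romanized tokens, the `PHRASE_LEXICON` table and all function names are hypothetical rather than part of the thesis.

```python
# Hypothetical phrase lexicon (romanized Punjabi -> English); illustrative only.
PHRASE_LEXICON = {
    "bacche": "children",
    "mithai": "sweets",
    "pasand karde han": "like",
}


def analyse(chunks):
    """Analysis: label pre-chunked SOV input as subject, object and verb phrases."""
    subject, obj, verb = chunks
    return {"SNP": subject, "ONP": obj, "VP": verb}


def transfer(source_repr, lexicon=PHRASE_LEXICON):
    """Transfer: map each source phrase to a target phrase, keeping its role label."""
    return {role: lexicon[phrase] for role, phrase in source_repr.items()}


def synthesise(target_repr):
    """Synthesis: emit the phrases in English SVO order."""
    ordered = [target_repr[role] for role in ("SNP", "VP", "ONP")]
    return " ".join(ordered).capitalize()


def transfer_translate(chunks):
    """Run analysis, transfer and synthesis in sequence."""
    return synthesise(transfer(analyse(chunks)))


print(transfer_translate(["bacche", "mithai", "pasand karde han"]))
```

    Keeping the reordering inside the synthesis step means the analysis stage stays specific to the source language, which mirrors the separation of source and target parts described above.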

    Figure 1.2 Transfer MT System

    [Block diagram labels: Language X, Language X Analyzer, Lang X-Lang Y Transfer Module, Language Y Synthesizer, Language Y]

    The TAUM project, developed at Montreal in 1970 for translation of weather forecasts from English to French, uses a Syntactic Transfer System. AnglaBharti, an Indian system developed at IIT Kanpur under the expert guidance of RMK Sinha, deals with Machine Translation from English to Indian languages, primarily Hindi, using a pseudo-interlingual rule-based Transfer Approach. It uses post-editing to resolve ambiguity/complexity. It is mainly developed for the public health domain.[34] MaTra is another Indian-language, Human-Assisted translation project for English to Indian languages based on the Transfer Approach.[21] The MaTra lexicon approach is general-purpose, but the system has been applied mainly in the domains of news, annual reports and technical phrases. The Computer Science Department at the University of Hyderabad has been working on an English-Kannada MT system, using the Universal Clause Structure Grammar (UCSG) formalism. It is again based on the transfer approach, and has been applied to the domain of government circulars. Jadavpur University at Kolkata developed a rule-based English-Hindi MAT system for news using the Transfer Approach.[21]

    1.1.3.1.3 Interlingual Approach

    The Interlingual Approach was historically the next step in the development of MT. In an interlingua-based MT approach, translation is done via an intermediary (semantic) representation of the source language text. The interlingua is supposed to be a language-independent representation from which translations can be generated into different target languages. The interlingua approach assumes that it is possible to convert source texts into representations common to more than one language. From such interlingual representations, texts are generated into other languages. Translation is thus in two stages: from the source language (SL) to the interlingua (IL) and from the IL to the target language (TL). Programs for analysis are independent of programs for generation; in a multilingual configuration, any analysis program can be linked to any generation program. Procedures for SL analysis are intended to be SL-specific and not oriented to any particular TL; likewise, programs for TL synthesis are TL-specific and not designed for input from particular SLs. Translation from and into n languages requires n(n-1) bilingual 'direct translation' systems, but with translation via an interlingua just 2n interlingual programs are needed. With more than three languages the interlingua approach is claimed to be more economical. On the other hand, the complexity of the interlingua itself is greatly increased. Perhaps then "Machine Translation" is not an appropriate term, since the machine only completes the first stage of the process. It would be more accurate to talk of a tool that aids the translation process, rather than an independent translation system.[27] Figure 1.3 shows the block diagram of an Interlingual MT System.
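    The module counts quoted above, n(n-1) for pairwise direct systems against 2n for an interlingua, can be checked with a short Python sketch. The function names are illustrative assumptions; only the two formulas come from the text.

```python
def direct_modules(n):
    """Bilingual direct systems needed for all ordered pairs of n languages."""
    return n * (n - 1)


def interlingua_modules(n):
    """One analysis program plus one generation program per language."""
    return 2 * n


def compare(max_languages):
    """Return (n, direct, interlingua) rows for n = 2 .. max_languages."""
    return [
        (n, direct_modules(n), interlingua_modules(n))
        for n in range(2, max_languages + 1)
    ]


for n, direct, interlingua in compare(6):
    print(n, direct, interlingua)
```

    The two counts are equal at n = 3 (six modules each), and the interlingua count is smaller for every n above three, which is the break-even point the paragraph refers to.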

    Fig 1.3 Interlingual MT System

    Eg. Interlingual representation of sentence in Universal Networking Language is

    P:

    T: pach ui

    UNL: agt ( , (icl>bird))

    There are a few problems with the interlingual approach. It requires an analyzer for each source language and a generator for each target language. Analysis of the source text requires deep semantic analysis, which in turn requires extensive world knowledge. Unfortunately, the true meaning of a sentence cannot always be extracted. Additionally, if a text is analyzed as deeply as is expected, then much of the source author's style will be lost. A further problem is that using an interlingua in MT can, in some cases, lead to extra, unnecessary work.

    [Figure 1.3 block diagram labels: Language X, Language X Analyzer, Interlingua, Language Y Synthesizer, Language Y]

    During the 1970s, the University of Texas developed the METAL system for German and English using the interlingua approach.[35] At the end of the 1990s, the Institute of Advanced Studies of the United Nations University, Tokyo, began its multinational interlingua-based MT project built on a standardized intermediary language, the Universal Networking Language (UNL). UNL is an international project of the United Nations University that aims to create an interlingua for all major human languages. It was initially intended for the six official languages of the United Nations and then for other widely spoken languages; the languages covered are Hindi, Arabic, Chinese, English, French, German, Indonesian, Italian, Japanese, Portuguese, Russian, and Spanish.[35] IIT Bombay is one of the members of the team responsible for developing UNL models for Hindi. The AnglaBharti system (developed at IIT Kanpur) uses a pseudo-interlingua approach: it analyzes English only once and creates an intermediate structure called Pseudo Lingua for Indian Languages (PLIL), instead of requiring a separate translator from English to each Indian language. The PLIL structure can be converted to each Indian language through a process of text generation. The idea of using PLIL is primarily to exploit the structural similarity among the Indian languages to obtain advantages similar to those of the interlingua approach.[36] A team at Thapar University, Patiala, is working on a Punjabi Language Server which includes a Punjabi-UNL Enconverter and a UNL-Punjabi Deconverter.[33]

    1.1.3.2 Data Driven Approaches

    Most recently, corpus-based methods have changed the traditional picture. Corpus-based methods of word sense discrimination are knowledge-lean and do not rely on external knowledge sources such as machine-readable dictionaries, concept hierarchies, or sense-tagged text. They do not assign sense tags to words; rather, they discriminate among word meanings based on information found in unannotated corpora. They rely on methods based on translational equivalence as found in word-aligned parallel corpora. Corpus-based approaches to Machine Translation have partially succeeded in replacing traditional rule-based approaches.[26] The main advantage of corpus-based Machine Translation Systems is that they are self-customizing, in the sense that they can learn the translations of terminology and even stylistic phrasing from previously translated materials.

    1.1.3.2.1 Knowledge-Based MT

    Knowledge-Based MT (KBMT) is used to fill the gap between the two extremes of human-only and machine-only translation. It provides high-quality translation much faster and at much lower cost. It combines tightly integrated translation technologies with unique translation processes driven by highly skilled linguists. The objective of knowledge-based translation is to capture as much as possible of the linguists' knowledge in the translation system's knowledge base. For this, the system makes use of source and target language dictionaries, source and target language structures and rules, word meanings in different contexts and language constructs, domain-specific terminology, previously translated words, phrases, sentences and paragraphs, language style, cultural differences, etc. By capturing all these knowledge sources, it produces high-quality output. Figure 1.4 shows a Knowledge-Based Machine Translation System.

    Figure 1.4: KBMT Representation

    This model does not require total understanding of the source text but assumes

    that an interpretation engine can achieve successful translation into several

    languages. KBMT is implemented on the interlingual architecture; it differs from

    other interlingual techniques in the extent of depth to which it will analyze the

    source language and its reliance on explicit knowledge of the world.[25] Figure 1.4 gives the graphical representation of the KBMT System.

    The KANTOO project is an object-oriented C++ implementation of KANT technology for Machine Translation. KANTOO is designed to be a more robust, efficient and maintainable version of KANT for commercial customers. Besides the Analyzer and Generator, KANTOO has an integrated set of support tools for efficient knowledge maintenance. The LUTE project at NTT and ETL research, a Japanese multilingual project, also applied the knowledge-based approach.[27, 28]

    [Figure 1.4 block diagram labels: Language X, Language X Analyzer, Knowledge Representation, Augmentor, Language Y Synthesizer, Language Y]

    1.1.3.2.2 Example-Based MT

    This method was proposed in 1981, and the Distributed Language Translation (DLT) System of Japan is based on this approach. The example-based approach is founded on extracting and selecting equivalent phrases or word groups from a databank of parallel bilingual texts, which have been aligned either by statistical methods or by more traditional rule-based methods. The main advantage of the approach (in comparison with the rule-based approach) is that the results are more likely to be accurate and idiomatic, since the texts have been extracted from databanks of actual translations produced by professional translators.[12]

    Figure 1.5 EBMT Representation

    [Block diagram labels: Language X, Language X Analyzer, Matching Expression Conversion, Translation Memory, Language Y Synthesizer, Language Y]

    The idea behind Example-Based MT (EBMT) is to translate a sentence using previously analyzed examples of similar sentences. A database of previously analyzed text is stored in the Translation Memory (TMEM), as shown in Figure 1.5. TMEM enables translators to store original texts and their translated versions side by side, so that corresponding sentences of the source and target are aligned. The translator can thus search the translation memory for phrases or full sentences written in one language and display the corresponding phrases in the other language, matched either exactly or approximately. Ideally, the system will find an exact structural match for the source sentence and replace the source words with the target words. However, it is often the case that there is no exact match for a source sentence. In this case, the system will chunk the source sentence and try to find a match for each chunk in the example database.
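    The exact and approximate lookup in a translation memory can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the English-German sentence pairs, the `TRANSLATION_MEMORY` table, the `lookup` function and the 0.8 similarity cutoff are all hypothetical, the fuzzy match uses character-level similarity from the standard `difflib` module rather than any method named in the thesis, and the chunking fallback described above is left out.

```python
import difflib

# Hypothetical translation memory of aligned sentence pairs.
# English-German is used purely for readability; entries are illustrative.
TRANSLATION_MEMORY = {
    "the court is closed": "das Gericht ist geschlossen",
    "the court is open": "das Gericht ist offen",
}


def lookup(sentence, memory=TRANSLATION_MEMORY, cutoff=0.8):
    """Return (matched_source, target, kind), or None if nothing is close enough."""
    key = sentence.lower().strip()
    if key in memory:
        return key, memory[key], "exact"
    # Fall back to the single most similar stored source sentence.
    close = difflib.get_close_matches(key, list(memory), n=1, cutoff=cutoff)
    if close:
        return close[0], memory[close[0]], "approximate"
    return None


print(lookup("The court is open"))
print(lookup("the court is opened"))
print(lookup("zzz"))
```

    An approximate hit returns the stored translation of the nearest example unchanged, so in practice it would still need adapting to the words that differ from the input.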

    AnglaBharti system developed at IIT Kanpur, uses example-base to

    identify noun and verb phrases to resolve their ambiguities.[36] AnglaBharti-II

    launched in 2004, addresses many of the shortcomings of the earlier

    architecture. It uses example-based approach to eliminate the difficulties in

    making the modification of the rule-base.[37] AnglaHindi, another system

    developed at IIT Kanpur, besides using all the modules of AnglaBharti, also

    makes use of an abstracted example-base for translating frequently encountered

    noun phrases and verb phrases. In AnglaHindi, the example-based approach is

    invoked before the rule-based approach. The example-base is statistically

    derived from the corpus.[34]


    1.1.3.2.3 Statistical approach

    The statistical approach is relatively new. Statistical methods serve as the
    means of analysis and generation; no linguistic rules are applied.[25] The
    essence of the method lies in aligning phrases, word groups and individual
    words of parallel texts, and in calculating the probability that a word in a
    sentence of one language corresponds to a word or words in the translated
    sentence. Statistical MT gives more acceptable results by picking, for each
    position, the word(s) with the highest probability of occupying that position
    given the surrounding words. The MT engine is trained on large volumes of
    existing content and its translation, known as "bilingual text corpora", from
    which it derives statistical rules. These rules determine the selection based
    on the probability of correct translation of a given word, phrase or sentence.
    Large volumes of electronic text of similar content are required to get the
    best quality output from the MT engine. By the turn of the century, this
    approach, in which a word or phrase is translated to one of a number of
    possibilities according to the probability of correct translation, had achieved
    marked success. The best examples substantially outperform rule-based systems.
    Statistics-based Machine Translation (SMT) may also prove easier and less
    expensive to expand, if the system can be taught new knowledge domains or
    languages by giving it large samples of existing human-translated texts.
    Despite some success, however, severe problems remain: outputs are often
    ungrammatical, and the quality and accuracy of translation fall well below
    those of a human linguist. Statistical-Based MT (SBMT) uses statistical
    techniques such as n-gram modeling, maximum entropy modeling and decision tree
    modeling. All pure SBMT systems derive their data from previously analyzed
    corpora and do not rely on linguistic information. SBMT methods select the best
    choice based on Bayes' theorem: $\hat{w} = \arg\max_{w} P(w \mid s)$, where $s$
    denotes the surrounding words. That is, SBMT picks the word $w$ that has the
    highest probability of occupying its current position, given the surrounding
    words.
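    The selection rule argmax_w P(w | s) can be sketched with a bigram model, in
    which the surrounding words s are reduced to the single preceding word. This
    reduction, the toy corpus and the function names `train_bigrams` and
    `best_word` are assumptions for illustration; a real SBMT system is trained on
    a large bilingual corpus and uses richer models.

```python
from collections import Counter, defaultdict

# Tiny illustrative corpus; a real system would train on a large corpus.
CORPUS = [
    "the court passed the order",
    "the court heard the case",
    "the police registered the case",
    "the court passed the judgment",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def best_word(counts, prev, candidates):
    """Return argmax_w P(w | prev) over the candidate words, or None if prev is unseen."""
    total = sum(counts[prev].values())
    if total == 0:
        return None
    # P(w | prev) is estimated as count(prev, w) / count(prev).
    return max(candidates, key=lambda w: counts[prev][w] / total)
```

    With this corpus, `best_word(train_bigrams(CORPUS), "court", ["heard", "passed"])`
    selects "passed", because "court passed" occurs twice and "court heard" once.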

    RAND Corporation undertook statistical analyses of a large corpus of Russian
    physics texts to extract bilingual glossaries and grammatical information. The
    IBM India Research Lab at New Delhi has recently initiated work on statistical MT
    between English and Indian languages, building on IBM's existing work on
    statistical MT.[21] The Google language translator also uses a statistical
    approach for translation. Microsoft Bing Translator, which allows users to
    translate texts or entire web pages into different languages, also uses a
    Statistical Machine Translation strategy to some extent.

    1.1.3.3 Hybrid Approach

    During the last few years, a "third generation" of hybrid systems, combining the
    rule-based approaches of the earlier types with the more recent corpus-based
    methods, has also emerged. Hybrid methods are still fundamentally
    statistics-based, but incorporate higher-level abstract syntax rules to arrive
    at the final translation.

    An interactive Japanese to English translation system, introduced to help
    non-native speakers of English write English material, uses a hybrid approach.
    A Turkish to English Machine Translation System combines two different
    approaches to MT: it transfers a Turkish sentence to all of its possible English
    translations using a set of manually written transfer rules, and then uses a
    probabilistic language model to pick the most probable translation out of this
    set. SisHiTra, developed by Gonzalez et al., is also a hybrid Machine Translation
    System, from Spanish to Catalan. This project combines knowledge-based and
    corpus-based techniques to produce a Spanish-to-Catalan Machine Translation
    System with no semantic constraints. The Bengali to Hindi Machine Translation
    System developed at IIT Kharagpur also uses the hybrid approach.[38-41]

    1.1.4 Key Activities[41-45]

    The common key activities that make up a Machine Translation System are

    described below. These activities are usually executed in

    sequence. However, depending upon the technique being followed, one or more

    of these activities may be omitted.

    PRE-PROCESSING: This module tokenizes the input text into words based on a
    list of word boundaries. The pre-processing phase also includes filtering the
    text, that is, detecting and marking certain special expressions such as named
    entities and collocations. Another important pre-processing task is text
    normalization, which checks for spelling variations and replaces them with
    standard spellings. Pre-processing may also include activities that reduce the
    complexity of translating the source language and increase the accuracy of the
    translator.

    MORPHOLOGICAL ANALYSIS: The purpose of a morphological analyzer is to
    return the root word and the grammatical information about all the possible
    word classes for a given word. The morphological analysis phase also extracts
    grammatical information, including number, gender and tense, for all the
    tokens. Since Indian languages have a rich inflectional morphology, a
    morphological analyzer is an essential tool for such languages.

    PART OF SPEECH TAGGING: The output of the morphological analyzer is
    usually ambiguous because a single word in the source language may have a
    number of tags. A particular word can be used as a noun, an adjective, a verb
    and so on. The part of speech tagger disambiguates the output of the
    morphological analyzer by using the context in which the word is used.

    PHRASE CHUNKING: Chunking is a way of organizing information into familiar
    groupings. Phrase chunking is a natural language process that segments
    sentences into their sub-constituents. Typical chunks are noun phrases,
    prepositional phrases and verb phrases. Chunking works on POS-tagged text, so
    its accuracy depends upon the accuracy of the POS tagger. The chunker can be
    rule-based or probabilistic.

    TRANSLATION AND TRANSLITERATION: All of the above activities analyse the
    given input. Once all the necessary information about the words in a sentence
    is available, the next step is to find their equivalents in the target
    language. The translation engine has two parts, Translation and Transliteration.
    Translation finds the word equivalent in a bilingual lexicon. Transliteration
    writes the word in a different script without interpreting its meaning, using a
    lexicon of character mappings between source and target language characters. It
    is used for out-of-vocabulary words and recognised named entities. All other
    words are translated.

    In direct MT systems, source language words are simply replaced by target
    language words, but in indirect MT systems, synthesizers for target language
    phrases are also needed.

    SYNTHESIS: If the source language and target language have different word
    orders, this step reorders the words according to the grammar of the target
    language. For example, the word order in Punjabi is Subject-Object-Verb, whereas
    English is a Subject-Verb-Object language, so reordering techniques that follow
    the grammar of the target language are required.
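    The reordering step can be sketched as follows, assuming that each translated
    chunk has already been labelled with its role in the sentence. The role labels
    "S", "O" and "V" and the function name `sov_to_svo` are hypothetical; real
    synthesis rules also have to place modifiers and other phrase types.

```python
def sov_to_svo(chunks):
    """Reorder labelled chunks from Subject-Object-Verb to Subject-Verb-Object.

    `chunks` is a list of (role, text) pairs where role is "S", "O" or "V".
    Python's sort is stable, so chunks with the same role keep their order.
    """
    order = {"S": 0, "V": 1, "O": 2}
    return sorted(chunks, key=lambda chunk: order[chunk[0]])


# Chunks in Punjabi (SOV) order, already translated word by word.
translated = [("S", "the complainant"), ("O", "the statement"), ("V", "accepted")]
sentence = " ".join(text for _, text in sov_to_svo(translated))
```

    For the chunks above, `sentence` becomes "the complainant accepted the statement".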

    POST PROCESSING: Post processing improves the quality of the translation
    produced by the machine by making corrections in the generated output. The
    extent of post processing required depends upon the quality of the output
    received. The post processor is essentially a corrector of ill-formed sentences.

    1.2 Research questions

    Presently there is no Machine Translation System available from Punjabi to
    English in the legal domain; however, a grammar checker and a POS tagger are
    available for the Punjabi language. Similar Machine Translation Systems are
    available from Indian languages to English, but these belong to different
    domains. Systems are available from Hindi to English for the domains of
    public health, news and annual reports, but none is available for Punjabi.

    The problem statement for the present research work has been formulated as

    below:

    To develop algorithms and lexical resources along with a software package

    to translate a simple sentence written in Punjabi language to English. The

    sentence should lie in legal domain and should follow a particular syntax.

    The present research study aims to develop a Punjabi to English Machine
    Translation System for legal documents, which translates a simple Punjabi
    sentence in the legal domain to English. The system will be helpful to persons
    with little knowledge of English. They can translate sentences in Punjabi easily
    into sentences in English without the need of any interpreter, thus removing the
    language barrier.

    1.2.1 Objectives

    The objectives of this study are:

    1. To study the Punjabi and English languages and their divergences.
    2. To study the inflectional morphology of Punjabi and the various types of
       agreement in Punjabi sentences.
    3. To adapt existing lexical resources, such as the morph database, for part of
       speech tagging.
    4. To develop a lexicon for collocations in Punjabi text.
    5. To develop algorithms for the part of speech tagging and phrase chunking
       modules.
    6. To develop a module for finding the gender (Masculine, Feminine or Both)
       and number (Singular, Plural) information for the nouns and pronouns
       used in the sentence.
    7. To develop a module to translate tagged Punjabi words to their English
       equivalents.
    8. To develop a transliteration module for handling named entities and
       out-of-vocabulary words.
    9. To develop an algorithm for synthesizing translated phrases into an English
       sentence.
    10. To develop an algorithm for post processing tasks.
    11. To develop test cases for evaluating the system critically.

    1.2.2 Challenges

    MT across languages is a challenging task for several reasons, such as the
    difference in the structure of the source and target languages, ambiguity,
    multiword units like idioms and phrases, tense generation and many more. Some of
    the major challenges faced in the development of the Punjabi to English MT
    system are as follows.

    1. Word ordering is different for Punjabi and English. In Punjabi, the word
       order is Subject-Object-Verb (SOV), whereas in English it is
       Subject-Verb-Object (SVO). Lexical differences also exist between the two
       languages, as in some cases a group of words used in Punjabi has a
       single-word equivalent in English.
    2. Articles are used in English but not in Punjabi. In some cases, the articles
       can be added at the time of post processing to correct the sentence.
    3. There is a lack of lexical resources such as a digital bilingual dictionary
       and a tagged corpus. No machine-readable Punjabi to English dictionary is
       available that can be used directly for translation, although dictionaries
       that explain the meaning of a word are available. The Morphological Analyzer
       for Punjabi developed at Punjabi University, Patiala cannot be used directly
       in the system; however, its database can be adapted for the Punjabi to
       English translation system. No tagged corpus is available for statistical
       tagging, so a tagged corpus has been built from the set of training
       sentences.
    4. A Punjabi word may have multiple translations in English, depending upon the
       context in which the word appears in the sentence.
    5. Proper nouns present in the sentence have to be identified.
    6. Punjabi is a free word order language, so identifying the phrase that
       functions as the subject of the sentence was a challenging task.
    7. Developing a rule-based system for Machine Translation is itself a major
       challenge: a rule made for one type of sentence may be overruled in another
       type of sentence.
    8. The output of the translator needs some grammatical correction.

    1.2.3 Need and Scope

    The need for the system arises from the translation of legal documents
    transferred from the district courts of Punjab to the High Court in Chandigarh.
    FIRs, which are written in Punjabi, are translated into English before being
    presented to the High Court. The scope of the system can be extended to
    many legal agencies where translation from Punjabi to English is needed.

    The need for legal document translation can arise in a number of different
    situations, from the finalisation of large international business deals to the
    relocation of employees from one company site to another across national
    borders. Serious circumstances, such as litigation and disputes over business
    affairs taken to foreign courts, can also call for legal translation. Legal
    translation services may be required by any business or individual, though they
    are most commonly required by law offices and courts, especially for court
    proceedings at an international level. So, when lawyers deal with foreign
    documents, legal translation services are a must.

    1.2.4 Potential Use

    Today the need for translation is much greater than it was in the past.
    Translation is undoubtedly an important topic socially, politically,
    commercially, scientifically and intellectually (or philosophically), and one
    whose importance is likely to increase day by day. Some of the areas
    highlighting the importance of MT are briefly described below:

    The socio-political importance of MT arises in communities where more

    than one language is generally spoken. Here, the only viable alternative to

    rather widespread use of translation is the adoption of a single common

    language, which is not an attractive alternative, because it involves the

    dominance of the chosen language, to the disadvantage of speakers of the

    other languages, and raises the prospect of the other languages becoming

    second-class, and ultimately disappearing. Since the loss of a language

    often involves the disappearance of a distinctive culture, and a way of

    thinking, this is a loss that should matter to everyone. So translation is

    necessary for communication for ordinary human interaction and for

    gathering the required information.[10] The major problem of

    translation is the scarcity of human translators, and there is a

    limit on the extent of their productivity without automation. In short, the

    automation of translation is a social and political necessity for modern

    societies that do not wish to impose a common language on their members.

    It is also a necessity of organizations like the European Community and

    the UN, for whom multilingualism is both a basic principle and a fact of

    everyday life.

    The commercial importance of MT is summarized below:

    (a) Manual translation is expensive. Translation is a highly skilled job,
    requiring knowledge of a number of languages, and in some countries
    translators' salaries are comparable to those of other highly trained
    professionals.

    (b) Machine Translation is fast, whereas manual translation is time-consuming
    and sometimes causes loss of revenue for the company. A professional
    translator translates approximately 4-6 pages (approximately 2000 words) per
    day, which lengthens the time needed to translate product documentation and
    thus delays the launch of a new product. Given these drawbacks of manual
    translation, Machine Translation is important in speeding up the process of
    translation. The output of the machine translator can, however, be further
    edited by human translators.

    Scientifically, MT is interesting, because it is an obvious application and

    testing ground for many ideas in Computer Science, Artificial Intelligence

    and Linguistics.

    Philosophically, MT is again interesting, because it represents an attempt

    to automate an activity that can require the full range of human knowledge.

    For example, getting the correct translation of "negatively charged

    electrons and protons" into Punjabi depends upon the knowledge of the

    charge on protons, so the interpretation cannot be something like

    "negatively charged electrons and negatively charged protons". In this

    sense, the extent to which one can automate translation is an indication of

    the extent to which one can automate thinking.[10]

    MT is mainly used to bridge the Digital Divide. The Internet is changing the
    world very fast. A vast amount of information can be found on the Internet, but
    most of it is in English. In the context of rural India, most of this
    information is effectively unavailable to the rural masses, who have no
    knowledge of the English language. In spite of all the progress being made in
    the field of Information Technology, the rural masses remain deprived of
    technological advancements. One of the primary reasons for this is the
    inability to distribute information, and the language barrier is one of the
    biggest hurdles in this distribution. There is a great demand to translate Web
    pages and electronic mail messages into the native language, and hence a demand
    for Internet-based online translation services.

    MT can be used to assist human translators. There is a demand for online
    versions of electronic dictionaries as translation systems to help
    human translators.

    Thus, if MT becomes accurate and efficient enough, it can break down
    cultural barriers and make communication much easier among speakers of
    different languages.

    1.3 Assumptions

    We cannot build a fully automatic, high-quality Machine Translation System. It is
    difficult even to build a system for two languages with different word orders, so
    the present system works on a restricted set of sentences. The assumptions made
    for the development of this system are:

    1. If a paragraph is input to the system, each sentence should have a proper
       delimiter.
    2. The sentences should be simple. The system does not work for complex,
       compound, passive and interrogative sentences.
    3. The sentence given as input to the translator should be limited to six
       phrases, including the verb phrase.
    4. Word-level ambiguity, where a word can have a number of tags with the same
       grammatical category but different meanings depending on the context, is not
       resolved.
    5. Abbreviations must contain a period between the characters.

    1.4 Architecture of Punjabi to English Machine Translation System

    Fig 1.6 Architecture of Punjabi to English Machine Translation System

    1.5 Approach Applied for the System

    The approach applied for our Machine Translation System is the rule-based
    approach. Since the two languages have different word orders and a number of
    divergence patterns, the indirect approach is best suited to this type of
    translation. The system is broadly divided into three phases: the Analysis
    Phase, the Translation Phase and the Generation Phase. The Analysis phase
    consists of Pre-processing, Tokenization, Tagging and Chunking. The Translation
    phase includes Translation and Transliteration of each token, and the Generation
    phase involves Synthesis and Post-Processing.

    A brief introduction to the steps involved follows:

    1.5.1 Analysis Component

    The analysis component analyzes the source language text and passes the

    tagged phrases to the translation engine.

    1.5.1.1 Pre-Processing and Tokenization

    1.5.1.1.1 Tokenization

    The tokenizer takes as input the text generated by the previous module. Using
    a space or a punctuation mark as the delimiter, it extracts tokens (words)
    from the text and passes them to the next module for further processing.
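    A tokenizer of this kind can be sketched with a regular expression that
    treats whitespace and punctuation as delimiters. The delimiter set, including
    the danda (U+0964) used as a sentence terminator, and the function name
    `tokenize` are assumptions; the handling of abbreviations that contain periods
    is deliberately left out of this sketch.

```python
import re

# Whitespace, common punctuation marks and the danda (U+0964) act as delimiters.
DELIMITERS = re.compile(r"[\s.,;:!?\u0964]+")


def tokenize(text):
    """Split text on the delimiters and drop the empty strings left at the edges."""
    return [token for token in DELIMITERS.split(text) if token]
```

    For a romanized input such as "uh hadse vicc zakhmi ho gia." the function
    returns the six words without the final period.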

    1.5.1.1.2 Pre-Processing

    In the pre-processing phase, a number of operations are applied to the input
    sentences so that they can be processed by the translator with better
    accuracy. The system performs the following pre-processing tasks.

    (a) Text Normalization

    A small module has been developed to generate the database for
    standardization. The module finds the spelling mistakes in each word and
    replaces the word with the correct one taken from the database. For example, in
    the word , pairi bindi may be included after the character ; it should be
    corrected by taking the correct word from the database. Similarly, adhak is
    used with some characters in certain words and omitted in others. The database
    for standardization has been developed by analyzing the frequency of occurrence
    of the variant spellings of words.

    (b) Identifying Collocations

    Many words in the input sentences create problems if treated alone. In the
    pre-processing phase, we recognize such words and join them into a single
    word so that it can be translated into the target language.

    P:

    T: uhd maut hds vicc h ga

    P:

    T: lak d bhl shur kar ditt ga

    In the above sentences, (h ga) is joined as (h) and (kar ditt)

    is joined as (kt)

    The pre-processor combines adjoining words of the sentence into a single word
    by looking them up in the database of joined words.
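    The joining of adjoining words can be sketched as a lookup of adjacent token
    pairs in a small table. The table `JOINED_WORDS`, its romanized placeholder
    entries and the function name `join_collocations` are hypothetical; the actual
    database holds Punjabi entries and may cover longer sequences.

```python
# Hypothetical database of joined words: adjacent word pair -> single token.
# Romanized placeholders are used instead of Gurmukhi entries.
JOINED_WORDS = {
    ("ho", "gai"): "ho_gai",
    ("kar", "ditti"): "kar_ditti",
}


def join_collocations(tokens, joined=JOINED_WORDS):
    """Merge adjacent token pairs that appear in the joined-words database."""
    result = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])  # a 1-tuple at the last token, never in the table
        if pair in joined:
            result.append(joined[pair])
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result
```

    Tokens that do not form a listed pair pass through unchanged, so the step is
    safe to apply to every sentence.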

    (c) Identifying Named Entities

    Named entity tagging refers to the task of identifying named entities (such as
    person names) in a text. It is an important subtask of information extraction
    and retrieval. The system under discussion extracts only those words that are
    names of persons, places etc. Extracting such words is important so
    that they are not translated; once recognized, they are sent to the
    transliteration module. To recognize them, rules have been developed that
    check the preceding or succeeding word. For example, the names may be

    preceded by (sr), (sardr), (sardrn,), (srmt), (misar),

    (misz), (mis) etc. or followed by (sigh), (kaur) or surname.
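    The cue-word rules can be sketched as follows. The romanized cue lists are
    approximations of the examples in the text and are assumptions, as is the
    function name `mark_named_entities`; the actual system works on Punjabi script
    and also recognizes surnames and place names, which this sketch omits.

```python
# Romanized cue words approximating the examples in the text (assumed spellings).
TITLES = {"sri", "sardar", "sardarni", "srimati", "mister", "mrs", "miss"}
NAME_SUFFIXES = {"singh", "kaur"}


def mark_named_entities(tokens):
    """Return a list of booleans, True where a token looks like part of a name."""
    flags = [False] * len(tokens)
    for i, token in enumerate(tokens):
        lowered = token.lower()
        if lowered in TITLES and i + 1 < len(tokens):
            flags[i + 1] = True      # the word following a title
        if lowered in NAME_SUFFIXES:
            flags[i] = True          # the suffix itself is part of the name
            if i > 0:
                flags[i - 1] = True  # the word preceding Singh or Kaur
    return flags
```

    For the tokens "sardar manjeet singh ne kiha", the second and third tokens
    are flagged, so they would be sent to transliteration rather than translated.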

    1.5.1.2 Morph Analyzing and Tagging

    The next step is to tag each word with the grammatical information about it. In

    Punjabi grammar, the parts of speech include noun, verb, adjective, adverb,

    pronoun, preposition, conjunction, interjection, operators, auxiliary verbs etc. A tag

    contains the information about the grammatical category of the word, its gender, number,

    person and the case in which it can be used. Tagging works in two steps.

    1.5.1.2.1 POS Tagging

    The morphological database already created at Punjabi University, Patiala is
    used to get the information about each token, and tags are formed according
    to the information gathered from it.

    A tag has the form grammatical category-gender-person-number-case-tense-phrase-type.
    The fields not applicable to a particular category are left blank. For
    example, the tags for the word (d) are ipo- - - - - - - -, v-b-s-s- -f-x- -. These
    tags show that the word can be used as an inflected postposition or as a verb
    with any gender, singular, second person and future tense. In Punjabi,
    a word can have a number of tags, as a particular word can be used in a number of
    ways. The tagger first checks the category of each word in the database and
    then adds gender, number, person or tense information to it.[19, 20] Since a
    particular word may have a number of tags, the tag applicable to the word in a
    given sentence has to be determined.

    1.5.1.2.2 Ambiguity Resolution between Different Tags

    A hybrid approach, combining rule-based and statistical methods, is used to
    resolve ambiguity for a word with a number of tags. The first level of
    ambiguity exists when a particular word can have tags of different
    grammatical categories. Here the most probable tag is determined using the
    Viterbi algorithm, by observing the frequency of occurrence of the tags of the
    preceding words.

    For Example

    P:

    T: uh hds vicc zam h gi

    In the above sentence, the word (zam) has two tags: one shows that it is
    a noun and the other that it is an inflected adjective. Here the probability of
    the word occurring as a noun is higher than as an adjective, so it is
    considered a noun.

    But, if the sentence is

    P:

    T: uh n zam dm n mr ditt

    Here the word (zam) has a higher probability as an adjective.
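    The statistical step can be illustrated with a small Viterbi decoder over a
    bigram tag model. The tag set, the romanized placeholder words and every
    probability below are invented for illustration; in the actual system the
    probabilities would be estimated from the tagged training sentences.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a bigram tag model."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p].get(t, 0.0) * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


# Illustrative model: an adjective is usually followed by a noun, a noun by a verb.
TAGS = ["N", "ADJ", "V"]
START_P = {"N": 0.5, "ADJ": 0.4, "V": 0.1}
TRANS_P = {
    "N": {"N": 0.2, "ADJ": 0.1, "V": 0.7},
    "ADJ": {"N": 0.8, "ADJ": 0.1, "V": 0.1},
    "V": {"N": 0.4, "ADJ": 0.2, "V": 0.4},
}
EMIT_P = {
    "N": {"zakhmi": 0.3, "admi": 0.7},
    "ADJ": {"zakhmi": 1.0},
    "V": {"hoia": 0.5, "aaia": 0.5},
}
```

    Under these made-up numbers, the ambiguous word is tagged as a noun when a
    verb follows it and as an adjective when a noun follows it, which mirrors the
    two example sentences.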

    The second level of ambiguity that has been resolved arises when a number of
    tags show a particular word as a noun that can be either singular or plural.
    For example, the tags for the word (mu) are n-m- -s-o - - -, n-m- -p-d- - - -.
    For this type of ambiguity a rule-based approach is used.

    In the sentence,

    P:

    T: sr mu laan ga

    In the above case the tag n-m- -p-d- - - - - is selected, as the number of the verb
    phrase is plural, and the appropriate English word is boys, whereas in the case

    P:

    T: ik mu n main rki

    Here the tag for (mu) should be n-m- -s-o- - - - and its appropriate
    equivalent in English is boy. This type of ambiguity can be resolved by
    considering the number (singular or plural) of the auxiliary verb or the main
    verb present in the sentence. For the resolution of ambiguity, the rules are
    ordered according to priority.
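    The number agreement rule can be sketched as follows. Representing a tag as a
    dictionary and the function name `select_noun_tag` are assumptions made for
    readability; the field values are those of the two noun tags quoted above.

```python
def select_noun_tag(candidate_tags, verb_number):
    """Pick the noun tag whose number field agrees with the verb phrase.

    Each tag is a dict with at least a "number" key ("s" or "p").
    Falls back to the first candidate when nothing agrees.
    """
    for tag in candidate_tags:
        if tag["number"] == verb_number:
            return tag
    return candidate_tags[0]


# The two candidate tags of the ambiguous noun, written out field by field.
CANDIDATES = [
    {"category": "n", "gender": "m", "number": "s", "case": "o"},
    {"category": "n", "gender": "m", "number": "p", "case": "d"},
]
```

    With a plural verb phrase the plural tag is selected (English "boys"), and
    with a singular verb phrase the singular tag is selected (English "boy").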

    1.5.1.3 Chunking

    Chunking involves grouping the words of the input sentence into phrases such as
    noun phrases, postpositional phrases and verb phrases. A rule-based chunker has
    been developed for this purpose. First, the subject of the Punjabi sentence is
    chosen by applying the rules for the subject noun phrase, and then the other
    phrases are recognized in the predicate.

    For Example

    P:

    T: shkat n ih savikr kt hai

    Here (shkat n) is taken as subject noun phrase and the rest of sentence

    as predicate.

    Noun phrases, adjective phrases, postpositional phrases and verb phrases are
    formed by combining different word classes. A noun phrase consists of
    nouns or pronouns and can be preceded by modifiers, which can be adjectives.
    An adjective phrase is a phrase with an adjective as its head; it can also
    consist of adjectives with modifiers. (bahut m ) is an adjective phrase in the
    sentence (uhd niyat bahut m s). In Punjabi, the preposition is called a
    postposition, as it comes after the noun or pronoun. The postposition and its
    object make up a postpositional phrase, which can be used to modify noun
    phrases. For example, in the sentence (uh dasv jamt d vidirth s),
    (dasv jamt d) is the postpositional phrase. A verb phrase consists of the main
    verb, followed by operators and an auxiliary verb and preceded by an adverb.
    Operators are of four types: primary, passive, modal and progressive. These
    operators help to emphasize the working of the main verb.[8] For
    implementation in the MT system, a separate database is maintained for
    conjunct verbs, whose English equivalents are found by checking the preceding
    word.

    Chunking is performed using the rules for noun phrases, postpositional phrases
    and verb phrases. The rules for division into phrases are stored in the rule
    base of Punjabi, and the conversion rules are stored in the rule base of the
    target language.
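    A rule-based chunker along these lines can be sketched as a single pass over
    the tagged words. The tag names (ADJ, N, PRON, PSP, ADV, V, OP, AUX), the
    romanized example and the function name `chunk` are assumptions; the rules are
    a simplified reading of the phrase descriptions above and omit adjective
    phrases and subject selection.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP, PP and VP chunks using simple rules."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        start, tag = i, tagged[i][1]
        if tag in ("ADJ", "N", "PRON"):
            # Noun phrase: optional adjectives followed by nouns or pronouns.
            while i < n and tagged[i][1] == "ADJ":
                i += 1
            while i < n and tagged[i][1] in ("N", "PRON"):
                i += 1
            label = "NP"
            # A following postposition turns it into a postpositional phrase.
            if i < n and tagged[i][1] == "PSP":
                i += 1
                label = "PP"
        elif tag in ("ADV", "V", "AUX"):
            # Verb phrase: optional adverb, then verb, operators and auxiliary.
            if tag == "ADV":
                i += 1
            while i < n and tagged[i][1] in ("V", "OP", "AUX"):
                i += 1
            label = "VP"
        else:
            i += 1
            label = "O"  # anything not covered by the rules
        chunks.append((label, [word for word, _ in tagged[start:i]]))
    return chunks


SENTENCE = [("uh", "PRON"), ("dasvin", "ADJ"), ("jamat", "N"),
            ("da", "PSP"), ("vidiarthi", "N"), ("si", "AUX")]
```

    For `SENTENCE`, the sketch yields a noun phrase, a postpositional phrase
    ("dasvin jamat da"), a second noun phrase and a verb phrase.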

    1.5.2 Translation Engine

    This component either translates or transliterates each source language token
    into its target language equivalent.

    1.5.2.1 Transliteration

    The named entities recognized in the pre-processing phase and
    out-of-vocabulary words are given as input to the transliteration module.
    Transliteration writes a word in the target script character by character. For
    example, ਮਨਜੀਤ (manjīt) in Punjabi is transliterated in English as manjeet: m for
    ਮ, n for ਨ, j for ਜ, ee for ੀ, t for ਤ. The transliteration process uses a
    database of character mappings and certain rules to insert vowels wherever
    needed.
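    The character mapping and vowel insertion can be sketched for the example
    name. The map covers only the five Gurmukhi characters of that name, written
    as Unicode escapes, and the vowel rule (insert "a" after a word-initial
    consonant that is directly followed by another consonant) is a simplified
    assumption, not the rule set of the actual system.

```python
# Partial Gurmukhi-to-Roman map, covering only the characters of the example.
CHAR_MAP = {
    "\u0A2E": "m",   # GURMUKHI LETTER MA
    "\u0A28": "n",   # GURMUKHI LETTER NA
    "\u0A1C": "j",   # GURMUKHI LETTER JA
    "\u0A40": "ee",  # GURMUKHI VOWEL SIGN II
    "\u0A24": "t",   # GURMUKHI LETTER TA
}
CONSONANTS = {"\u0A2E", "\u0A28", "\u0A1C", "\u0A24"}


def transliterate(word):
    """Map each character; unmapped characters are copied through unchanged."""
    out = []
    for i, ch in enumerate(word):
        out.append(CHAR_MAP.get(ch, ch))
        nxt = word[i + 1] if i + 1 < len(word) else ""
        # Simplified vowel rule: only the first consonant receives an inserted "a".
        if i == 0 and ch in CONSONANTS and nxt in CONSONANTS:
            out.append("a")
    return "".join(out)


MANJEET = "\u0A2E\u0A28\u0A1C\u0A40\u0A24"
```

    With this map and rule, `transliterate(MANJEET)` gives "manjeet".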

    1.5.2.2 Translation using Bilingual Dictionary

    The next step is the use of a bilingual dictionary to translate each Punjabi
    word to its English equivalent. The sense of each word is chosen according
    to the morph information given in the tag attached to it.

    1.5.3 Generation Component

    This component synthesizes the target language equivalent tokens into

    sentences and then corrects those sentences to increase the accuracy of the

    translator.

    1.5.3.1 Synthesis

    Once the English equivalent of each word in the Punjabi sentence has been
    obtained, the words are synthesized first into phrases and then into a
    sentence using the structural rules of the English language. These rules are
    stored in the rule base of English.

    1.5.3.2 Post Processing

    After all the source text has been converted to target text, some grammatical
    errors remain to be corrected. The Post Processing phase applies rules that we
    have formulated for correcting grammatical errors in the generated output.
    Some of the rules used for post editing are:

    Deletion of a preposition, if present with an adverb: if a preposition and an
    adverb occur together in the translated sentence, the preposition is deleted.

    Replacement of "of" with "to" if it occurs between a verb and a noun.
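    The two post-editing rules can be sketched over a list of tagged English
    words. The tag names (PREP, ADV, V, N), the reading of "together" as
    "adjacent" and the function name `post_process` are assumptions made for this
    illustration.

```python
def post_process(tagged):
    """Apply two post-editing rules to a list of (word, tag) pairs.

    Rule 1: delete a preposition that stands next to an adverb.
    Rule 2: replace "of" with "to" when it sits between a verb and a noun.
    """
    result = []
    for i, (word, tag) in enumerate(tagged):
        prev_tag = tagged[i - 1][1] if i > 0 else None
        next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if tag == "PREP" and "ADV" in (prev_tag, next_tag):
            continue  # Rule 1: drop the preposition
        if word == "of" and prev_tag == "V" and next_tag == "N":
            word = "to"  # Rule 2
        result.append((word, tag))
    return result
```

    For example, "he went to there" becomes "he went there", and "gave of Ram"
    becomes "gave to Ram".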

    1.6 Thesis Organization

    The first chapter of this thesis introduces Machine Translation and provides
    details about the various types of MT systems. The benefits, applications and
    challenges of Machine Translation are described. After the various approaches
    used for Machine Translation and the stages of a generic MT system have been
    elaborated, a formal description of the research question undertaken in this
    thesis is provided, along with the major contributions and achievements of the
    research.

    Chapter 2 discusses the existing work in the field of Machine Translation in India

    and outside India. This chapter on literature survey forms the basis of our work

    on developing the Machine Translation System and later on helps us in

    comparing our work with the existing state of the art in Machine Translation

    System.

    Chapter 3 explains Punjabi and English languages and divergences in their

    patterns with respect to Machine Translation.

    Chapters 4, 5 and 6 provide the design and implementation details of various

    activities involved in the Machine Translation System. Chapter 4 describes the

    Analysis Phase which contains Pre-processing, Morph Analyzing, Tagging and

    Chunking. Chapter 5 describes the Translation Engine and Chapter 6 discusses

    the Generation Component which includes Synthesis and Post Processing.

    Chapter 7 provides the evaluation of the system and its results.

    Chapter 8 concludes the thesis by providing a summary of the research work

    undertaken, contributions of the research work, assumptions and limitations, and

    some directions in which this work could be extended in future.

    1.7 Summary

    This chapter has introduced Machine Translation, the various approaches for
    developing Machine Translation Systems and the key activities involved. This
    was followed by a formal statement of this research work along with its
    objectives, the challenges involved, its need and scope, and the potential
    application areas of the system. Further, the approach followed to develop the
    Punjabi to English Machine Translation System has been discussed, along with an
    overview of the architecture of the system. The chapter concludes with a brief
    outline of this thesis. The next chapter provides a survey of the existing
    literature in the field of Machine Translation.