36
MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING INFERENCE RULE MAISARAH BTE YAMAN A thesis submitted in partial fulfillment of the requirement for the award of the Degree of Master of Science (Software Engineering) Faculty of Computer Science and Information Technology Universiti Tun Hussein Onn Malaysia DECEMBER, 2013

MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

  • Upload
    lamtruc

  • View
    219

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

INFERENCE RULE

MAISARAH BTE YAMAN

A thesis submitted in

partial fulfillment of the requirement for the award of the

Degree of Master of Science (Software Engineering)

Faculty of Computer Science and Information Technology

Universiti Tun Hussein Onn Malaysia

DECEMBER, 2013

Page 2: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

iv

ABSTRACT

Natural Language Processing (NLP) is a technique where a machine can understand

human better and thereby reduce the distance between human being. NLP was

implemented in MyParser, acting as a tool of parsing sentence. This research aimed

to develop a sentence parser application for teachers and learners of the Malay

language who faced difficulties in comprehending the grammar and phrase structure.

There are three major processing steps that have been drawn which are Input

Component, Parsing Engine and Output Component. Parsing engine phase involved

pre-processing phase; the use of 'Tokenizer' and 'Part Of Speech' (POS). The input

component is a simple sentence provided by an expert of Malay Language (Munsyi

Dewan). It is categorized based on the inference rule implemented in MyParser. This

study is significant in designing inference rules for Malay Language sentences,

focusing on the relationship between computing language. The output was labelled

according to each phrase following the Malay Context Free Grammar (CFG) rule.

Thus, parsing technique is an essential component to be considered in parsing

application development. Besides that, MyParser application was developed to

process Malay simple sentence by categorizing them into different grammar and

phrase structure. The target groups for this application are students and teachers of

Malay Language subject in primary school. MyParser was evaluated using 100

training data which were agreed by a qualified expert on Malay Language of

Malaysia (Munsyi Dewan). More than thousand words were stored in the database.

This application was found to be able to visualize correct sentence with its labelled

graphical tags. MyParser was tested by different level of teachers from primary and

secondary school and Munsyi Dewan. The results proved that MyParser achieved

more than 90% accuracy in constructing sentences based on its grammatical rule.

Page 3: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

v

ABSTRAK

Pemprosesan bahasa semulajadi (NLP) adalah teknik yang mana sesuatu mesin boleh

memahami kehendak manusia. NLP adalah komponen yang penting dalam proses

penguraian bahasa yang dibangunkan iaitu MyParser. Penyelidikan ini bertujuan

untuk membangunkan aplikasi penghurai untuk memproses ayat tunggal Bahasa

Melayu bagi kegunaan guru dan pelajar kelas Bahasa Melayu yang mempunyai

masalah pemahaman nahu konteks Bahasa. Terdapat tiga bahagian utama yang

terdapat dalam MyParser iaitu ‘Komponen Input’, ‘Enjin Penghurai’, dan

‘Komponen Output’. Fasa Enjin Penghurai melibatkan proses 'Penandaan' dan

‘Golongan Kata’. Komponen input adalah ayat ringkas yang diberikan oleh pakar

bahasa (Munsyi Dewan) yang dikategorikan berdasarkan peraturan inferen yang

dilaksanakan di MyParser. Kajian ini menyumbang kepada pembentukan peraturan-

peraturan untuk ayat dalam Bahasa Melayu dengan memberi tumpuan kepada

hubungan antara bahasa pengaturcaraan. Oleh itu, teknik penghuraian ini merupakan

komponen penting dalam pembangunan aplikasi ini. Selain itu, aplikasi MyParser

dibangunkan untuk memproses ayat ringkas Bahasa Melayu dengan

mengkategorikan ayat-ayat tersebut dalam struktur tatabahasa dan frasa yang

berbeza. Kumpulan sasaran aplikasi ini adalah guru-guru dan pelajar dari sekolah

rendah yang mengambil matapelajaran bahasa Melayu. Tatabahasa yang dipilih

untuk aplikasi ini ialah nahu bebas konteks (CFG) dan terhad kepada ayat tunggal

Bahasa Melayu yang mengandungi frasa-frasa ayat mudah untuk Bahasa Melayu.

'MyParser' ini dinilai menggunakan 100 data ujian yang dipersetujui oleh pakar

Bahasa Melayu (Munsyi Dewan) dan terdapat lebih daripada 1000 patah perkataan

Bahasa Melayu yang disimpan dalam rekod pangkalan data. Aplikasi ini didapati

dapat menggambarkan ayat yang betul dengan tag grafik yang dilabel.Hasil

keputusan daripada kajian yang dilakukan terhadap guru Bahasa Melayu sekolah

rendah dan menengah serta Munsyi Dewan menunjukkan 'MyParser' mencapai lebih

daripada 90% ketepatan dalam pembentukan ayat mengikut nahu bahasanya.

Page 4: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

vi

CONTENTS

TITLE i

DECLARATION ii

DEDICATION iii

ABSTRACT iv

CONTENTS vi

LIST OF TABLE viii

LIST OF FIGURE ix

CHAPTER 1 INTRODUCTION

1.1 An Overview 1

1.2 Problem Statement 3

1.3 Research Objectives 5

1.4 Aim of Study 5

1.5 Research Scope 5

1.6 Report Outline 6

Page 5: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

vi

CHAPTER 2 LITERATURE REVIEW

2.1 Introduction 7

2.2 Sentence Grammar Overview 7

2.3 The Standard Malay Language 10

2.4 The Types of Malay Language Grammar 12

2.4.1 Sentence Grammar 12

2.4.2 CFG in Malay Language 12

2.4.3 Partial discourse grammar 15

2.4.4 ‘Pola’ (Pattern) Grammar 15

2.5 Rules for Basic Malay Sentence 17

2.6 Malay Text Categorization 18

2.7 Pre-processing 19

2.7.1 Parsing technique 21

2.7.2 Error Analysis of Parsing Technique 21

2.8 Related Work 23

2.8.1 Ahmad’s Malay Parser 23

2.8.2 Juzaiddin’s Malay Parser 25

2.8.3 Comparison Among Research Works 27

2.9 Summary 28

CHAPTER 3 RESEARCH METHODOLOGY

3.1 Introduction 29

3.2 The Proposed Framework for MyParser 30

3.2.1 Input Component 32

3.2.2 MyParser (Pre-processing process) 35

3.2.3 Output Sentence 36

3.4 Summary 36

CHAPTER 4 ANALYSIS: MYPARSER

4.1 Introduction 36

4.2 Specific Requirements 39

4.2.1 Functional Requirements 39

4.2.2 Non-functional Requirements 39

4.2.3 UML Specification 40

Page 6: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

vii

4.2.3.1 Use case diagram 40

4.2.3.2 Class Diagram 49

4.2.3.3 Sequence Diagram 53

4.2.3.4 Activity Diagram 55

4.2.3.5 Test Plan 56

4.3 Summary 58

CHAPTER 5 IMPLEMENTATION OF MYPARSER

5.1 Introduction 59

5.2 The implementation of MyParser 60

5.3 Summary 67

CHAPTER 6 RESULTS AND DISCUSSION

6.1 Introduction 68

6.2 The Experimental Result of MyParser 68

6.2.1 Error Analysis of MyParser 71

6.3 Comparison with other Malay NLP 76

6.3 Summary 76

CHAPTER 7 CONCLUSION

7.1 Introduction 77

7.2 Achievement of Research Objectives 77

7.3 Recommendations for Future Work 78

7.4 Conclusion 80

REFERENCES 81

APPENDIX A 84

APPENDIX B 102

Page 7: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

viii

LIST OF TABLE

4.1 The Description of Insert Sentence 42

4.2 The Description of Edit Sentence 43

4.3 The Description View Malay CFG Rule 44

4.4 The Description Update Sentence 45

4.5 The Description Add Rule 46

4.6 The Description Update Rule 46

4.7 The Description Update New Word 47

4.8 The Description Record Data 48

4.9 The Description Class User 50

4.10 The Description Class Student 50

4.11 The Description Class Teacher 51

4.12 The Description Class Programmer 51

4.13 The Description Class Programmer 52

4.14 The Test Plan of MyParser 57

6.1 Results Of Sentence Parse 67

6.2 The Analysis Error of MyParser 69

6.3 The Analysis Of Malay NLP 69

Page 8: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

x

LIST OF FIGURE

2.1 First tree of “He saw the boy with a telescope” (Meyer et al., 2002) 8

2.2 Second tree of “He saw the boy with a telescope” (Meyer et al., 2002) 9

2.3 First tree of “Kami adang air itu 9

2.4 Second tree of “Kami adang air itu” 10

2.5 Context-Free Grammar for Malay Language (Karim,1995) 14

2.6 Pre-processing for conceptual clustering (Zakree et al., 2008) 20

2.7 The Format For Errors Analysis by Hendrickson (1979) 22

2.8 The Architecture of The Parsing For Ahmad's Malay Parser 23

2.9 The Architecture of The Juzaidin's System 25

3.1 The Proposed Framework for MyParser 30

3.2 Flowchart of MyParser processing sentence 31

4.1 Recommended SRS Structure (IEEE Std. 830-1998) 38

4.2 Use Case Diagram of MyParser 41

4.3 Class Diagram of MyParser 48

4.4 Sequence Diagram of User 53

4.5 Sequence Diagram of Programmer 55

4.6 Activity Diagram of The Users 55

4.7 Activity Diagram of The Programmer 56

5.1 The Main Page of Application 60

5.2 The Sentence Tokenization Process 61

5.3 The Algorithm Sentence Tokenization 61

5.4 The Word By Word Tokenization Process 62

5.5 The Algorithm Word for Tokenization Process 62

5.6 The Algorithm Word for POS Tagging Process 62

5.7 The Error Massage Window Of Myparser Application 63

5.8 The Inference Rule for MyParser Based on Malay CFG 64

5.9 The Message Window Of Malay CFG Structure Correct 65

Page 9: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

xi

5.10 The Sentence Visualization Structure Of Malay CFG Is Correct 66

5.11 The Sentence Visualization Structure Of Malay CFG Is Incorrect 64

6.1 The Score Of Build Correct And Incorrect Sentence 68

6.2 The Pie Chart Of Total Sentence Parsed 69

6.3 The Errors Analysis of MyParser 70

6.4 The Lexicon Analysis Error of MyParser 71

6.5 The Syntax Analysis Error of MyParser 72

6.6 The Orthography Analysis Error of MyParser 73

Page 10: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

CHAPTER 1

INTRODUCTION

1.1 Overview

Processing techniques for English texts have been developed over long periods of

time and have their own general approach to the structure of English. This cannot be

done for Malay language, partly because the structures of Malay language remains

largely improper, and partly because it is strikingly different from English (Don,

2010). The development of the database incorporates two important design

principles, namely the logical organization of data, and the separation of text

properties, lexicon, and grammar. Grammar is a formal specification of rules of

language, while parsing is a method to perform syntactic analysis where the syntactic

means the rule of the grammar. According to Abidin et al. (2007), grammar is the

rule and pattern which combined words, clause, and phrases to provide meaning to

our daily use and the study of grammar that relates to language parsing. Muhamad &

Jamaludin (2012) concluded that "In Malaysia, research in sentence parse tree

visualization for Malay language (BM) still not gaining enough attention from

researcher to construct a prototype as what been done to English language".

Recently, the progress on developing Malay Text parser for development of

simple sentences categorization has not spread widely despite the massive

development of other language in Indo-European and other Asian family language.

The growth of text processing application in English is becoming more widespread

in many different fields such as information system, natural language processing.

The use of text categorization with natural language processing approach is derived

from a combination of the important role of their relation between each other. By

considering the time and cost factor, the software requirement has played an

Page 11: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

2

important role in most of the building process of Natural Language Processing (NLP)

toolkit (Talita & Yeo, 2010). An effective NLP tool as the pre-processing component

is considered necessary in developing a tool for automatic parsing of Malay text

based.(Zakree et al., 2008). The purpose of NLP is to ensure that machines

understand the human language. In a study by Surabhi (2013), NLP is a technique

where a machine can become more human, and in that way, reducing the distance

between human being and the machine. Therefore, NLP makes human to easily

communicate with a machine. Many applications have been developed within past

few decades in NLP. Most of these are very useful for everyday life. There are also

lots of research groups working on this topic to develop more practical useful

systems.

Furthermore, the entities, properties, interrelationship, and functions derived

by graphical design notation approach are representation of knowledges intended to

capture the conceptualized information (Talita & Yeo, 2010). For English sentence

grammar, most of the language researchers introduced sentences parse tree

visualizations to help understanding sentence structure. Among the applications

introduced, phpSyntaxTree and RSyntax tree give users the chance to visualize an

English sentence throughout online dealings (Muhamad & Jamaludin, 2012).

Currently, word-processing application for Malay language only exists in the form of

words translator or spellchecker. Therefore, this research aims to complement the

existing word-processing software by presenting an application of Malay sentence

parser using structure of an expression by writing square brackets ('[' and ']') to the

left and right hand side of its component parts technique. The prototype is able to

illustrate the structure of a grammatically correct sentence, determine if a sentence is

grammatically correct, and semantically parse a sentence. In evaluating the

application, sentences which were inputs to the application, were randomly provided

by experts in the structure of Malay language grammar. If a sentence follows the

Context Free Grammar (CFG) rules for Malay language but is incorrect in terms of

semantics, it is considered to be a correct sentence. So, it enables people and

software to share a common understanding of the information’s nature and structure.

Then, the sentence structure will be categorized between the inferences rules that will

be develop in this studies.

Page 12: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

3

The pre-processing component specifications play a key role in enhancing the

potential for data sharing and reuse across the applications. Prior to any deeper

linguistic treatment of a text, the units of the text must be marked and possibly

classified. So, to categorize Malay language sentence using MyParser application, it

involved some tool in NLP where we need to develop the programming language for

the acceptance of Malay Text via the NLP tools. An important element of this

research is the structure of building this MyParser to parsing the text to build the

labelled Malay sentence.

This study contributes a design of an Inference Rule for Malay Text in the

parser which focuses on the relationship between the programming language and the

implementation of the inference rule that was created using labelled bracket notation

from left to right, top bottom parse approach (Charles, 1994). Subsequently, a parser

was developed for the simple sentence of Malay Text based on Inference Rule that

followed several Malay CFG rules. Moreover, MyParser provides the ability to full

fill the sentence required by percentage of application tested on the three test cases.

The main goal of this study is to outline the Inference Rules for Malay sentence

which is the key of MyParser to process and create a sentence with Malay CFG

grammatical label using graphical bracket labeled approach. The sample of simple

sentence was provided by Munsyi Dewan, who is the expert of Malay language.

Munshi Dewan is a speaker officially selected and established by the Dewan Bahasa

dan Pustaka (DBP) to conduct courses, talks, and seminars of the Malay language in

public and private sectors. The role of Munsyi Dewan is to provide expert services

for reference and to help DBP in upholding the use of Malay language in public and

private sectors. In this study, Munsyi Dewan plays an important role for overviewing

the inference rule used in the parser.

1.2 Problem Statement

Reading and writing are basic skills in Malay Language for primary school children.

Mastering these skills are important in order to achieve greater achievements in the

coming years. A study was conducted by Muzaliha et al. (2012) to investigate

children with learning disabilities. The targeted sample consisted of 1010 children

Page 13: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

4

with age range between 8 and 12 years old from 40 primary schools in the Kota

Bharu district. The children performance was measured via the Early Intervention

Class for Reading and Writing Screening Test conducted by the Ministry of

Education, Malaysia. The results indicated that a total of 4.8% of students had

learning and writing skills problem. They suggested that educators should identify

and treat these students at an earlier stage.

In an effort to present a tool for automatic Malay CFG application, it is

important to build the Inference Rule by following the CFG rule in Malay Text so

more precise results on the Malay grammar can be produced. However, there is still

no consensus standard in producing inference rule in Malay Text using NLP toolkits

(Zakree & Nazri, 2008). Source for information retrieval for the Malay Text

categorization using NLP is also limited.

Even though there are recent researcher overviews on Malay NLP,

information are still limited in the implementation of categorization the Malay text

using the NLP (Zakree & Nazri, 2008). To date, there is no developed system has

been designed for the processing of Malay language for sentence correction, but only

prototype is produced (Noor & Jamaludin, 2012). Furthermore, until now, parser

which processing Malay sentence for correction of wrong sentence has not yet been

established for any language, especially in Malaysia language (Noor & Jamaludin,

2012).

However, an effective parser is important in any NLP toolkits because their

performance depends on the quality of input fed into it. The contribution in this

study suggested to develop MyParser. In order to producing a better algorithm for

NLP toolkit programming language based on the inference rule used for Malay Text

Categorization.

Page 14: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

5

1.3 Research Objectives

The objectives of this research are as follows:

i. To design Inference Rule for Malay Text Categorization in the parser.

ii. To develop a parser for Malay Text Categorization based on Inference

Rule found in (i) for Malay simple sentence.

iii. To evaluate MyParser using several case studies.

1.4 Aim of the Study

The aim of this research is to concentrate on developing an application for Malay

Text categorization based on several sentence structures according to the Malay

Context Free Grammar (CFG).

1.5 Research Scope

This study focuses only on Malay Text Processing which is categorized into two

clauses which are subject and predicate. The phrase will categorize limit on noun

phrase, verb phrase, adjective phrase, and prepositional phrase. For data sampling,

the input sentences have been taken from primary school books and tagged manually

by Munsyi Dewan following the Malay language CFG. For data library limit, it has

been taken only from one source which is from Kamus Dewan Bahasa dan Pustaka

Bahasa Melayu 4th

Edition. For the techniques limit, this research focus only on NLP

toolkit and its programming language.

Page 15: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

6

1.6 Report Outline

This thesis is organized from general related knowledge to a deeper discussion,

where each chapter is the foundation of the next chapter. The rest of this thesis is

organized as follows; Chapter 1 is the introduction of the report, followed by Chapter

2 that provides the relevant background information on Malay sentences

categorization using MyParser. The discussion is then continued with the

complementary approaches for studying the Malay Text Processing Tools in

Categorization that leads to categorize the simple sentence. Afterwards, overview of

Malay Text Grammar Rule is presented. In addition, this chapter also provides a

general introduction to Malay Text Categorization using MyParser. Chapter 3

explains and illustrates research methodology used to carry out the categorization of

the Malay Text. This chapter also discusses on the method and framework that are

used in the pre-processing part that contribute to this research. Chapter 4 discusses

several software requirements that based on IEEE Std 830-1998 for SRS. This

specific requirement helps a lot in phase of development and implementation. It

helps in the description of the important document in an application and facilitating

two-way process between the developer documentation and the user. Chapter 5 is

about the implementation of MyParser based on the proposed framework. This

chapter shows how the implementation occurs and its user interface. Then, Chapter 6

is concerned with the results and discussion. Lastly, Chapter 7 is the conclusion of

this dissertation where it provides some conclusions, implementation of research

objectives, recommendations for future work and summary. All related documents

are presented in APPENDIX section.

Page 16: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

This chapter will briefly describe the dominant topics of the research such as

Sentence Grammar Overview and Standard Malay Language. Then, the types of

Malay language grammar that consist of sentence grammar are discussed. Next, Rule

for basic Malay Sentence followed by Malay Text categorization, Pre-Processing

with the parsing technique is presented to give a clear grasp on the research field

area. Finally, this chapter will come out with an overview of related studies, relevant

to the research field included to foster the main issues that have to be addressed. The

issues are: document representation and categorization of the Malay language

sentence based on Malay Context Free Grammar (CFG) rule.

2.2 Sentence Grammar Overview

Grammar is an inner regularity and a simple knowledge representation of language

(Chomsky, 1966). It occurs from language and plays the most important role in the

implementation of the fundamental aims of linguistic analysis. Grammar plays two

roles which are to separate the grammatical sentence from the ungrammatical

sequences and study the structure of grammatical sentences.

The grammar of a language is also a device that generates all grammatical

sentences of the language and none of the ungrammatical ones. Parsing a sentence is

Page 17: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

8

a difficult task. It is an initial step in understanding natural language, although

ambiguity is a serious problem that linguists face in building the algorithm for

natural language processing for sentence parser. An ambiguity problem occurs when

more than one parse tree are constructed. For example, the sentence “He saw the boy

with a telescope” can give two interpretations for readers. First reading: “He used the

telescope to see the boy”. Second reading: “He saw the boy who had a telescope”.

Both versions can also be interpreted by using tree structures. The first tree is shown

in Figure 2.1.

Figure 2.1: First Tree of “He saw the boy with a telescope” (Meyer et al., 2002)

Abbreviations: Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase

(PP), Noun (N), Verb (V), Determiner (D), Preposition (P)

Page 18: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

9

Figure 2.2: Second Tree of “He saw the boy with a telescope” (Meyer et al., 2002)

Abbreviations: Sentence (S), Noun Phrase (NP), Verb Phrase (VP), Prepositional Phrase

(PP), Noun (N), Verb (V), Determiner (D), Preposition (P)

Figure 2.3: First Tree of “Kami adang air itu” (Karim, 2004)

Abbreviations: Ayat (A), Subjek (S), Predikat (P), Frasa Nama (FN), Frasa Kerja (FK),

Kata Nama (KN), Kata Kerja Transitif (KKTr), Penentu (Pent)

Page 19: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

10

Figure 2.4: Second Tree of “Kami adang air itu” (Karim, 2004)

Abbreviations: Ayat (A), Subjek (S), Predikat (P), Frasa Nama (FN), Frasa Kerja (FK),

Kata Nama (KN), Penentu (Pent)

Structural ambiguity means a phrase or sentence can have more than one

structure as shown in Figure 2.1 and Figure 2.2. Other than structural ambiguity,

there are also three other types of ambiguity that usually occur in natural language

such as English and Malay language, which are part of- speech ambiguity, semantic

ambiguity, and verbal ambiguity (Jurasky et al., 2000). In minimizing the structural

ambiguity, Myparser is one of the solutions that can be used to encounter the

problem using inference rules created by Malay Context Free Grammar (CFG) rule.

2.3 The Standard Malay Language

The Standard Malay Language or Bahasa Baku (the word Baku comes from a

Javanese word which means true and correct) that was made upon agreement

between Malaysia, Indonesia, and Brunei is Bahasa Riau. This implies that the

spelling, words, phrasing, grammar, pronunciation, punctuation, sentences,

abbreviations, acronyms, capital letters, numbering, and style of the language are

already standardized (Khalifa et al., 2007).

The Malay language has their own context free grammar where there is a

combination of subject and a predicate in a sentence (Juzaiddin et al., 2006). It

requires a set of grammar rules which is also known as context-free grammar (CFG)

Page 20: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

11

in English or phrase structure rules (RSF) in Malay Language. Every sentence used

in a language is constructed according to the CFG, especially in Malay Language.

For this reason, there are lots of research that have been conducted in language

studies in producing a good sentence structure, especially in BM (Noor & Jamaludin,

2012). Sentence parser is one of the tools of technology that can be used in validating

a sentence to produce a good sentence structure. It is also known as a syntactic parser

by others researcher. It parses the sentence according to the CFG provided. Its

function is to validate the construction of words used in a sentence. If a sentence is

structured according to the rules of CFG, the parser will classify the sentence as true.

There are many studies conducted by Malay language researchers on

sentence parser as cited in Latif (1995), Ramli (2002), and Ahmad et al. (2007). The

studies could validate a sentence according to the CFG rules. Thus, this study is to

take up this challenge in producing an algorithm in the development of Malay Text

parser with categorization of sentence into correct structure of grammar rule. The

researcher studied on the Malay sentence similarity which is based on searching the

appropriate context within Malay sentence like pattern of words. They use the

context which was determined by seeking rules from a rule-based phrase database. In

implementing this approach (Rahman et al., 2011), working on prototype application

is described which can be used as a tool for improving writing text in Malay

language, especially well personalized toward the requirements of teaching and

learning this language in primary and secondary schools.

The challenges ongoing Malay Text based as the domain of research makes

this language are dominant to be used as sample when developing a parser. It is

different from English text where the rules of grammar phrase are simpler to

understand and NLP tool are based on English language.

Page 21: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

12

2.4 The Types of Malay Language Grammar

There are three types of grammar in Malay language; sentence grammar, partial

discourse grammar, and the pola (pattern) sentence grammar.

2.4.1 Sentence Grammar

This type of grammar uses personal, dialectal (the total amount of a language that

any person knows and uses), artificial sounding, and independent sentences as a

guide in making syntactic Malay sentences. Ayat (sentence) grammar has two

models, namely the transformational-generative grammar and the relational grammar

(Nik Safiah Karim, 1975). The transformational-generative grammar is a grammar

that consist a series of phrase-structure rewrite rules. For example, a series rules that

generates the underlying phrase structure of a sentence; and a series of rules that act

upon the phrase structure to form a more complex sentence. The relational grammar

is a theory of descriptive grammar which stated the syntactic operations such as the

relationship between subject and object. These two models are inherited of CFG.

2.4.2 CFG in Malay Language

CFG for Malay language was formed by Nik Safiah Karim (1995). It became the

basis in developing probabilistic for Malay language grammar. The CFG in forming

a basic sentence of the Malay language is pictured in Figure 2.5.

Page 22: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

13

Table 2.1: Description of Elements Used in Malay Grammar Rules

(Ahmad et al., 2007)

Page 23: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

14

Figure 2.5: CFG for Malay Language (Karim, 1995)

Page 24: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

15

2.4.3 Partial discourse grammar

A partial discourse is the grammar that picks out the sentences from discourse to

make linguistic statements about them. This type of grammar is different from

sentence grammar because it uses “language-first” approach in the writing of syntax

while sentence grammar uses “theory-first” approach. According to Simin (1988),

“language-first” approach represents a chance for Malay readers to read the latest

ideas in his own language about the genius of his language, while “theory-first”

approach is more likely to be used in order to make the chosen theory appear

workable.

Example of partial discourse grammar:

Aminah membaca buku. Dia juga mendengar radio.

Dia (She) is referring to Aminah

(Aminah is reading a book. She is also listening to the radio.)

2.4.4 ‘Pola’ (Pattern) Grammar

“Pola” grammar is the pattern of grammar in the sentences. This type of grammar

was used by Azhar Simin (1988). Each “pola” is linked to class-name that forms or

helps to make a basic sentence. Each “pola” is a formula to make a basic sentence.

Page 25: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

16

Example:

”Pola”: Pelaku + perbuatan

Pattern: Actor + verb

Sentence: Saya makan.

(I eat).

Karim (2004) represents the most theoretical work on pola grammar. It provides a

methodology for pola grammar writing. Below are the pola of grammar for Malay

language:

(i) Pelaku + Perbuatan (Actor + Verb)

(ii) Pelaku + Perbuatan + Pelengkap (Actor + Verb + Complement)

(iii) Perbuatan + Pelengkap (Verb + Complement)

(iv) Diterangkan + Menerangkan (Signified + Signify)

(v) Digolong + Penggolong (Classified + Classifier)

(vi) Pelengkap + Perbuatan + Pelaku (Complement + Verb + Actor)

(vii) Pelengkap + Perbuatan (Complement + Verb)

In forming a basic sentence in Malay language, the type of grammar that is

suitable to use is sentence grammar which provides rules. The rules are derived from

CFG that was mentioned in Figure 2.5.

Page 26: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

17

2.5 Rules for Basic Malay Sentence

Mainly, to create rules for a sentence in Malay language, we should follow CFG for

Malay language by Nik Safiah Karim (1995) as shown in Figure 2.5. A basic

sentence in Malay language can be derived from these four basic patterns of rules:

1) A → FN + FN

(S → NP + NP)

2) A → FN + FK

(S → NP + VP)

3) A → FN + FA

(S → NP + AP)

4) A → FN + FS

(S → NP + PP)

Where,

A = Ayat (sentence), FN = Frasa Nama (Noun Phrase), FK = Frasa Kerja

(Verb Phrase), FA = Frasa Adjektif (Adjective Phrase), FS = Frasa Sendi

(Prepositional Phrase)

These basic rules will be used in the implementation of inference rule inside

MyParser which will categorize the sentence structured with their own phrase.

Page 27: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

18

2.6 Malay Text Categorization

Malay Text Categorization means to categorize or classified the Malay Text into a

suitable label which can be applied in the Malay sentence. In this study, the goal of

text categorization is to classify some given new sentences into a fixed rule of

predefined categories or structures. Other researchers nowadays, especially in

Malaysia likes to explain the new prototypes of Malay Text processing on various

way and not focusing on commercializing it. As mention in previous chapter, to date,

there are no developed system that have been designed in processing Malay language

for sentence correction, just the prototype (Noor & Jamaludin, 2012).

In this study, the Malay Text Categorization is done based on the Inferences

Rule built from the CFG and implemented in MyParser. MyParser will categorize the

Malay simple sentence into its structure based on CFG and the output is labelled

using the bracket notation. The example will be explained in Chapter 5. Furthermore,

previous researcher explained Automated text categorization (TC) as a supervised

learning task, defined as assigning category labels (pre-defined) to new documents

based on the likelihood suggested by a training set of labelled documents (Yang &

Liu, 1999). The categories are just symbolic labels, and no additional knowledge (of

a procedural or declarative nature) of their meaning is available. No data provided

for classification purposes by an external source (exogenous data) knowledge is

available; therefore, classification must be accomplished on the basis of endogenous

knowledge only (i.e., knowledge extracted from the documents). Therefore, building

the parser for text categorization is a smart tool that can be used to differentiate or

categorized every word in the sentence with the phrase.

Page 28: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

19

2.7 Pre-processing

Prior to any deeper linguistic treatment of a text, the unit of text must be demarcated

and possibly classified. It can initially be viewed as mere sequences of character

within which we must define these unit (Grefenstette & Tapanainen, 1994).

Pre-processing plays an important role in developing MyParser, which in pre-

processing part, involves Natural Language Processing Toolkit (NLP toolkit). The

top down approach (Bill, 2002) is applied to the Malay language grammatical rules.

In addition, Malay words applicable to humans and animals are identified to achieve

a basic level of semantic parsing on top of syntactical parsing. The developed

prototype expects a user to input a sentence before parsing it for grammatical errors

check.

Next, the prototype should be able to display the graphical bracket notation

for a grammatical structure of the sentence so that the user would know the structure

arrangement of a correct sentence and develop the programming language for Malay

Text Categorization that make the application suit with Bahasa Melayu grammar. In

pre-processing part, NLP plays an important role because the method of producing

text categorization is implemented in it (Zakree et al., 2008). In understanding the

state of the art in Malay language technologies, they conducted three tests that are

available in NLP applications, in which two of the tools are Malay Sense Tagger

Prototype1 and a syntactic parser based on maximum entropy. Figure 2.6 shows the

pre-processing for conceptual clustering that similarly used to be implemented in text

categorization.

Page 29: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

20

Figure 2.6 : Pre-processing for Conceptual Clustering (Zakree et al., 2008)

Many of the knowledge-representation and inference techniques that have

been applied fruitfully in knowledge-based systems were originally developed for

processing natural language. However, the language-processing applications

themselves have always seemed far from being realized. This special series on NLP

is an attempt to bring language processing and its applications into focus - to

demonstrate techniques that have recently been applied to real-world problems, to

identify research ripe for practical exploitation, and to illustrate some promising

combinations of NLP with other emerging technologies.

In NLP, there is a different between language dependent and independent. A

language dependent system would be a system geared at a specific language, or a set

of languages. It might perhaps utilize manually built lexical resources such as

ontologies, thesauri or other language or domain specific knowledge bases (Hassel &

Dalianis, 2011). Other dependencies constraining a system to a specific language

may be the employments of advanced tools, for example, full parsers, semantic role

assigners or named entity tagging, or the use of techniques such as template filling.

Language independent usually denotes a NLP system that is easily dealt in between

different languages or domains. The system is thereby independent of the target

language.

Page 30: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

21

2.7.1 Parsing technique

Parsing a sentence is a process of breaking down a sentence into its component

words. It involves the use of linguistic knowledge of a language to discover the way

in which a sentence is structured. According to Charles (1994), there are two types of

parsing a sentence; top-down and bottom-up parse.

Top-down parse is a predictive process where the verification of a rule occurs

as the last step in the application of the rule. It begins with the start symbol at the top

of the parse tree and works downward. For example, replacing S with NP VP;

matching the left side of a rule to an appropriate symbol and then replacing that

symbol with the symbols on the right side of the rule.

In contrast, the bottom-up parse is a postdictive process, where a rule not

applied to the input required for its application is satisfied. It is referred to as bottom-

up parse because the construction of the tree begins from the terminal nodes of the

tree. For example, replacing NP VP with S; matching the right side of a rule to the

appropriate symbols and then replacing the matched symbols with the symbol on the

left side of the rule.

2.7.2 Error Analysis of Parsing Technique

According to Hendrickson (1979) who introduced the grid for the use of evaluation

of language performance (Figure 2.7), the grid format allows errors to be categorized

along two scales. The horizontal scale are categories by lexicon (vocabulary), syntax

(grammatical structure: word order, verb phrase and etc.), morphology (grammatical

agreement), and orthography (spelling errors). On the vertical scale, "global" refers

to errors that affect the organization of the entire sentence (for example missing

subjects or main verbs). "Local" errors affect only the constituent in which they

appear (such as a noun phrase or prepositional phrase). Problems area means to be

filled in with short descriptors of the errors. In spite of these differences,

Hendrickson’s grid provides a good framework on which to contrast an attempt to

predict the categories of errors which will be found by the syntactic parser and to

Page 31: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

22

describe how that parser might be designed to provide meaningful error messages to

the user.

Figure 2.7: The Format for Errors Analysis (Hendrickson, 1979)

Then, the average recall Avg (2.1) and the weighted average recall WAvg

(2.2) of correctly parsed sentences were reported using the following equations:

(2.1)

Where k = 3 represents three person levels, A the number of test cases, and B the

total number of sentences correctly parsed. The formula was introduced by Abidin

(2007).

(2.2)

The weighted of average recall was calculated mainly because the number of total

sentence correctly parsed are different for each other.

Page 32: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

23

2.8 Related Work

From the previous studies, the parser produced by Latif (1995) and Ahmad et.al.

(2007) was used in the study to produce output in the form of a syntax tree and

receiving input of sentence that can be seen as to have more in-depth studies

compared to Suzaimah’s parser. The Suzaimah’s parser has the limiting factor where

the input was only in Parlog code (Ramli, 2002). Besides, the resulting output

(Parlog clause) is hard to understand by some users.

2.8.1 Ahmad’s Malay Parser

The parser was developed by a group of researchers from Universiti Teknologi

Petronas led by Ahmad Izuddin Zainal Abidin (Ahmad et. al., 2007). This parser is a

type of syntactic parsing using a top-down parsing approach. The target of the parser

is to complete the existing word processing system by checking the grammar of a test

sentence. Another function of this parser is that it is able to illustrate a parse tree if

the sentence is grammatically correct. The research domain of the parser is Malay

language and focuses on basic Malay sentences. This parser also focused on the

semantic part, which is a basic level of semantic parsing on top of syntactical

parsing. In the semantic parsing, Malay words are divided into two categories:

humans and animals. Some examples of words for humans are ‘mengandung’

(pregnant), ‘memasak’ (cooking), and ‘berfikir’ (thinking) while examples of

animals are ‘meragut’(grazing), ‘mengawan’ (mating) and ‘bunting’(pregnant for

animals). The advantage of this parser is that it can handle the semantic ambiguity.

Case in point, the sentence ‘bapa meragut rumput' (a father is grazing grasses) failed

in parsing as the word ‘meragut’ is categorised under animals and not for humans.

This parser was evaluated by experts in Malay language, specifically school teachers.

Page 33: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

24

Figure 2.8: The Architecture of Ahmad's Malay Parser

Figure 2.8 illustrates the architecture system for Ahmad’s Malay Parser.

When a user inputs a sentence, the checking engine will parse the test sentence using

the text parser component. In the text parser component, there are two important

parts which form the technical structure of the parser. These parts are grammar rules

and Malay Lexicon. The grammar rules are derived from Karim (1995) while the

Malay Lexicon contains three thousand words (3000) and arranged according to

word categories. The words were collected from Kamus Dwibahasa Oxford Fajar 2nd

Edition (Hawkins, 2001).

Page 34: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

81

REFERENCES

Abidin, A. I., Yong, S. P., Kasbon, R., & Azman, H. (2007). Utilizing top-down

parsing technique in the development of a Malay language sentence

parser.Proceeding of 2nd International Conference on Informatics, 125–131.

Bill, W. (2002). Grammars and Parsing, Retrieved September 12, 2013, from

http://www.cse.unsw.edu.au/~billw/cs9414/notes/nlp/grampars.html

Blossom, A., Gebhard, D., Emelander, S. & Meyer, R. (2007). Software

Requirements Specification (SRS) Book E-Commerce System. Retrieved June

23, 2013, from http://www.cse.msu.edu/~chengb/RE-491/Papers/SRS-BECS-

2007.pdf

Charles F. S. (1994). Parsing Techniques: A practical Guide. State University of

New Jersey. Retrieved September 30, 2013, from http://www-

rci.rutgers.edu/~cfs/305_html/Understanding/Parsing.html

Cheng, B. H. C., (2007). “Intro to Specifications”. CSE 435, East Lansing, MI,

Computer Science and Engineering, Michigan State University

Chin, B. A., (2000). The Role of Grammar in Improving Student’s Writing. Retrieved

September 12, 2013, from http://www.sadlier-oxford.com/papers/chinpaper.html

Chomsky, N. (1966). Topics in the theory of generative grammar (Vol. 56). Walter de

Gruyter.

Dennis A., Wixom B., H., & Tegard D., (2005). System Analysis and Design with

UML Version 2.0. John Wiley & Sons Inc. ISBN 0-471-34806-6

Domeiji R., Knutsson O., & Eklundh K.S., (2002). Different Ways of Evaluating a

Swedish Grammar Checker, Proceedings of The Third International Conference

on Language Resources and Evaluation (LREC 2002). Las Palmas, Spain

Don, Z. M. (2010). Processing Natural Malay Texts: a Data-Driven Approach.

Trames. Journal of the Humanities and Social Sciences, 14(1), 90.

doi:10.3176/tr.2010.1.06

Grefenstette, G., & Tapanainen, P. (1994). What is a Word, what is a Sentence?:

Problems of Tokenisation (pp. 79-87). Rank Xerox Research Centre.

Hassel, M., & Dalianis, H. (2011). Portable Text Summarization. Applied Natural

Language Processing and Content Analysis: Advances in Identification,

Investigation and Resolution, 17.

Page 35: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

82

Hawkins J. M. (1997). Kamus Dwibahasa Oxford Fajar Edisi Kedua. Penerbit Fajar

Bakti Sdn. Bhd.

Hendrickson, J. M. (1979). Evaluating Spontaneous Communication Through

Systematic Error Analysis. Foreign Language Annals. 12(5), 357-364.

IEEE Recommended Practice for Software Requirements Specifications. (1998).

Retrieved June 23, 2013, from http://www.cse.msu.edu/~cse870/IEEEXplore-

SRS-template.pdf

Jurafsky, D., & James, H. (2000). Speech and Language Processing. Retrieved June

23, 2013, from http://www.deepsky.com

Juzaiddin, M., Aziz, A., Ahmad, F., Azim, A., Ghani, A., & Mahmod, R. (2006).

Pola grammar technique for grammatical relation extraction in Malay

language.Malaysian Journal of Computer Science, 19(1), 59.

Karim N. S. & Arbak O. (2004). Kamus Komprehensif Bahasa Melayu. Cetakan

Kedua. Penerbit Fajar Bakti Sdn. Bhd.

Kassim M. K., Kaedah Pengajaran Murid Sederhana Dan Lemah Retrieved April 30,

2013, from http://khirkassim.blogspot.com/2009/02/kaedah-pengajaran-pelajar-

sederhana-dan.html

Khalifa, O. O., Ahmad, Z. H. & Gunawan, T. S. (2007). SMaTTS : Standard Malay

Text to Speech System, 285–293.

Latif, R. A. (1995). Penyemak Sintaksis Ayat Bahasa Malaysia. Universiti

Kebangsaan Malaysia, Bangi, Selangor.

Muhamad N. Y. & Jamaludin, Z. (2012). Malay declarative sentence: Visualization

and sentence correction, Open Systems (ICOS), 2012 IEEE Conference on ,

vol., no., pp.1,5, doi: 10.1109/ICOS.2012.6417659

Noor, M. N. & Jamaludin, Z. (2012). Parser with Sentence Correction for Malay

Language ( BM ), 45(Icikm), 138–142.

Noor, M.Y. & Jamaludin, Z., (2012) Malay declarative sentence: Visualization and

sentence correction, Open Systems (ICOS), 2012 IEEE Conference on , vol.,

no., pp.1,5, 21-24 Oct. 2012 doi: 10.1109/ICOS.2012.6417659

Perkins J. (2010). Python Text Processing with NLTK 2.0 Cookbook. Packt

Publishing, Birmingham, UK. ISBN 978-1-849513-60-9

Pfleeger, S. L. (1998). Software Engineering Theory and Practice, International

Edition, USA: Prentice Hall.

Page 36: MYPARSER: A MALAY TEXT CATEGORIZATION TOOLKIT USING

83

Rahman, S. A., Omar, N. B., Mohamed, H., Juzaidin, M., & Aziz, A. (2011). A

Synonym Contextual-based Process for Handling Word Similarity in Malay

Sentence.

Ramli, S. (2002). Reka bentuk dan implementasi suatu penghurai bahasa Melayu

menggunakan sistem logik selari. Universiti Putra Malaysia, Selangor.

Simin, A. (1988). Discourse-Syntax of "YANG" in Malay (Bahasa Malaysia),

Dewan Bahasa dan Pustaka, Kuala Lumpur

Sukor, M. N., Jeon, K., Yusof, Z. A. & Yusuf, K. (2006). Kurikulum Bersepadu

Sekolah Rendah Bahasa Melayu Tahun 5. Dewan Bahasa dan Pustaka.

Surabhi, M.C. (2013). "Natural language processing future," Optical Imaging Sensor

and Security (ICOSS). International Conference on , vol., no., pp.1,3, 2-3 July

2013 doi: 10.1109/ICOISS.2013.6678407

Talita, P. & Yeo, A. W. (2010). Challenges in Building Domain Ontology For

Minority Languages, (ICCAIE), 574–578.

Yang, Y. & Liu, X. (1999). A re-examination of text categorization methods.

Proceedings of the 22nd annual international ACM SIGIR conference on

Research and development in information retrieval - SIGIR ’99, 42–49.

doi:10.1145/312624.312647

Zakree, M. & Nazri, A. (2008). An Exploratory Study on Malay Processing Tool for

Acquisition of Taxonomy Using FCA. doi:10.1109/ISDA.2008.254

Zakree, M., Nazri, A., Shamsudin, S. M. & Bakar, A. A. (2008). An Exploratory

Study of the Malay Text Processing Tools in Ontology Learning.