
TRIBHUVAN UNIVERSITY

INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

A

FINAL YEAR PROJECT REPORT

ON

Name Conflict Resolution for Company Registration

By:

Gaurav Kumar Goyal (16214)

Janardan Chaudhary (16216)

Nimesh Mishra (16221)

Sanat Maharjan (16230)

A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS

AND COMPUTER ENGINEERING IN PARTIAL FULFILMENT OF

THE REQUIREMENT FOR THE BACHELORS DEGREE IN COMPUTER ENGINEERING

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

LALITPUR, NEPAL

AUGUST, 2013


INSTITUTE OF ENGINEERING

PULCHOWK CAMPUS

DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

The undersigned certify that they have read, and recommended to the Institute of Engineering

for final submission and presentation of the project entitled "Name Conflict Resolution for

Company Registration" submitted by Gaurav Kumar Goyal, Janardan Chaudhary, Nimesh

Mishra and Sanat Maharjan in partial fulfilment of the requirements for the Bachelors

degree in Computer Engineering.

_________________________________________________

Supervisor, Prof. Dr. Shashidhar Ram Joshi

Department of Electronics and Computer Engineering

_________________________________________________

Co-Supervisor, Er. Sansar Jung Dewan

IT Officer, Office of Company Registrar (OCR)

__________________________________________________

Internal Examiner, Baburam Dawadi


__________________________________________________

External Examiner, Anjesh Tuladhar

COO, Young Innovations Pvt. Ltd.

DATE OF APPROVAL: 25 Aug. 2013


COPYRIGHT

The authors have agreed that the Library, Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering may make this report freely available for inspection. Moreover, the authors have agreed that permission for extensive copying of this project report for scholarly purposes may be granted by the supervisors who supervised the project work recorded herein or, in their absence, by the Head of the Department wherein the project report was done. It is understood that recognition will be given to the authors of this report and to the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this project report. Copying, publication, or other use of this report for financial gain without the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus, Institute of Engineering and the authors' written permission is prohibited.

Request for permission to copy or to make any other use of the material in this report in

whole or in part should be addressed to:

Arun Timilsina, PhD/ Professor

Head of Department


Pulchowk Campus, Institute of Engineering

Lalitpur, Kathmandu

Nepal


ACKNOWLEDGEMENT

First of all, we would like to express our sincere gratitude to the Department of Electronics and Computer Engineering, Pulchowk Campus, for including the final year major project in the syllabus of the final year of the B.E. in Computer Engineering. We would like to extend our gratitude to Dr. Arun Timilsina, Head of Department, Electronics and Computer Engineering, for assisting us in our project.

We would like to express our gratitude to Prof. Dr. Shashidhar Ram Joshi for being our project supervisor.

We would also like to thank Dr. Aman Shakya for his support and assistance. We are deeply indebted to Er. Sansar Jung Dewan of the Office of Company Registrar, and to the Office of Company Registrar itself, for giving us the opportunity to do a project of such broad scope.

We would also like to express our sincere thanks to Mr. Bal Krishna Bal, Assistant Professor, Department of Electronics and Computer Engineering, Kathmandu University, for his help and support.

Last but not least, we would like to thank our friends and classmates for their help and valuable suggestions.


ABSTRACT

Natural language processing is one of the most actively researched fields. One of its applications is determining the similarity of sentences. Naming conflict resolution is a related problem in which names are compared word by word. Many systems have been developed for this purpose and are widely used.

In Nepal, naming conflicts during the registration of a company are currently resolved manually. There is, however, a need to automate the process. Automation requires natural language processing, translation, and transliteration between languages. The Office of Company Registrar (OCR) defines several constraints for the check, and these constraints must be considered while comparing words. The words must be tokenized and stemmed before they can be processed further.

Keywords:

OCR, Morphological Analysis, Similarity Matching, Natural Language Processing.


TABLE OF CONTENT

COPYRIGHT
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENT
TABLE OF FIGURES

Chapter 1: INTRODUCTION
1.1 Background
1.2 Motivation
1.3 Problem Statement
1.4 Objectives
1.5 Scope of the work

Chapter 2: LITERATURE REVIEW
2.1 Introduction
2.2 Common processes used in text similarity
2.2.1 Downcasting
2.2.2 Transformation
2.2.3 Stopword Removal
2.2.4 Tokenization
2.2.5 Stemming
2.3 Existing Name checking Systems
2.4 Criteria defined by OCR
2.5 Matching Techniques
2.5.1 Phonetic encoding
2.5.1.1 Soundex
2.5.1.2 Metaphone
2.5.2 Pattern matching
2.5.2.1 Levenshtein or Edit Distance
2.5.2.2 Sorenson similarity
2.5.2.3 Cosine Similarity
2.6 Summary

Chapter 3: REQUIREMENT ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.2.1 Reliability
3.2.2 Performance
3.2.3 Accuracy

Chapter 4: METHODOLOGY
4.1 Introduction
4.2 System Design
4.2.1 Flow Diagram
4.2.2 Deployment Diagram
4.2.3 System Architecture
4.2.3.1 Preprocessing Engine
4.2.3.2 Translation and Transliteration
4.2.3.3 Possible Keyword Generation
4.2.3.5 Ranking
4.2.4 Detailed Class Diagram
4.3 Project Tools
4.4 Eclipse as Programming IDE
4.5 MySQL as Database System

Chapter 5: EXPERIMENTAL SETUP

Chapter 6: OUTPUT

Chapter 7: RESULT AND ANALYSIS

Chapter 8: CONCLUSION AND FURTHER ENHANCEMENT
7.1 Conclusion
7.2 Limitations
7.3 Further Enhancement

REFERENCE

APPENDIX A: Gantt chart
APPENDIX B: Use Case
APPENDIX C: Preprocessing Detail Example
APPENDIX D: Comparison Detail
APPENDIX E: Output Screenshot
APPENDIX F: Data Flow Diagram
APPENDIX G: Theory


TABLE OF FIGURES

Figure 1 Flow Chart
Figure 2 Deployment Diagram
Figure 3 System Architecture
Figure 4 Preprocessing Engine
Figure 5 Detailed Class Diagram
Figure 6 Example - I
Figure 7 Example - II
Figure 8 Example - III
Figure 9 Example - IV
Figure 10 Example - V
Figure 11 Example - VI
Figure 12 Example - VII
Figure 13 Computation Time with Transformation
Figure 14 Time Computation with Transformation
Figure 15 Gantt Chart
Figure 16 Use Case Diagram
Figure 17 Comparison I (Part A)
Figure 18 Comparison I (Part B)
Figure 19 Comparison II (Part A)
Figure 20 Comparison II (Part B)
Figure 21 Output Screenshot
Figure 22 Data Flow Diagram


Chapter 1

INTRODUCTION

1.1 Background

Understanding language as a unit in machine terms is not as easy as it might seem. Words are perhaps the most intuitive units of language, yet they are in general tricky to define. In most languages, words are defined as the smallest linguistic units that can form a complete utterance by themselves. Natural language processing deals with the ambiguity in word processing.

The Office of Company Registrar (OCR) is responsible for maintaining law and order regarding companies. Almost all of the office's daily tasks used to be manual; the OCR has since moved towards automating them with computerized systems. Before the advent of the current online system, the processes for changing, admitting, and removing company names were difficult and cumbersome. Even after the recent development of the office's online system, the system is not intelligent enough. Currently the OCR has implemented database entity comparison features. The process of finding company names is often based on English names. The comparison features are, however, limited to entity-to-entity matching and phonetic matching. As a result, the existing system often fails to act responsively and accurately during the registration of a new company. The same problem arises when a new company tries to reserve its name.

The naming conflict resolution system for company registration finds the similarity between the proposed name of a company and the existing company names in the database. This requires some natural language processing techniques. First, the input is downcast and stop words are removed from the proposed name. The name is then transformed, tokenized, and stemmed to determine the root words used in similarity checking. The root words are then used to form probable tokens through translation and transliteration. These tokens are then matched against words from the database to rank similar names.

The system must translate Nepali words to English and vice versa. Translation is done with the help of dictionaries. Stop-word removal requires a pool of predefined words to be removed. The constraints are defined by the Office of Company Registrar. They include the use of plural words, case sensitivity, punctuation and spacing in names, the use of numbers, different phonetic spellings or spelling variations, and many others. The system will also assist in deciding whether or not to approve the proposed name. It will result in more efficient processing and faster registration of names.

1.2 Motivation

Almost all of the office's daily tasks used to be done manually, but the OCR has now moved towards automating them with computerized systems. Before the advent of the current online system, the processes for changing, admitting, and removing company names were difficult and cumbersome. Even after the recent development of the office's online system, the system is not intelligent enough. Currently the Office of Company Registrar (OCR) has implemented database entity comparison features. The process of finding company names is often based on English names. The comparison features are, however, limited to entity-to-entity matching and phonetic matching. As a result, the existing system often fails to act responsively and accurately during the registration of a new company.

These limitations of the current system motivated us to develop a more reliable and accurate system based on string matching algorithms, which produce more accurate results than the phonetic string matching approach currently used.

1.3 Problem Statement

A recent improvement in the registration of new companies is the addition of an online registration and name checking system. However, the current name checking system suffers from a lack of accuracy and from the drawbacks of matching names by their phonetic pronunciation.

In this project, we try to build a system that checks the validity of proposed names using string matching schemes rather than phonetic ones. Our objective is to determine the extent to which a proposed name is similar to existing names and, based on this, to determine whether the name is available for registration.


1.4 Objectives

The main objective of the project is to develop a system capable of checking the similarity of proposed company names with registered ones. This objective can be broken down as follows:

1. To develop a system to resolve naming conflicts.

2. To find names similar to the name proposed by the user.

3. To rank existing names by their similarity to the proposed name.

4. To define the threshold level used to validate a name.

1.5 Scope of the work

Name checking systems are used in many countries to check the proposed name of a company. A variety of approaches is available for developing such a system. The approach used here is based on NLP. The system is intended to check proposed names with much better accuracy than the current system, which will benefit both the clients and the OCR. It is based on research along with study and analysis of the existing system. The system produces its output as a .csv file containing the similarity scores of various existing names with respect to the proposed name.


Chapter 2

LITERATURE REVIEW

2.1 Introduction

This project is about checking the validity of proposed company names for the Office of Company Registrar. One of the important steps in developing such a system is to examine the relevant research areas thoroughly. Some knowledge of Natural Language Processing is needed to understand the processes used in this project. Existing systems were also studied thoroughly before designing this system.

Natural Language Processing (NLP) is a branch of computer and information science that deals with natural language information. NLP is a component of artificial intelligence. It is a form of human-to-computer interaction in which the elements of human language, be it spoken or written, are formalized so that a computer can perform value-adding tasks based on that interaction. Human language is dauntingly complex for a computer to understand. NLP is used in areas such as language translation, speech processing, and checking for grammatical errors.

2.2 Common processes used in text similarity

It is always useful to know about different types of processes used for NLP. Some of the

common processes are mentioned below:

2.2.1 Downcasting

Downcasting, as the term is used in this report, is the act of converting text from uppercase letters to lowercase. It is done to make sure that two company names are not treated as distinct merely because they differ in the case of their letters.

2.2.2 Transformation

Transformation is the conversion of words from British English spelling to American English spelling. Transformation is done to avoid the generation of unwanted or conflicting keywords.

2.2.3 Stopword Removal

Stop word removal is the process of removing predefined stop words from the string. We used this process to remove the words that the directives of the Office of Company Registrar treat as similar or unimportant.


2.2.4 Tokenization

Tokenization is the process of breaking up a string into tokens to be indexed, either with the help of predefined dictionaries or by analyzing the whitespace. These dictionaries can be a pool of predefined words or a bilingual English-Nepali dictionary.
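The dictionary-based variant can be illustrated with a short sketch. It is not the project's implementation: the function names and the sample vocabulary are hypothetical, and the choice of the split with the fewest dictionary words is an assumption made for illustration. Whitespace is used first, and each chunk is then split further only if it can be covered entirely by dictionary words.

```python
def segment(text, vocabulary):
    """Split text into dictionary words, preferring the fewest words.

    Returns None when text cannot be covered by words from vocabulary.
    """
    # best[i] holds the best known split of text[:i], or None if there is none.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]


def tokenize(name, vocabulary):
    """Tokenize on whitespace, then split joined words using the dictionary."""
    tokens = []
    for chunk in name.lower().split():
        parts = segment(chunk, vocabulary)
        # Chunks the dictionary cannot cover are kept as a single token.
        tokens.extend(parts if parts else [chunk])
    return tokens


# Hypothetical word pool; a real system would load it from its dictionaries.
vocabulary = {"him", "shikhar", "travel", "travels"}
print(tokenize("Himshikhar Travels", vocabulary))
```

With this word pool, a joined name such as "Himshikhar" is split into the same tokens as "Him Shikhar", while a word that is absent from the pool is passed through unchanged.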

2.2.5 Stemming

Stemming is the process of reducing a word to its root or to a simpler form; in this project it is mainly used to reduce plural forms to the singular. Stemming is often used in text processing applications. There are many different approaches to stemming, each with its own design goals. Some are aggressive, reducing words to the smallest root possible.
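The processes of Section 2.2 can be chained into a small preprocessing pipeline. The following is a minimal sketch under stated assumptions, not the system's actual code: the spelling table, the stop-word pool, and the plural-stripping rules are tiny hypothetical samples, and the steps are applied in a simplified order (tokenizing before stop-word removal).

```python
import re

# Hypothetical sample tables; the real pools would follow the OCR directives.
BRITISH_TO_AMERICAN = {"colour": "color", "centre": "center", "jewellery": "jewelry"}
STOP_WORDS = {"private", "pvt", "p", "limited", "ltd", "company", "co", "and", "the"}


def stem(token):
    """Very naive stemmer that only strips common English plural endings."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(name):
    """Downcast, tokenize, transform, remove stop words, and stem a name."""
    lowered = name.lower()                                     # downcasting
    tokens = re.findall(r"[a-z0-9]+", lowered)                 # tokenization, drops punctuation
    tokens = [BRITISH_TO_AMERICAN.get(t, t) for t in tokens]   # transformation
    tokens = [t for t in tokens if t not in STOP_WORDS]        # stop-word removal
    return [stem(t) for t in tokens]                           # stemming


print(preprocess("Him Shikhar Travels Pvt. Ltd."))
print(preprocess("The Colour Centre Limited"))
```

A production stemmer would need many more rules than the two shown here; the sketch only demonstrates how the individual steps fit together.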

2.3 Existing Name checking Systems

In order to develop an effective name checking system, it is important to study similar existing systems so that the system to be developed addresses some of their deficiencies. We mainly focused on the existing system used by the OCR in Nepal. A name checking system takes the name proposed by the customer and compares it with similar existing names. Based on the results, it determines whether the name may be registered.

1. Office of Company Registrar, Nepal

This system uses phonetic algorithms to check names. The customer visits the homepage of the OCR [1] and enters the proposed name. The system checks this name against existing names and determines whether it is valid. The existing system, however, suffers from a lack of accuracy.

2. Companies House, United Kingdom

This system is used by the government of the United Kingdom to check proposed names. The client can visit the website [2] and check the intended name. The system returns a list of existing similar names.

3. CIPC

CIPC stands for Companies and Intellectual Property Commission. Its system checks the availability of the name proposed by the customer. The client can visit the website [3], register by paying the fee, and then check the intended company name. The CIPC checks the name against existing registered businesses and rejects names that are too similar. The system also checks whether the name is reserved.


2.4 Criteria defined by OCR

In approving a proposed name of company, the following shall not be considered different or

distinguishable:

1. The words Private, Pvt., (P), Limited, Ltd, Ltd., Limited Liability.

2. The words appearing at the end of the names company, and company, co., co.

3. The plural version of any of the words appearing in the name.

4. The type and case of letters, spacing between letters and punctuation marks;

5. Joining words together or separating the words, as this does not make a name distinguishable from a name that uses the similar, separated or joined words. For example, Him Shikhar Travels Pvt. Ltd. will be considered similar to Himshikhar Travel.

6. The use of a different number or tense (in English) of the same word, as this does not distinguish one name from another. For example, Three Six Five Tours and Travels Pvt. Ltd. will be considered similar to 365 Tours and Travels Pvt. Ltd.

7. Using different phonetic spellings or spelling variations, as this does not distinguish one name from another. For example, if S.D. Enterprises Limited exists, then S and D Enterprises or Satya Darshan Enterprises will not be allowed.

8. Similarly if a name contains numeric character like 3, 6, and 7 resemblance shall be

checked with Three, six, and seven.

9. The use of an internet related designation, such as .COM, .NET, .EDU, .GOV, .ORG, .IN, as this does not make a name distinguishable from another.

10. The addition of words like New, Modern, Nav, Shri, Sri, Shree, Sree, Om, Jai, Sai,

The, etc., as this does not make a name distinguishable from an existing name such

as New Kantipur Publication Pvt., Shree Sai Enterprises.

11. The addition of the name of a place, like Kathmandu or Janakpur, as this does not make a name different or distinguishable. For example, Kathmandu Sugam Pharmaceuticals Private Ltd. cannot be allowed if Sugam Pharmaceuticals Private Ltd. already exists. Such names may be allowed only if a no-objection from the existing company, by way of a Board resolution, is produced or submitted.

12. Different combinations of the same words, as this does not make a name distinguishable from an existing name. For example, if there is a company in existence by the name of Builders and Contractors Limited, the name Contractors and Builders Limited should not be allowed.

13. Exact Nepali translation of the name of an existing company in English or other

language. For example, Kathmandu Dairy Industry Limited will not be allowed if

there exists a company with name Kathmandu Dugdh Udyog Limited.
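Several of these criteria can be checked mechanically by normalizing both names before comparing them. The sketch below is an illustration only and covers a subset of the rules (designations and prefixes, plurals, case and punctuation, joined words, digits, internet designations, and word order). The word lists, function names, and plural rule are hypothetical simplifications; place names (criterion 11), initials expanded to full words (criterion 7), and translations (criterion 13) are not handled.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
# Hypothetical sample of words that do not distinguish names (criteria 1, 2, 10).
IGNORED_WORDS = {
    "private", "pvt", "p", "limited", "ltd", "liability", "company", "co", "and",
    "new", "modern", "nav", "shri", "sri", "shree", "sree", "om", "jai", "sai", "the",
}
# Internet designations (criterion 9), matched only when preceded by a dot.
INTERNET_SUFFIX = re.compile(r"\.(com|net|edu|gov|org|in)\b")


def singular(token):
    """Naive plural stripping for criterion 3."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalize(name):
    """Reduce a name to the tokens that remain significant under the criteria."""
    text = INTERNET_SUFFIX.sub(" ", name.lower())
    # Spell out every digit so that 365 matches Three Six Five (criteria 6, 8).
    text = re.sub(r"[0-9]", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)
    tokens = re.findall(r"[a-z]+", text)  # drops punctuation and spacing (criterion 4)
    return [singular(t) for t in tokens if t not in IGNORED_WORDS]


def conflicts(first, second):
    """True if the names match when joined (criterion 5) or reordered (criterion 12)."""
    a, b = normalize(first), normalize(second)
    return "".join(a) == "".join(b) or sorted(a) == sorted(b)


print(conflicts("Him Shikhar Travels Pvt. Ltd.", "Himshikhar Travel"))
print(conflicts("Builders and Contractors Limited", "Contractors and Builders Limited"))
```

Exact comparison after normalization only catches names that become identical; the matching techniques in the next section are needed to score names that remain merely similar.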

2.5 Matching Techniques

Name matching can be defined as the process of determining whether two name strings are instances of the same name [18]. As name variations and errors are quite common [17], exact name comparison will not result in good matching quality. Rather, an approximate measure of how similar two names are is desired. Generally, a normalized similarity measure between 1.0 (two names are identical) and 0.0 (two names are totally different) is used.

The two main approaches for matching names are phonetic encoding and pattern matching.

Different techniques have been developed for both approaches, and several techniques

combine the two with the aim to improve the matching quality.

2.5.1 Phonetic encoding

Common to all phonetic encoding techniques is that they attempt to convert a string into a

code according to how a string is pronounced (i.e. the way a string is spoken).

Naturally, this process is language dependent. Most techniques have been developed mainly

with English in mind.

2.5.1.1 Soundex

Soundex, based on English language pronunciation, is the oldest and best known phonetic encoding algorithm. It keeps the first letter in a string and converts the rest into numbers according to the following encoding table.

Letters                   Code
a, e, h, i, o, u, w, y    0
b, f, p, v                1
c, g, j, k, q, s, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6

All zeros (vowels and h, w and y) are then removed and sequences of the same number

are reduced to one only (e.g. 333 is replaced with 3). The final code is the original first

letter and three numbers (longer codes are cut-off, and shorter codes are extended with

zeros). As examples, the Soundex code for peter is p360, while the code for christen is

c623. A major drawback of Soundex is that it keeps the first letter, thus any error or

variation at the beginning of a name will result in a different Soundex code.
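The encoding described above can be written in a few lines. This sketch follows the simplified description given in this section (zeros are removed first, then repeated digits are collapsed), so it may differ from the standard American Soundex for some names; the function name is chosen for illustration.

```python
# Letter groups and their digits, as given in the encoding table.
GROUPS = [
    ("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
    ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6"),
]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}


def soundex(name):
    """Return the first letter followed by three digits, e.g. p360."""
    letters = [c for c in name.lower() if c in CODES]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    # Encode the remaining letters and drop the zeros (vowels, h, w and y).
    digits = [CODES[c] for c in rest if CODES[c] != "0"]
    # Reduce runs of the same digit to a single digit.
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    # Pad short codes with zeros and cut long codes to three digits.
    return first + ("".join(collapsed) + "000")[:3]


print(soundex("peter"))     # p360
print(soundex("christen"))  # c623
print(soundex("kristen"))   # k623
```

The last call shows the drawback noted above: "christen" and "kristen" sound alike but receive different codes because their first letters differ.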

2.5.1.2 Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing

words by their English pronunciation. It fundamentally improves on the Soundex algorithm

by using information about variations and inconsistencies in English spelling and

pronunciation to produce a more accurate encoding, which does a better job of matching

words and names which sound similar. As with Soundex, similar sounding words should

share the same keys.

The original author later produced a new version of the algorithm, which he named Double

Metaphone. Contrary to the original algorithm whose application is limited to English only,

this version takes into account spelling peculiarities of a number of other languages. In 2009

Lawrence Philips released a third version, called Metaphone 3, which is reported to achieve an accuracy

of approximately 99% for English words, non-English words familiar to Americans, and first

names and family names commonly found in the United States, having been developed

according to modern engineering standards against a test harness of prepared correct

encodings.

2.5.2 Pattern matching

Pattern matching techniques are commonly used in approximate string matching [24, 25],

which has widespread applications, from data linkage [22, 23] and duplicate detection [20,

21], information retrieval [26], correction of spelling errors [27], approximate database joins,

to bio- and health informatics [25]. These techniques can broadly be classified into edit

distance and q-gram based techniques, plus several techniques specifically developed for

name matching.

A normalized similarity measure between 1.0 (strings are the same) and 0.0 (strings are

totally different) is usually calculated. We will denote the length of a string s with |s|.


2.5.2.1 Levenshtein or Edit Distance

The Levenshtein distance [28] is defined to be the smallest number of edit operations

(insertions, deletions and substitutions) required to change one string into another. In its basic

form, each edit has cost 1. Using a dynamic programming algorithm [17], the distance (number of edits) between two strings s1 and s2 can be calculated in time O(|s1| × |s2|) using O(min(|s1|, |s2|)) space. The distance can be converted into a similarity measure (between 0.0 and 1.0) using

\[ \mathrm{sim}_{ld}(s_1, s_2) = 1.0 - \frac{\mathrm{dist}_{ld}(s_1, s_2)}{\max(|s_1|, |s_2|)} \tag{1} \]

with dist_ld(s1, s2) being the actual Levenshtein distance function, which returns a value of 0 if the strings are the same or a positive number of edits if they are different. The Levenshtein distance is symmetric, and two properties always hold: 0 ≤ dist_ld(s1, s2) ≤ max(|s1|, |s2|), and abs(|s1| − |s2|) ≤ dist_ld(s1, s2). The second property allows quick filtering of string pairs that have a large difference in their lengths.

The distance between "Bob" and "Bob" is zero (0), because no edits are required to convert

a string into itself. The edit distance between strings is only zero if the strings are identical.

The distance between "Brett" and "Brent" is one (1), because it requires a substitution of an "n" for a "t". The distance between "Brett" and "Bret" is one, requiring the deletion of one of the two "t" characters in "Brett". The sequence of edits must be minimal, but need not be unique. Further note that "Bret" can be converted to "Brett" with a single insertion of a "t" character.

The distance between "Bob" and "bob" is also 1, as it requires the substitution of a lowercase "b" for its uppercase equivalent "B".

Levenshtein distance is used to calculate the similarity of two strings. A standard Levenshtein distance is reported to be about 40% accurate [19]; standardizing the orthography of the strings can improve this to at most about 65% [3].
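A minimal Java sketch of the unit-cost Levenshtein distance and the normalized similarity of equation (1) is given below. It keeps only two rows of the dynamic programming table, which gives the O(min(|s1|, |s2|)) space bound mentioned above. The class and method names are illustrative and are not taken from the project's source code.

```java
/** Unit-cost Levenshtein distance with two rolling rows (illustrative sketch). */
final class Levenshtein {

    static int distance(String s1, String s2) {
        // Make s2 the shorter string so the rows have min(|s1|, |s2|) + 1 cells.
        if (s1.length() < s2.length()) {
            String tmp = s1;
            s1 = s2;
            s2 = tmp;
        }
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            prev[j] = j; // turning an empty prefix into j characters takes j insertions
        }
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i; // deleting i characters
            for (int j = 1; j <= s2.length(); j++) {
                int substitution = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }

    /** Similarity in [0.0, 1.0] as in equation (1); two empty strings count as identical. */
    static double similarity(String s1, String s2) {
        int longest = Math.max(s1.length(), s2.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(s1, s2) / longest;
    }
}
```

With this sketch, distance("Brett", "Brent") and distance("Bob", "bob") both evaluate to 1, in line with the examples above. The comparison is case-sensitive, which is one reason names are converted to lowercase before matching.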

2.5.2.2 Sorensen similarity

The Sorensen index, also known as Sorensen's similarity coefficient, is a statistic used for comparing the similarity of two samples. It was developed by the botanist Thorvald Sorensen and published in 1948. Sorensen's original formula was intended to be applied to presence/absence data, and is

\[ QS = \frac{2C}{A + B} = \frac{2\,|A \cap B|}{|A| + |B|} \tag{2} \]

where A and B are the number of species in samples A and B, respectively (written |A| and |B| in the set form), and C is the number of species shared by the two samples; QS is the quotient of similarity and ranges from 0 to 1. This expression is easily extended to abundance instead of presence/absence of species. The Sorensen index is identical to Dice's coefficient, which is always in the [0, 1] range.
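As an illustration of equation (2), the following Java sketch computes the coefficient for two sets. Applying it to sets of character bigrams is a common way to use it on strings, but that choice, like the class and method names, is an assumption made for this example and not something prescribed by the original formula.

```java
import java.util.HashSet;
import java.util.Set;

/** Sorensen-Dice coefficient on sets, following equation (2) (illustrative sketch). */
final class SorensenDice {

    static <T> double coefficient(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen here: two empty samples are identical
        }
        Set<T> shared = new HashSet<T>(a);
        shared.retainAll(b); // elements present in both samples
        return 2.0 * shared.size() / (a.size() + b.size());
    }

    /** Set of adjacent character pairs, e.g. "night" gives ni, ig, gh, ht. */
    static Set<String> bigrams(String s) {
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            result.add(s.substring(i, i + 2));
        }
        return result;
    }
}
```

For example, "night" and "nacht" share one bigram ("ht") out of four each, so the coefficient is 2 × 1 / (4 + 4) = 0.25.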

2.5.2.3 Cosine Similarity

The cosine of two vectors can be easily derived by using the Euclidean dot product formula:

\[ A \cdot B = \|A\| \, \|B\| \cos\theta \tag{3} \]

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

\[ \text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \tag{4} \]

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the

documents. The cosine similarity can be seen as a method of normalizing document length

during comparison.
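Equation (4) can be sketched in Java for term frequency vectors as shown below. Splitting on whitespace to obtain the terms, and the names CosineSimilarity, termFrequencies and cosine, are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Cosine similarity of two term frequency vectors, as in equation (4) (illustrative sketch). */
final class CosineSimilarity {

    /** Counts how often each whitespace-separated term occurs in the text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String term : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Integer count = tf.get(term);
            tf.put(term, count == null ? 1 : count + 1);
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared terms contribute to the dot product
            }
        }
        for (int value : b.values()) {
            normB += value * value;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector has no direction; treated here as no similarity
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because term frequencies are never negative, the value returned for text stays in the range 0 to 1. Word order does not affect the result: "nepal metal" and "metal nepal" produce the same vector and therefore a similarity of 1.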

2.6 Summary

1. The background study focused on the uses of name checking systems, their

effectiveness and usefulness.

2. It helped us identify the design, methodologies, and programming tools that should be used to develop this system.

3. It also examined the existing systems, their merits and their flaws.


Chapter 3

REQUIREMENT ANALYSIS

3.1 Functional Requirements

1. A true reflection of lexical similarity

Strings with small differences should be recognized as being similar. In particular, a

significant substring overlap should point to a high level of similarity between the strings.

2. A robustness to changes of word order

Two strings which contain the same words, but in a different order, should be recognized

as being similar. On the other hand, if one string is just a random anagram of the

characters contained in the other, then it should (usually) be recognized as dissimilar.

3. Language Independence

The system should work not only for English words, but also for Nepali words.

4. Output file format

The result should be stored in a file in comma-separated values (CSV) format.

5. Easy integration

The system should be easy to integrate with the existing system. The system should be

easy to maintain by the maintenance personnel.
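For the output file format requirement above, one result row could be produced as in the following Java sketch. The two-column layout (company name, similarity percentage) and the names CsvRow, escape and format are hypothetical; the report does not specify the exact columns.

```java
import java.util.Locale;

/** Formats one result row as CSV; the column layout is an assumption for illustration. */
final class CsvRow {

    /** Quotes a field when it contains a comma, a quote or a line break. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Inside a quoted field, a double quote is written as two double quotes.
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    static String format(String companyName, double similarityPercent) {
        return escape(companyName) + ","
                + String.format(Locale.ROOT, "%.2f", similarityPercent);
    }
}
```

Quoting matters here because company names can themselves contain commas, for example "Nepal Metals, Pvt. Ltd.".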

3.2 Non-Functional Requirements

3.2.1 Reliability

The system is required to be available at all times. This can be achieved by hosting it on a reliable server. The system is also built in Java, whose built-in memory management adds further confidence in its reliability.

3.2.2 Performance

The system would be used by numerous customers throughout the country, so it is required to produce its output in minimum time. The main concern was the time taken to query the database to extract the relevant names and to calculate the similarity scores. This time depends on the type of processor used. The overall time from the customer's submission of a name to the output was about 1 minute, but this again depends on the number of tokens generated.


3.2.3 Accuracy

The system is intended to work in real time, so high accuracy must be maintained. This is ensured by using Morphanalyser and the Levenshtein algorithm in conjunction with the Kuhn-Munkres (Hungarian) algorithm and the Sorensen algorithm.


Chapter 4

METHODOLOGY

4.1 Introduction

Methodology is the analysis of the tasks to be carried out in order to obtain the desired output. An appropriate methodology largely determines whether a project succeeds. For this system, a number of methodologies were considered and the most efficient ones were used. This does not mean that one particular method was used; rather, the most appropriate ones for the system were used in combination.

The model used here is an iterative model: a small subset of the software requirements is developed first, and further versions are then enhanced through redesign and redevelopment. This process continues until the system produces the results stated in the system requirements.

The chosen methodology was revised during the project whenever the design revealed flaws, so appropriate methodologies were adopted as the situation required. In our case, the methodology comprises five steps.

1. Building Base Dictionary

2. Possible Keyword Generation

3. Finding Possible Matches

4. Finding Duplicates

5. Finding Ranks

1. Building Base Dictionary

A base dictionary can be generated from the existing name database provided by OCR. This can be done manually. The base dictionary used in our project consists of a file containing English words, a dictionary for transliteration, and a Nepali to English dictionary (provided by Madan Puraskar Pustakalaya). These dictionaries help in tokenization and in generating the possible keywords.

2. Possible Keyword Generation:

After tokenizing the given name, a possible combination of the keywords is generated using

both English and Nepali words similar to them. After obtaining base keywords, these

keywords are transliterated and combined in every possible manner to form the probable

similar keywords. These keywords are used to match against names in OCR Names database.


3. Finding Possible Matches:

Possible names generated using base keywords are matched against OCR names database.

For this, the names containing any of the keywords are extracted from the names database.

Each of these names is checked against the proposed name. The aim is to collect as many records

as possible for better results. These records can contain duplicates too.

4. Finding Duplicates:

The names extracted from the Names database may occur more than once, so the repeated occurrences are removed and each name is kept only once. Duplication occurs when a name in the database contains two or more of the probable keywords.
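Removing the repeated occurrences while keeping the first one can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes that a candidate is identified by its name string alone.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Drops repeated candidate names while preserving first-seen order (illustrative sketch). */
final class CandidateDeduplicator {

    static List<String> distinct(List<String> candidates) {
        // A LinkedHashSet ignores repeated elements and remembers insertion order.
        return new ArrayList<String>(new LinkedHashSet<String>(candidates));
    }
}
```

A name such as "nepal metal udyog" that was retrieved once for the keyword "nepal" and once for "metal" appears a single time in the returned list.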

5. Finding Ranks:

The proposed name is assigned a value against each name extracted from the Names database. The value signifies the extent of matching. For calculating the match, we used:

The Levenshtein algorithm, to calculate the similarity between the tokens of the proposed name and those of the name extracted from the database.

The Kuhn-Munkres algorithm (also known as the Hungarian method), to find the optimal assignment of similarity weights between the tokens of the two strings being compared, i.e. the assignment that maximizes the sum of similarity weights.

Sorensen's similarity coefficient, to derive a single similarity score (between 0 and 1) from the result obtained through the Hungarian method.

4.2 System Design

This section gives a detailed review of the design on which the system is implemented. It includes

1. Flow diagram

2. Deployment diagram

3. System architecture

4. Detailed class diagram


4.2.1 Flow Diagram

Figure 1 Flow Chart


4.2.2 Deployment Diagram

The application is built around a client/server architecture. Multiple client machines can interact with the server simultaneously. Clients interact with the system through OCR's interactive website, while the server serves the clients' requests and does the processing in the backend.

Figure 2 Deployment Diagram


4.2.3 System Architecture

[Figure 3 components: User Input; Query Processing; Preprocessing Engine; Translator + Transliterator; English-Nepali Dictionary; Keywords Generator; Database; Index Processor; Indexed Record; Preprocessing Engine; Comparator; Ranking Engine; Result Visualization]

Figure 3 System Architecture


4.2.3.1 Preprocessing Engine

The Preprocessing Engine applies five processes to the user input.

1. Downcasting

Downcasting, as used here, is the conversion of uppercase letters to lowercase. It ensures that two company names which differ only in letter case are not treated as distinct names.

2. Transformation

Transformation is the conversion of British English words to their American English equivalents. It is done to avoid generating unwanted or conflicting keywords. Our dictionary consists of around 130 commonly used words that are converted from British English to American English whenever they are found.

3. Stopword Removal

Stop word removal is the process of removing predefined stop words from the string. We used this process to remove the words that are considered similar or unimportant according to the Office of the Company Registrar. Words such as Shree, New, Modern, Industry, Udyog, Company, etc. are removed.

[Figure 4 stages: Downcasting, Transformation, Stopword removal (using a pool of stopwords), Tokenization, Stemming]

Figure 4 Preprocessing Engine

4. Tokenization

Tokenization is the process of breaking up a string into tokens to be indexed, using predefined dictionaries or by analyzing the whitespace. These dictionaries can be a pool of predefined words or a bilingual English-Nepali dictionary. Proper handling of strings, numbers and symbols is also important. For instance, tokenizing "nepal metals" outputs "nepal" and "metals".

5. Stemming

Stemming is the process of reducing a word, such as a plural form, to its root or to a simpler form. Stemming is often used in text processing applications. There are many different approaches to stemming, each with its own design goals. Some are aggressive, reducing words to the smallest root possible. Here, stemming is done with the help of a morphological analyzer, so that the results are English dictionary words. For example, words like "services" and "metals" are reduced to the simpler singular forms "service" and "metal".

We used stemming to obtain dictionary-based root words. Using root words simplified the matching process.
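The five preprocessing steps can be sketched together in Java as shown below. This is only an illustration under several assumptions: the stop words are the examples listed above rather than the full pool, the two British-to-American entries are hypothetical stand-ins for the dictionary of about 130 words, tokenization is done on whitespace only, and a naive plural-stripping rule stands in for the morphological analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Downcasting, transformation, stop word removal, tokenization, stemming (illustrative sketch). */
final class PreprocessingSketch {

    // Example stop words from the text; the real pool is defined by OCR.
    static final Set<String> STOPWORDS = new HashSet<String>(Arrays.asList(
            "shree", "new", "modern", "industry", "udyog", "company"));

    // Hypothetical entries standing in for the British-to-American dictionary.
    static final Map<String, String> TRANSFORM = new HashMap<String, String>();
    static {
        TRANSFORM.put("colour", "color");
        TRANSFORM.put("centre", "center");
    }

    /** Naive stand-in for the morphological analyzer: strips a plural "s". */
    static String stem(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    static List<String> preprocess(String name) {
        List<String> tokens = new ArrayList<String>();
        // Downcasting, then whitespace tokenization.
        for (String word : name.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Transformation: British spelling to American spelling.
            String w = TRANSFORM.containsKey(word) ? TRANSFORM.get(word) : word;
            // Stop word removal.
            if (STOPWORDS.contains(w)) {
                continue;
            }
            // Stemming.
            tokens.add(stem(w));
        }
        return tokens;
    }
}
```

For the input "Shree Nepal Metals Company" this sketch returns the tokens "nepal" and "metal". Dictionary-based splitting of joined words is not covered.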

4.2.3.2 Translation and Transliteration

Translation is the communication of the meaning of a source-language text by means of an equivalent target-language text. In this process, the equivalent Nepali text is obtained for the English words: each keyword is first matched against the English dictionary, and the matched words are then mapped using the English-Nepali dictionary provided by Madan Puraskar Pustakalaya. The unmatched words are simply placed alongside the translated tokens. For example, the words "nepal" and "metal" are mapped through the dictionary to get the words नेपाल and धातु.

Transliteration is the conversion of a text from one script to another. To transliterate a Nepali word into English letters, we used dictionary mapping to map each individual Nepali syllable to English letters. In the translation example above, the words नेपाल and धातु are transliterated to "nepal" and "dhatu" and then added to the pool of keywords for further processing.
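The syllable mapping used for transliteration can be sketched in Java as below. The mapping table is a tiny hypothetical subset that covers only the two example words, and the handling of the inherent vowel (inserted between consonants, dropped at the end of a word) is a simplifying assumption rather than the project's actual dictionary.

```java
import java.util.HashMap;
import java.util.Map;

/** Maps Devanagari characters to Latin letters using a small lookup table (illustrative sketch). */
final class DevanagariTransliterator {

    private static final Map<Character, String> CONSONANTS = new HashMap<Character, String>();
    private static final Map<Character, String> VOWEL_SIGNS = new HashMap<Character, String>();
    private static final char VIRAMA = '\u094D'; // suppresses the inherent vowel

    static {
        CONSONANTS.put('\u0928', "n");   // letter na
        CONSONANTS.put('\u092A', "p");   // letter pa
        CONSONANTS.put('\u0932', "l");   // letter la
        CONSONANTS.put('\u0927', "dh");  // letter dha
        CONSONANTS.put('\u0924', "t");   // letter ta
        VOWEL_SIGNS.put('\u093E', "a");  // vowel sign aa
        VOWEL_SIGNS.put('\u0947', "e");  // vowel sign e
        VOWEL_SIGNS.put('\u0941', "u");  // vowel sign u
    }

    static String transliterate(String word) {
        StringBuilder out = new StringBuilder();
        boolean inherentVowelPending = false;
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (CONSONANTS.containsKey(c)) {
                if (inherentVowelPending) {
                    out.append('a'); // previous consonant had no vowel sign
                }
                out.append(CONSONANTS.get(c));
                inherentVowelPending = true;
            } else if (VOWEL_SIGNS.containsKey(c)) {
                out.append(VOWEL_SIGNS.get(c)); // replaces the inherent vowel
                inherentVowelPending = false;
            } else if (c == VIRAMA) {
                inherentVowelPending = false;
            } else {
                if (inherentVowelPending) {
                    out.append('a');
                }
                out.append(c); // unknown characters are copied unchanged
                inherentVowelPending = false;
            }
        }
        // A pending inherent vowel at the end of the word is dropped.
        return out.toString();
    }
}
```

With this table, नेपाल is transliterated to "nepal" and धातु to "dhatu", matching the example above. A usable transliterator would need the full set of consonants, independent vowels and conjuncts.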

4.2.3.3 Possible Keyword Generation

Keywords are generated by combining the keywords obtained from stemming with those obtained after transliteration. The generated keywords are then used to build a list of the company names in the database that contain those keywords. Each company name in the list is in turn processed by the preprocessing engine, and its stemmed keywords are extracted and kept as the indexed record for that company name, to be used in the comparison.

4.2.3.4 Comparison

Comparison is done between the tokens obtained from the company name entered by the user and the tokens generated from the company names extracted from the database on the basis of the user's keywords.

The Levenshtein algorithm and the Kuhn-Munkres algorithm (Hungarian method) were used to compare the strings. The similarity is calculated in three steps:

1. Partition each name into a list of tokens.

2. Eliminate the common tokens.

3. Compute the similarity between the dissimilar tokens by using a string edit-distance algorithm.

The first method uses an edit-distance string matching algorithm: Levenshtein. The string edit distance is the total cost of transforming one string into another using a set of edit rules, each of which has an associated cost. The Levenshtein distance is obtained by finding the cheapest way to transform one string into another. Transformations are the one-step operations of (single-character) insertion, deletion and substitution. In the simplest version, substitutions cost two units except when the source and target are identical, in which case the cost is zero. Insertions and deletions cost half as much as substitutions.

Application of Hungarian Algorithm for Optimization

The result of the Levenshtein method provides the edge weights of a bipartite graph, on which the Hungarian algorithm is applied. A related classical problem on matching in bipartite graphs is the assignment problem, which is the quest to find the optimal assignment of workers to jobs that maximizes the sum of ratings, given all non-negative ratings Cost[i,j] of each worker i for each job j. Here, the tokens of the two names take the roles of workers and jobs, and the token similarities are the ratings.

All relation scores are in the [0, 1] range; a score with the maximum value (equal to 1) means that the two strings are identical.

Application of Sorensen's Similarity Coefficient

The result of the Hungarian method, which is the sum of similarity weights, is then passed to the Sorensen index to obtain the final single-value similarity score between the strings being compared. This final score (whose value lies between 0 and 1) is then converted into a percentage by multiplying it by 100.
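The three stages of the comparison can be put together as in the Java sketch below. Two points are assumptions and not taken from the project's code. First, an exhaustive search over all assignments is used in place of the Kuhn-Munkres algorithm; it finds the same maximum sum but is only practical for the handful of tokens in a company name. Second, the Sorensen step is taken to be 2 × (sum of matched similarities) / (number of tokens in both names), by analogy with equation (2). All names in the sketch are hypothetical.

```java
/** Token similarity, optimal assignment and Sorensen-style score (illustrative sketch). */
final class TokenMatcher {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1], prev[j]) + 1, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Normalized Levenshtein similarity of two tokens, in [0, 1]. */
    static double tokenSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / longest;
    }

    /**
     * Maximum sum of weights when every row is assigned a distinct column.
     * Exhaustive search standing in for Kuhn-Munkres; requires rows <= columns.
     */
    static double bestSum(double[][] weights, int row, boolean[] usedColumns) {
        if (row == weights.length) {
            return 0.0;
        }
        double best = 0.0;
        for (int col = 0; col < usedColumns.length; col++) {
            if (usedColumns[col]) {
                continue;
            }
            usedColumns[col] = true;
            best = Math.max(best, weights[row][col] + bestSum(weights, row + 1, usedColumns));
            usedColumns[col] = false;
        }
        return best;
    }

    /** Similarity score in [0, 1] between two token lists. */
    static double score(String[] x, String[] y) {
        if (x.length == 0 || y.length == 0) {
            return 0.0;
        }
        // The shorter list supplies the rows so that every row can be matched.
        String[] rows = x.length <= y.length ? x : y;
        String[] cols = x.length <= y.length ? y : x;
        double[][] weights = new double[rows.length][cols.length];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < cols.length; j++) {
                weights[i][j] = tokenSimilarity(rows[i], cols[j]);
            }
        }
        double matched = bestSum(weights, 0, new boolean[cols.length]);
        return 2.0 * matched / (x.length + y.length);
    }
}
```

Under these assumptions, the token lists {"nepal", "metal"} and {"metal", "nepal"} score 1.0, which reflects the required robustness to word order, while {"nepal", "metal"} against {"nepal"} scores 2 × 1 / 3, about 0.67. Multiplying by 100 gives the percentage.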

4.2.3.5 Ranking

The result of each and every permutation is taken into consideration and the maximum matched percentage score is chosen. A list of company names is then generated, ordered by percentage similarity score.
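The ranking step can be sketched in Java as follows: the highest score seen for each company name is kept, and the names are then sorted by that score in descending order. The parallel-array input and the names Ranking and rank are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Keeps the best score per company name and orders names by it (illustrative sketch). */
final class Ranking {

    /** names[i] was scored percentages[i]; a name may occur once per permutation. */
    static List<String> rank(String[] names, double[] percentages) {
        final Map<String, Double> best = new LinkedHashMap<String, Double>();
        for (int i = 0; i < names.length; i++) {
            Double previous = best.get(names[i]);
            if (previous == null || percentages[i] > previous) {
                best.put(names[i], percentages[i]); // keep the maximum score
            }
        }
        List<String> ranked = new ArrayList<String>(best.keySet());
        // The sort is stable, so names with equal scores keep their first-seen order.
        Collections.sort(ranked, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Double.compare(best.get(b), best.get(a)); // highest score first
            }
        });
        return ranked;
    }
}
```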


4.2.4 Detailed Class Diagram

Figure 5 Detailed Class Diagram


The system is implemented using the object-oriented methodology. We have not used a framework of any kind. Some of the core classes of the system are shown along with their associations.

Comparison System

This system compares the result received from the preprocessing engine for the user input with the list obtained from the database.

1. HungarianAlgorithmEdu Class

In this class we use the Hungarian algorithm to compute the highest possible matching score between the tokens of the two inputs. The input is the weight matrix obtained from the Hybrid class and the output is the similarity score. The hgAlgorithm() method performs the Hungarian algorithm, and the final similarity score is returned by the getScore() method.

2. Hybrid Class

In this class we use the Levenshtein distance algorithm. The class calculates the edit distance between two tokens and gives the similarity score between them. The ComputeDistance() method computes the edit distance and GetSimilarity() returns the similarity between the tokens.

3. Permutation Class

This class permutes the user input tokens and the tokens obtained from their transliteration, each among its own tokens. The permute() method performs the permutation.

4. MatchsMaker Class

This is the main class of the comparison system. It calls each of its components to perform the comparison and returns the output as a similarity percentage. GetScore() returns the similarity percentage and Initialize() initializes the necessary components.


Database System

1. DatabaseCredentials Class

This class is used to store the database credentials: username, password and connection path. It can also be used as a JavaBean with set/get methods.

2. DatabaseHandler

The DatabaseHandler class is used to initiate the database connection and to declare the database type.

3. CookSQL Class

This class is used to prepare SQL statements.

4. CompanyNameEnglish Class

This class is the core of the package. This class contains the methods for individual

record manipulation and resultset retrieval.

5. ConnectDatabase

This class is the bridge between database and the main interface and other class. This

class is used to hide the details of the underlying database implementations.

Preprocessing Engine

This engine contains the components used to downcast, clean, transform, remove stop words, stem and tokenize.

1. SpaceProcessor Class

This class is used to tokenize a company name based on space and hyphen (-) and to rejoin the individual tokens if necessary.

getSplittedText(): This method splits the company name into tokens.

joinSplittedText(): This method joins the tokens with spaces to regenerate the company name.
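A possible shape for these two methods is sketched below in Java. Only the method names come from the description above; the parameter and return types, and the class name SpaceProcessorSketch, are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits on spaces and hyphens and rejoins with single spaces (illustrative sketch). */
final class SpaceProcessorSketch {

    static List<String> getSplittedText(String companyName) {
        List<String> tokens = new ArrayList<String>();
        // One or more whitespace characters or hyphens act as a single separator.
        for (String part : companyName.trim().split("[\\s-]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    static String joinSplittedText(List<String> tokens) {
        StringBuilder joined = new StringBuilder();
        for (String token : tokens) {
            if (joined.length() > 0) {
                joined.append(' ');
            }
            joined.append(token);
        }
        return joined.toString();
    }
}
```

Splitting "nepal one-travels  tours" and joining the tokens again yields "nepal one travels tours".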

2. StopwordRemover Class

This class is used to remove the stop words as defined by the OCR directives.

3. Stemmer Class

Stemmer class contains methods to generate root words. Stemming is achieved using

the Snowball stemmer and morphological analysis.

4. SymbolProcessor

This class is used to clean the illegal symbols from names.


4.3 Project Tools

Programming Language: Java SE 7

Database: MySQL Server Version 5.1.41

Testing: JUnit testing

Drawings: MS Paint, MS Visio, ArgoUML, Adobe Photoshop

Documentation: MS Word/Excel/PowerPoint

Platform: Windows

IDE: Eclipse Indigo

4.4 Eclipse as Programming IDE

Eclipse was used as IDE for project development. Eclipse is a multi-language software

development platform comprising an IDE and a plug-in system to extend it. It is written

primarily in Java and is used to develop applications in this language and, by means of the

various plug-ins, in other languages as well: C/C++, COBOL, Python, Perl, PHP and more.

The initial codebase originated from VisualAge. In its default form it is meant for Java

developers, consisting of the Java Development Tools (JDT). Users can extend its

capabilities by installing plug-ins written for the Eclipse software framework, such as

development toolkits for other programming languages, and can write and contribute their

own plug-in modules. Language packs provide translations into over a dozen natural

languages. Released under the terms of the Eclipse Public License, Eclipse is free and open

source software.

4.5 MySQL as Database System

MySQL was used as database server. It is a relational database management system

(RDBMS) which has more than 11 million installations. The program runs as a server

providing multi-user access to a number of databases. The project's source code is available

under terms of the GNU General Public License, as well as under a variety of proprietary

agreements.


Chapter 5

EXPERIMENTAL SETUP

Hardware Configuration used for Testing

Hardware Configuration:

Computer Model: DELL 5110

Physical Memory (RAM): 4.00 GB, DDR2

Processor: Intel(R) Core(TM) i5-2450M CPU, 2.5 GHz

System Type: 64-bit Operating System, x64-based processor

Cache Size: 4096 KB

OS: Windows 8 Enterprise


Database with 111,161 records of company names.

Computer Model: Acer Aspire E1-531

Physical Memory (RAM): 4.00 GB, DDR2

Processor: Intel B960 Dual Core processor (2.2 GHz, 2MB L3 cache)

System Type: 64-bit Operating System, x64-based processor

Cache Size: 4096 KB

OS: Windows 8 Enterprise


Database with 111,161 records of company names.


Chapter 6

OUTPUT

1. Output obtained by using input "durga enterprises"

Figure 6 Example - I

2. Output obtained by using input "hamro lagani"

Figure 7 Example - II

3. Output obtained by using input "jagadamba steels"

Figure 8 Example - III

4. Output obtained by using input "nawayug vidhya niketan kanchanpur"

Figure 9 Example - IV


5. Output obtained by using input "nepal investment company"

Figure 10 Example - V

6. Output obtained by using input "nepal one travels and tour"

Figure 11 Example - VI

7. Output obtained by using input "new age business consultant"

Figure 12 Example - VII


Chapter 7

RESULT AND ANALYSIS

To obtain the similarity scores, we tried various similarity measuring algorithms. The Levenshtein algorithm and the Hungarian algorithm together with the Sorensen algorithm best fit our needs. We applied various processes before these algorithms, which proved to be fruitful. The scores obtained are saved in a file with the .csv extension. Stemming was used to obtain dictionary-based root words. Tokenization and transliteration were used to obtain the tokens later used in the comparison process. We used translation and transliteration to cope with Nepali words. The accuracy was assessed by trying different names that could be used in reality.

The computation time depends upon the number of tokens to be compared; for now, the system is single-threaded.

Number of Tokens VS Computation Time, without transformation (time to compute in sec)

Input                                               I5 CPU    Dual Core CPU
1 token (Durga Enterprises)                         1.179     5.384
2 tokens (jagadamba steels pvt. ltd)                1.434     7.316
3 tokens (New Age Business Consultant Limited)      2.395     22.743
4 tokens (Nepal One travels and tours Ltd.)         5.939     53.785

Figure 13 Computation Time without Transformation

Figure 13 shows the relation between the number of tokens and the time to compute similarity scores on different generations of Intel processors. The computation time is higher on the older-generation processor and lower on the newer one. Furthermore, the more tokens there are, the greater the computation time. This result was obtained without the transformation process.

Figure 14 Time Computation with Transformation

Figure 14 shows the result obtained when the transformation process is used. Using transformation takes more time, but it yields better results. With appropriate hardware resources, this time can be brought within the required limit.

For the comparison process, we initially used the cosine similarity algorithm, but it did not yield promising results. Cosine similarity does not consider the relative position of the letters in a string; it only considers how often each letter occurs. Thus a string with a different spelling but the same letter counts is considered similar. For example, "listen" and "silent" contain exactly the same letters and would receive the maximum score. This severely limited its use.

The Levenshtein algorithm proved useful in our project. It considers the position of the letters in a string, which is necessary for our system. This algorithm, along with the Hungarian algorithm, gave satisfactory results. To obtain the final score we used the Sorensen coefficient. Its value lies in the range [0, 1]. Multiplying this coefficient by 100 gave us the final percentage score.

Number of Tokens VS Computation Time, with transformation (time to compute in sec)

Input                                               I5 CPU    Dual Core CPU
1 token (Durga Enterprises)                         1.664     8.959
2 tokens (jagadamba steels pvt. ltd)                2.204     13.315
3 tokens (New Age Business Consultant Limited)      11.952    39.994
4 tokens (Nepal One travels and tours Ltd.)         37.743    107.498

Chapter 8

CONCLUSION AND FURTHER ENHANCEMENT

8.1 Conclusion

With all the accumulated effort invested in this project, there are reasons to believe that at the end of this semester this project will find itself in much better shape and much closer to actual acceptance than it was. We summarize the progress with respect to the main objectives of the project, namely, accuracy and speed.

Accuracy: This is the main obstacle for the project. We have been constantly using and testing many different algorithms for similarity comparison. We have been able to get satisfactory results using the Levenshtein distance and the Hungarian method in conjunction with the Sorensen coefficient. We are further trying to improve the results by employing other algorithms, such as phonetic encoding (Double Metaphone), and by using the transformation function.

Speed: Speed is also a challenging factor for this project. The requirement for a shorter processing time has made it difficult to balance accuracy and speed. However, by using the processing capability of MySQL, we have been able to improve the speed, resulting in a shorter waiting time for the users. The use of adequate data structures has been a notable advantage.

Let us remark that one of the apparent major obstacles to gaining acceptance for this project lies in the standards of the Office of Company Registrar.

8.2 Limitations

Our system has the following limitations.

1. The system cannot process names that have numbers as a prefix or suffix.

2. The Preprocessing Engine has several limitations. Stemming sometimes produces incorrect results if the input is a Nepali word. E.g. "Spat" in Nepali ("Steel" in English) may result in "spit" due to morphology-based stemming. In such cases, the similarity score is reduced.

3. The English-Nepali dictionary does not contain enough words. There are many English words for which no Nepali word is available.

4. The transformation process results in more computational time.

5. Synonyms are not considered in the system.

6. Strings such as "papermill" and "paper mill", though similar, are considered different because of the space. The space results in two tokens. Although both strings have the same meaning, they are not considered similar by our system.

8.3 Further Enhancement

There is great opportunity to enhance this project in the future. The similarity checking algorithm has the greatest potential for enhancement. If phonetic similarity measures are incorporated, accuracy can be greatly improved. Implementing faster searching methods can greatly enhance the performance of the system.

Using a taxonomy to classify the tokens, together with the similarity measures, can help validate proposed names accurately. A taxonomy can classify the context of names and thus improve the validation process.

Furthermore, using some weighting measure to assign weights to the most common words might help increase the accuracy of the similarity score.


REFERENCE

[1] Office of Company Registrar, Nepal. Retrieved from: www.ocr.gov.np. Date Retrieved:

07/04/2013

[2] Companies House. Retrieved from:

http://wck2.companieshouse.gov.uk//wcframe?name =accessCompanyInfo. Date

Retrieved : 04/07/2013

[3] Companies and Intellectual Property Commission. Retrieved from:

http://www.cipc.co.za/.

Date Retrieved: 04/07/2013

[4] Anne Kao and Stephen R. Poteet (Eds). Natural Language Processing and Text Mining.

Springer 2006

[5] Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online

Applications. In Prof. Ruslan Mitkov, editor. John Benjamins Publishing Company, 2002

[6] Ronan Collobert, Jason Weston, Leon Bottou, et al. Natural Language Processing

(Almost) from Scratch. Editor. Michael Collins. NEC Laboratories America, 4

Independence Way, Princeton, NJ 08540

[7] Prakash M Nadkarni, Lucila Ohno-Machado, Wendy W Chapman. Natural language

processing: an introduction. Available from: group.bmj.com

[8] Chris Manning, Hinrich Schütze. Foundations of Statistical Natural Language

Processing. MIT Press. Cambridge, MA: May 1999. Available from:

http://nlp.stanford.edu/fsnlp/

[9] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI

2001. Available from http://www.ebooksdirectory.com/details.php?ebook=6774

[10] Daniël de Kok, Harm Brouwer. Natural Language Processing for the Working

Programmer. 2011. Available from : http://nlpwp.org/book/

[11] Aliseda, R. van Glabbeek, D. Westerstahl. Computing Natural Language. CSLI

1998. Available from: http://www.e-booksdirectory.com/details.php?ebook=3940

[12] Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with

Python.

O'Reilly Media 2009. Available from:

http://www.ebooksdirectory.com/details.php?ebook=7184


[13] Rob Malouf, Miles Osborne. An Introduction to Stochastic Attribute-Value

Grammars. ESSLLI 2001.Available from:

http://www.e-booksdirectory.com/details.php?ebook=6860

[14] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI

2001.Available from: http://www.e-booksdirectory.com/details.php?ebook=6774

[15] Grosz, B.J., Jones, K.S., Webber, B.L. Readings in Natural Language Processing. Morgan Kaufmann Publishers Inc., Los Altos, CA. Available from: http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=6537037

[16] Reilly, Ronan G. (Ed); Sharkey, Noel E. (Ed). Connectionist approaches to natural

language processing. Hillsdale, NJ, England: Lawrence Erlbaum Associates, Inc. 1992.

Available from: http://psycnet.apa.org/psycinfo/1992-98664-000

[17] C. Friedman and R. Sideli. Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25:486-509, 1992.

[18] F. Patman and P. Thompson. Names: A new frontier in text mining. In ISI-2003, Springer LNCS 2665, pages 27-38.

[19] Simon J. Greenhill. Computational Linguistics Volume 37 Issue 4, December 2011,

pages 689-698.

[20] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of ACM SIGKDD, pages 39-48, Washington DC, 2003.

[21] C. L. Borgman and S. L. Siegfried. Getty's Synoname™ and its cousins: A survey of applications of personal name-matching algorithms. Journal of the American Society for Information Science, 43(7):459-476, 1992.

[23] P. Christen, T. Churches, and M. Hegland. Febrl - a parallel open source data linkage system. In PAKDD, Springer LNAI 3056, pages 638-647, Sydney, 2004.

[25] P. Christen and K. Goiser. Quality and complexity measures for data linkage and

deduplication. In F. Guillet and H. Hamilton, editors, Quality Measures in Data Mining,

Studies in Computational Intelligence. Springer, 2006.

[26] P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381-402, 1980. P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software: Practice and Experience, 26(12):1439-1458, 1996.

[27] R. Gong and T. K. Chan. Syllable alignment: A novel model for phonetic string search. IEICE Transactions on Information and Systems, E89-D(1):332-339, 2006.

[28] F. J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171-176, 1964.

[29] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001.

APPENDIX A: Gantt chart

Figure 15 Gantt Chart

APPENDIX B: Use Case

Figure 16 Use Case Diagram

APPENDIX C: Preprocessing Detail Example

For User Input: Nepal Metals Industries

Preprocessing Engine

Methodology | Example | Process
Downcasting | nepal metals industries | Conversion of input to lowercase.
Transformation | Not applied in this example | Conversion of British English words to American English.
Stopword Removal | nepal metals | Removal of stop words such as Company, Industry, and Pvt. Ltd., as mentioned in the draft.
Tokenization | [nepal, metals] | Extraction of tokens.
Stemming | [nepal, metal] | Reduction to root words.
Translation | [nepal, metal] to [नेपाल, धातु] | Conversion of tokens from English to Nepali.
Transliteration | [नेपाल, धातु] to [nepal, dhatu] | Conversion of Nepali Unicode to Roman script.

Generated Keywords (using transliterated tokens + stemmed tokens): [nepal, metal, dhatu]

Query to MySQL database, resulting in a list of company names.
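The user-input steps above can be sketched in Python as follows. This is only an illustrative sketch, not the project's implementation: the stop-word list, the British-to-American map, the one-line stemmer and the TRANSLITERATED dictionary (which collapses translation and transliteration into a single lookup) are all hypothetical stand-ins, and tokenization is done before stop-word removal purely to keep the code short.

```python
import re

# Hypothetical, deliberately tiny resources used only for this illustration.
STOP_WORDS = {"company", "industry", "industries", "pvt", "ltd", "private", "limited"}
BRITISH_TO_AMERICAN = {"colour": "color", "centre": "center"}
# English token -> romanized Nepali equivalent (translation + transliteration in one step).
TRANSLITERATED = {"metal": "dhatu"}


def stem(token):
    """Very naive stemmer (assumption): strip a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def generate_keywords(name, translate=True):
    """Return the keyword list for a company name."""
    lowered = name.lower()                                     # downcasting
    tokens = re.findall(r"[a-z]+", lowered)                    # tokenization
    tokens = [BRITISH_TO_AMERICAN.get(t, t) for t in tokens]   # transformation
    tokens = [t for t in tokens if t not in STOP_WORDS]        # stop-word removal
    keywords = [stem(t) for t in tokens]                       # stemming
    if translate:
        # Append the transliterated form of every stem that has one.
        for token in list(keywords):
            alt = TRANSLITERATED.get(token)
            if alt and alt not in keywords:
                keywords.append(alt)
    return keywords


print(generate_keywords("Nepal Metals Industries"))
print(generate_keywords("Royal Metal Nepal Pvt.Ltd.", translate=False))
```

With these toy resources the first call yields the same three keywords as the example above; names coming from the database are processed with translate=False because the appendix applies translation and transliteration to the user input only.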

Company Name Extraction from Query

Example (randomly chosen): Royal Metal Nepal Pvt.Ltd.

Preprocessing Engine

Methodology | Example | Process
Downcasting | royal metal nepal pvt.ltd. | Conversion of input to lowercase.
Stopword Removal | royal metal nepal | Removal of stop words such as Company, Industry, and Pvt. Ltd., as mentioned in the draft.
Tokenization | [royal, metal, nepal] | Extraction of tokens.
Stemming | [royal, metal, nepal] | Reduction to root words.

Database Generated Keywords: [royal, metal, nepal]

Comparison-1 (User Input Generated Keywords & Database Generated Keywords)

Company Name Extraction from Query

Example (randomly chosen): Nepal Dhatu Industries

Preprocessing Engine

Methodology | Example | Process
Downcasting | nepal dhatu industries | Conversion of input to lowercase.
Stopword Removal | nepal dhatu | Removal of stop words such as Company, Industry, and Pvt. Ltd., as mentioned in the draft.
Tokenization | [nepal, dhatu] | Extraction of tokens.
Stemming | [nepal, dhatu] | Reduction to root words.

Database Generated Keywords: [nepal, dhatu]

Comparison-2 (User Input Generated Keywords & Database Generated Keywords)
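To make the two comparisons concrete, the sketch below computes the Levenshtein distance between every user-input keyword and every database keyword. It assumes that the comparison works on such a pairwise distance matrix, as Appendix G suggests; the function names levenshtein and distance_matrix are illustrative and not taken from the project code.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute = 1 each)."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def distance_matrix(user_keywords, db_keywords):
    """Rows follow the user keywords, columns follow the database keywords."""
    return [[levenshtein(u, d) for d in db_keywords] for u in user_keywords]


user = ["nepal", "metal", "dhatu"]
for row in distance_matrix(user, ["royal", "metal", "nepal"]):   # Comparison-1
    print(row)
for row in distance_matrix(user, ["nepal", "dhatu"]):            # Comparison-2
    print(row)
```

A zero in the matrix marks an exact keyword match, so Comparison-1 shares "nepal" and "metal" with the user input, while Comparison-2 shares "nepal" and the transliterated keyword "dhatu".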

APPENDIX D: Comparison Detail

Figure 17 Comparison I (Part A)

Figure 18 Comparison I (Part B)

Figure 19 Comparison II (Part A)

Figure 20 Comparison II (Part B)

APPENDIX E: Output Screenshot

Figure 21 Output Screenshot

APPENDIX F: Data Flow Diagram

Figure 22 Data Flow Diagram

APPENDIX G: Theory

Hungarian Algorithm

The Hungarian method assigns jobs to machines by one-to-one matching so as to find the lowest-cost solution. Each job must be assigned to exactly one machine. It is assumed that every machine is capable of handling every job, and that the cost or value associated with each assignment combination is known and fixed. The number of rows and columns must be the same. Here the rows and columns correspond to the two sets of strings being compared, and the cost is the Levenshtein distance between each pair (step 1). The algorithm is as follows.

1. Arrange the information in a matrix, with String 1 down the left side, String 2 along the top, and the Levenshtein distance for each pair in the cells.

2. Ensure that the matrix is square by adding dummy rows or columns if necessary. Conventionally, each element in a dummy row or column is set to the largest number in the matrix.

3. Reduce the rows by subtracting the minimum value of each row from every element of that row.

4. Reduce the columns by subtracting the minimum value of each column from every element of that column.

5. Cover the zero elements with the minimum possible number of horizontal and vertical lines. If the number of lines equals the number of rows, go to step 9.

6. Add the minimum uncovered element to every covered element. If an element is covered twice, add the minimum uncovered element to it twice.

7. Subtract the same minimum element (the one found in step 6) from every element in the matrix.

8. Cover the zero elements again. If the number of lines covering the zero elements is not equal to the number of rows, return to step 6.

9. Select a matching by choosing a set of zeros such that each row and each column has exactly one selected zero.

10. Apply the matching to the original matrix, disregarding dummy rows and columns.
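The procedure above can be illustrated with a short Python sketch. Note the assumptions: instead of drawing covering lines, the code uses the equivalent row and column potentials formulation of the Hungarian method, which reaches the same minimum-cost matching; the function name solve_assignment is hypothetical; and the example matrix holds Levenshtein distances between the keywords [nepal, metal, dhatu] and [royal, metal, nepal], worked out by hand for this illustration. Padding with the largest value follows step 2, and dummy rows and columns are dropped from the result as in step 10.

```python
def solve_assignment(cost):
    """Minimum-cost one-to-one matching; returns (pairs, total_cost)."""
    rows, cols = len(cost), len(cost[0])
    n = max(rows, cols)
    pad = max(max(r) for r in cost)              # step 2: dummy cells take the largest value
    sq = [[cost[i][j] if i < rows and j < cols else pad for j in range(n)]
          for i in range(n)]
    inf = float("inf")
    u = [0] * (n + 1)       # row potentials (play the role of row reduction)
    v = [0] * (n + 1)       # column potentials (play the role of column reduction)
    p = [0] * (n + 1)       # p[j] = row currently matched to column j (1-based, 0 = none)
    way = [0] * (n + 1)     # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = sq[i0 - 1][j - 1] - u[i0] - v[j]   # reduced cost
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:          # reached a free column: augment
                break
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    # Step 10: keep only pairs that lie inside the original matrix.
    pairs = sorted((p[j] - 1, j - 1) for j in range(1, n + 1)
                   if p[j] - 1 < rows and j - 1 < cols)
    return pairs, sum(cost[i][j] for i, j in pairs)


distances = [[3, 2, 0],    # nepal vs royal, metal, nepal
             [3, 0, 2],    # metal vs royal, metal, nepal
             [5, 5, 5]]    # dhatu vs royal, metal, nepal
print(solve_assignment(distances))
```

For this matrix the matching pairs "nepal" with "nepal" and "metal" with "metal" at zero cost, leaving "dhatu" to be matched with "royal" at cost 5.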

Procedure of Metaphone Phonetic Algorithm

Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code. The following list summarizes most of the rules in the original implementation:

1. Drop duplicate adjacent letters, except for C.

2. If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.

3. Drop 'B' if after 'M' at the end of the word.

4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.

5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.

6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if

followed by 'N' or 'NED' and is at the end.

7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G'

transforms to 'K'.

8. Drop 'H' if after vowel and not before a vowel.

9. 'CK' transforms to 'K'.

10. 'PH' transforms to 'F'.

11. 'Q' transforms to 'K'.

12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.

13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if

followed by 'CH'.

14. 'V' transforms to 'F'.

15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.

16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.

17. Drop 'Y' if not followed by a vowel.

18. 'Z' transforms to 'S'.

19. Drop all vowels, except a vowel at the beginning of the word.
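The rules above can be turned into a compact Python sketch. It is a simplified reading of the list, not a reference Metaphone implementation: the function name metaphone_sketch is hypothetical, and two details the list leaves implicit are assumptions made here, namely that an 'H' directly after 'C', 'G', 'P', 'S' or 'T' (or after an initial 'W') has already been consumed by the preceding rule, and that the 'G' in 'DGE', 'DGY', 'DGI' is covered by 'D' becoming 'J'.

```python
VOWELS = set("AEIOU")
FRONT = set("EIY")             # letters that soften a preceding C or G
DIGRAPH_LEAD = set("CGPST")    # assumption: an H right after these is already consumed


def metaphone_sketch(word):
    raw = [ch for ch in word.upper() if "A" <= ch <= "Z"]
    # Rule 1: collapse adjacent duplicates (except C); remember doubled letters for rule 7.
    letters, doubled = [], []
    for ch in raw:
        if letters and letters[-1] == ch and ch != "C":
            doubled[-1] = True
        else:
            letters.append(ch)
            doubled.append(False)
    w = "".join(letters)
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):      # rule 2
        w, doubled = w[1:], doubled[1:]
    if w.endswith("MB"):                              # rule 3
        w, doubled = w[:-1], doubled[:-1]

    out = []
    for i, ch in enumerate(w):
        prev = w[i - 1] if i > 0 else ""
        nxt, after, third = w[i + 1:i + 2], w[i + 1:i + 3], w[i + 2:i + 3]
        if ch in VOWELS:
            code = ch if i == 0 else ""                               # rule 19
        elif ch == "C":                                               # rules 4, 9
            if after == "IA" or (nxt == "H" and prev != "S"):
                code = "X"
            else:
                code = "S" if nxt in FRONT else "K"
        elif ch == "D":                                               # rule 5
            code = "J" if after in ("GE", "GY", "GI") else "T"
        elif ch == "G":                                               # rules 6, 7
            if prev == "D" and nxt in FRONT:
                code = ""                      # assumption: covered by D -> J
            elif nxt == "H" and third and third not in VOWELS:
                code = ""
            elif w[i + 1:] in ("N", "NED"):
                code = ""
            else:
                code = "J" if nxt in FRONT and not doubled[i] else "K"
        elif ch == "H":                                               # rule 8
            if prev in DIGRAPH_LEAD or (i == 1 and prev == "W"):
                code = ""
            else:
                code = "" if prev in VOWELS and nxt not in VOWELS else "H"
        elif ch == "K":
            code = "" if prev == "C" else "K"                         # rule 9
        elif ch == "P":
            code = "F" if nxt == "H" else "P"                         # rule 10
        elif ch == "Q":
            code = "K"                                                # rule 11
        elif ch == "S":                                               # rule 12
            code = "X" if nxt == "H" or after in ("IO", "IA") else "S"
        elif ch == "T":                                               # rule 13
            if after in ("IA", "IO"):
                code = "X"
            elif nxt == "H":
                code = "0"
            else:
                code = "" if after == "CH" else "T"
        elif ch == "V":
            code = "F"                                                # rule 14
        elif ch == "W":                                               # rule 15
            code = "W" if nxt in VOWELS or (i == 0 and nxt == "H") else ""
        elif ch == "X":
            code = "S" if i == 0 else "KS"                            # rule 16
        elif ch == "Y":
            code = "Y" if nxt in VOWELS else ""                       # rule 17
        elif ch == "Z":
            code = "S"                                                # rule 18
        else:
            code = ch                          # B, F, J, L, M, N, R keep their own symbol
        out.append(code)
    return "".join(out)


for name in ("Metal", "Mettle", "Knight", "Phone", "School"):
    print(name, metaphone_sketch(name))
```

Under this sketch "Metal" and "Mettle" receive the same code, which is the property that makes a phonetic key useful for spotting company names that are spelled differently but sound alike.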
