Final Report 3

  • TRIBHUVAN UNIVERSITY

    INSTITUTE OF ENGINEERING

    PULCHOWK CAMPUS

    A

    FINAL YEAR PROJECT REPORT

    ON

    Name Conflict Resolution for Company Registration

    By:

    Gaurav Kumar Goyal (16214)

    Janardan Chaudhary (16216)

    Nimesh Mishra (16221)

    Sanat Maharjan (16230)

    A PROJECT SUBMITTED TO THE DEPARTMENT OF ELECTRONICS

    AND COMPUTER ENGINEERING IN PARTIAL FULFILMENT OF

    THE REQUIREMENT FOR THE BACHELOR'S DEGREE IN COMPUTER ENGINEERING

    DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

    LALITPUR, NEPAL

    AUGUST, 2013

  • i

    INSTITUTE OF ENGINEERING

    PULCHOWK CAMPUS

    DEPARTMENT OF ELECTRONICS AND COMPUTER ENGINEERING

    The undersigned certify that they have read, and recommended to the Institute of Engineering
    for final submission and presentation, the project entitled "Name Conflict Resolution for
    Company Registration" submitted by Gaurav Kumar Goyal, Janardan Chaudhary, Nimesh
    Mishra and Sanat Maharjan in partial fulfilment of the requirements for the Bachelor's
    degree in Computer Engineering.

    _________________________________________________

    Supervisor, Prof. Dr. Shashidhar Ram Joshi

    Department of Electronics and Computer Engineering

    _________________________________________________

    Co-Supervisor, Er. Sansar Jung Dewan

    IT Officer, Office of Company Registrar (OCR)

    __________________________________________________

    Internal Examiner, Baburam Dawadi

    Department of Electronics and Computer Engineering

    __________________________________________________

    External Examiner, Anjesh Tuladhar

    COO, Young Innovations Pvt. Ltd.

    DATE OF APPROVAL: 25 Aug. 2013

  • ii

    COPYRIGHT

    The authors have agreed that the Library, Department of Electronics and Computer
    Engineering, Pulchowk Campus, Institute of Engineering may make this report freely
    available for inspection. Moreover, the authors have agreed that permission for extensive
    copying of this project report for scholarly purposes may be granted by the supervisors who
    supervised the project work recorded herein or, in their absence, by the Head of the
    Department wherein the project report was done. It is understood that recognition will be
    given to the authors of this report and to the Department of Electronics and Computer
    Engineering, Pulchowk Campus, Institute of Engineering in any use of the material of this
    project report. Copying, publication, or other use of this report for financial gain without
    the approval of the Department of Electronics and Computer Engineering, Pulchowk Campus,
    Institute of Engineering and the authors' written permission is prohibited.

    Request for permission to copy or to make any other use of the material in this report in

    whole or in part should be addressed to:

    Arun Timilsina, PhD/ Professor

    Head of Department

    Department of Electronics and Computer Engineering

    Pulchowk Campus, Institute of Engineering

    Lalitpur, Kathmandu

    Nepal

  • iii

    ACKNOWLEDGEMENT

    First of all, we would like to express our sincere gratitude to the Department of Electronics
    and Computer Engineering, Pulchowk Campus, for including the final year major project in
    the syllabus of the final year of B.E. in Computer Engineering. We would like to extend our
    gratitude to Dr. Arun Timilsina, Head of Department, Electronics and Computer Engineering,
    for assisting us in our project.

    We would like to express our gratitude to Prof. Dr. Shashidhar Ram Joshi for being our
    project supervisor.

    We would also like to thank Dr. Aman Shakya for his support and assistance. We are deeply
    indebted to Er. Sansar Jung Dewan of the Office of Company Registrar, and to the Office of
    Company Registrar itself, for giving us the opportunity to work on a project of such wide
    scope.

    We would also like to express our sincere thanks to Mr. Bal Krishna Bal, Assistant Professor,
    Department of Electronics and Computer Engineering, Kathmandu University, for his help
    and support.

    Last but not least, we would like to thank our friends and classmates for their help and
    valuable suggestions.

  • iv

    ABSTRACT

    Natural language processing is one of the most researched fields. One of its applications is
    determining the similarity of sentences. Naming conflict resolution is about comparing
    words. Many systems have been developed for this purpose and are widely used.

    In the context of Nepal, naming conflicts during the registration of a company are currently
    resolved manually. However, there is a need to automate the process. Automation requires
    natural language processing, translation between languages, and transliteration between
    languages. The Office of Company Registrar (OCR) has defined several constraints for the
    check, and these constraints must be considered while comparing words. The words need to
    be tokenized and stemmed before they can be processed further.

    Keywords:

    OCR, Morphological Analysis, Similarity Matching, Natural Language Processing.

  • v

    TABLE OF CONTENT

    COPYRIGHT .................................................................................................................... ii

    ACKNOWLEDGEMENT ................................................................................................ iii

    ABSTRACT ..................................................................................................................... iv

    TABLE OF CONTENT ..................................................................................................... v

    TABLE OF FIGURES .................................................................................................... viii

    Chapter 1 ........................................................................................................................... 2

    INTRODUCTION .......................................................................................................................... 2

    1.1 Background....................................................................................................................... 2

    1.2 Motivation ........................................................................................................................ 3

    1.3 Problem Statement ........................................................................................................... 3

    1.4 Objectives ......................................................................................................................... 4

    1.5 Scope of the work ............................................................................................................. 4

    Chapter 2 ........................................................................................................................... 5

    LITERATURE REVIEW .................................................................................................................. 5

    2.1 Introduction...................................................................................................................... 5

    2.2 Common processes used in text similarity......................................................................... 5

    2.2.1 Downcasting .............................................................................................................. 5

    2.2.2 Transformation .......................................................................................................... 5

    2.2.3 Stopword Removal ..................................................................................................... 5

    2.2.4 Tokenization .............................................................................................................. 6

    2.2.5 Stemming .................................................................................................................. 6

    2.3 Existing Name checking Systems ....................................................................................... 6

    2.4 Criteria defined by OCR ..................................................................................................... 7

    2.5 Matching Techniques ........................................................................................................ 8

    2.5.1 Phonetic encoding ..................................................................................................... 8

    2.5.1.1 Soundex .............................................................................................................. 8

    2.5.1.2 Metaphone ......................................................................................................... 9

    2.5.2 Pattern matching ....................................................................................................... 9

    2.5.2.1 Levenshtein or Edit Distance.............................................................................. 10

    2.5.2.2 Sorenson similarity ............................................................................................ 10

    2.5.2.3 Cosine Similarity ................................................................................................ 11

    2.6 Summary ........................................................................................................................ 11

    Chapter 3 ......................................................................................................................... 12

    REQUIREMENT ANALYSIS ......................................................................................................... 12

  • vi

    3.1 Functional Requirements ................................................................................................ 12

    3.2 Non-Functional Requirements ........................................................................................ 12

    3.2.1 Reliability ................................................................................................................. 12

    3.2.2 Performance ............................................................................................................ 12

    3.2.3 Accuracy .................................................................................................................. 13

    Chapter 4 ......................................................................................................................... 14

    METHODOLOGY ....................................................................................................................... 14

    4.1 Introduction.................................................................................................................... 14

    4.2 System Design ................................................................................................................ 15

    4.2.1 Flow Diagram ........................................................................................................... 16

    4.2.2 Deployment Diagram ............................................................................................... 17

    4.2.3 System Architecture ................................................................................................. 18

    4.2.3.1 Preprocessing Engine ........................................................................................ 19

    4.2.3.2 Translation and Transliteration .......................................................................... 20

    4.2.3.3 Possible Keyword Generation ............................................................................ 21

    4.2.3.5 Ranking ............................................................................................................. 22

    4.2.4 Detailed Class Diagram ............................................................................................. 20

    4.3 Project Tools ................................................................................................................... 23

    4.4 Eclipse as Programming IDE ............................................................................................ 23

    4.5 MySQL as Database System ............................................................................................ 23

    Chapter 5 ......................................................................................................................... 24

    EXPERIMENTAL SETUP .............................................................................................................. 24

    Chapter 6 ......................................................................................................................... 25

    OUTPUT ................................................................................................................................... 25

    Chapter 7 ......................................................................................................................... 27

    RESULT AND ANALYSIS ............................................................................................................. 27

    Chapter 8 ......................................................................................................................... 29

    CONCLUSION AND FURTHER ENHANCEMENT ........................................................................... 29

    7.1 Conclusion ...................................................................................................................... 29

    7.2 Limitations ...................................................................................................................... 29

    7.3 Further Enhancement ..................................................................................................... 30

    REFERENCE .................................................................................................................. 31

    APPENDIX A: Gantt chart .............................................................................................. 34

    APPENDIX B: Use Case ................................................................................................. 35

    APPENDIX C: Preprocessing Detail Example ................................................................. 36

  • vii

    APPENDIX D: Comparison Detail .................................................................................. 37

    APPENDIX E: Output Screenshot ................................................................................... 41

    APPENDIX F: Data Flow Diagram ................................................................................. 42

    APPENDIX G: Theory .................................................................................................... 43

  • viii

    TABLE OF FIGURES

    Figure 1 Flow Chart ...................................................................................................................... 16

    Figure 2 Deployment Diagram ...................................................................................................... 17

    Figure 3 System Architecture........................................................................................................ 18

    Figure 4 Preprocessing Engine ...................................................................................................... 19

    Figure 5 Detailed Class Diagram ................................................................................................... 20

    Figure 6 Example - I ...................................................................................................................... 25

    Figure 7 Example - II ..................................................................................................................... 25

    Figure 8 Example- III ..................................................................................................................... 25

    Figure 9 Example - IV .................................................................................................................... 25

    Figure 10 Example- V .................................................................................................................... 26

    Figure 11 Example - VI .................................................................................................................. 26

    Figure 12 Example - VII ................................................................................................................. 26

    Figure 13 Computation Time with Transformation ....................................................................... 27

    Figure 14 Time Computation with Transformation ....................................................................... 28

    Figure 15 Gantt Chart ................................................................................................................... 34

    Figure 16 Use Case Diagram ........................................................................................................ 35

    Figure 17 Comparison I (Part A) .................................................................................................. 37

    Figure 18 Comparison I (Part B)................................................................................................... 38

    Figure 19 Comparison II (Part A) ................................................................................................. 39

    Figure 20 Comparison II (Part B) ................................................................................................. 40

    Figure 21 Output Screenshot ........................................................................................................ 41

    Figure 22 Data Flow Diagram........................................................................................................ 42

  • 2

    Chapter 1

    INTRODUCTION

    1.1 Background

    Understanding language as a unit in machine terms is not as easy as is often thought.
    Words are perhaps the most intuitive units of language, yet they are in general tricky to
    define. In most languages, words are defined as the smallest linguistic units that can form a
    complete utterance by themselves. Natural language processing deals with the ambiguity in
    word processing.

    The Office of Company Registrar (OCR) is responsible for maintaining law and order
    regarding different companies. Almost all of the office's daily tasks used to be manual; the
    OCR has now moved towards automating tasks using computerized systems. Before the
    advent of the current online system, the processes relating to the change, admission, and
    removal of company names were difficult and cumbersome. Even after the recent
    development of the office's online system, the system is not intelligent enough. Currently
    the OCR has implemented database entity comparison features. The process of finding
    company names is often based on English names. The comparison features are, however,
    limited to entity-to-entity matching and phonetic matching. The existing system often fails
    to act responsively and accurately during the registration of a new company, and it is
    severely limited by the above-mentioned comparison method. The same problem arises
    when a new company tries to reserve its name.

    The naming conflict resolution system for company registration finds the similarity
    between the proposed name of a company and the existing company names in the database.
    This requires some natural language processing techniques. First, the input is downcast
    and stop words are removed from the proposed name. The name is then transformed,
    tokenized, and stemmed to determine the root words used in similarity checking. These
    words are then used to form probable tokens through translation and transliteration. The
    resulting names are matched with words from the database to form a ranking of similar
    names.

    The system needs to translate Nepali words to English and vice versa. The translation is
    done with the help of dictionaries. Stop-word removal requires a pool of predefined words
    to be removed. The constraints are defined by the Office of Company Registrar. These
    constraints include the use of plural words, case sensitivity, punctuation and

  • 3

    spacing in the names, use of numbers, different phonetic spellings or spelling variations,
    and many others. The system will also assist in the decision-making process of whether or
    not to approve the proposed name. This system will result in efficient processing and faster
    registration of names.

    1.2 Motivation

    Almost all of the office's daily tasks used to be done manually, but the OCR has now moved
    towards automating tasks using computerized systems. Before the advent of the current
    online system, the processes relating to the change, admission, and removal of company
    names were difficult and cumbersome. Even after the recent development of the office's
    online system, the system is not intelligent enough. Currently the Office of Company
    Registrar (OCR) has implemented database entity comparison features. The process of
    finding company names is often based on English names. The comparison features are,
    however, limited to entity-to-entity matching and phonetic matching. The existing system
    often fails to act responsively and accurately during the registration of a new company, and
    it is severely limited by the above-mentioned comparison method.

    These limitations of the current system motivated us to develop a more reliable and accurate
    system based on string matching algorithms, which produce more accurate results than the
    phonetic string matching approach currently used.

    1.3 Problem Statement

    A recent improvement in the registration of new companies is the addition of the online
    registration and name checking system. However, the current name checking system suffers
    from a lack of accuracy and from the drawbacks of matching names by their phonetic
    pronunciation.

    In this project, we try to build a system that checks the validity of proposed names by using
    string matching schemes rather than phonetic ones. Our objective is to determine the extent
    to which the proposed name is similar to existing names and, based on this, to determine
    whether the name is available for registration.

  • 4

    1.4 Objectives

    The main objective of the project is to develop a system capable of checking the similarity
    of proposed company names with registered ones. The objectives can be further broken
    down as:

    1. To develop a system to resolve naming conflicts.

    2. To find names similar to the name proposed by the user.

    3. To rank the existing names that match the proposed name.

    4. To define the threshold level used to validate a name.

    1.5 Scope of the work

    Name checking systems are used in many countries to check the proposed name of a
    company. A variety of approaches are available for developing such a system. The approach
    used here is an NLP approach. The system is intended to check the proposed name with
    better accuracy than the current system, which will benefit both the clients and the OCR.
    It is based on research along with the study and analysis of existing systems. The system
    will produce its output as a .csv file containing the similarity scores of various names with
    the proposed name.

  • 5

    Chapter 2

    LITERATURE REVIEW

    2.1 Introduction

    This project is about checking the validity of proposed company names for the Office of
    Company Registrar. An important step in developing such a system is to examine the
    relevant research areas thoroughly. It is important to know about Natural Language
    Processing in order to understand the processes used in this project. Existing systems were
    also studied thoroughly in designing this system.

    Natural Language Processing (NLP) is a branch of computer science that deals with natural
    language information. NLP is a component of artificial intelligence. NLP is a form of
    human-to-computer interaction where the elements of human language, be it spoken or
    written, are formalized so that a computer can perform value-adding tasks based on that
    interaction. Human language is dauntingly complex for a computer to understand. NLP is
    used in various areas like language translation, speech processing, checking for grammatical
    errors, etc.

    2.2 Common processes used in text similarity

    It is always useful to know about different types of processes used for NLP. Some of the

    common processes are mentioned below:

    2.2.1 Downcasting

    Downcasting, as used here, is the act of converting text from uppercase letters to
    lowercase. It ensures that two company names are not treated as distinct merely because
    of differences in letter case.

    2.2.2 Transformation

    Transformation is the conversion of words from British English spelling to American
    English spelling. It is done to avoid generating unwanted or conflicting keywords.

    2.2.3 Stopword Removal

    Stop word removal is the process of removing predefined stop words from the string. We
    used this process to remove the words that the directives of the Office of Company
    Registrar treat as similar or unimportant.

  • 6

    2.2.4 Tokenization

    Tokenization is the process of breaking up a string into tokens to be indexed, either using
    predefined dictionaries or by analyzing the whitespace. These dictionaries can be a pool
    of predefined words or a bilingual English-Nepali dictionary.

    2.2.5 Stemming

    Stemming is the process of reducing a word, such as one in plural form, to its root or a
    simpler form. Stemming is often used in text processing applications. There are many
    different approaches to stemming, each with its own design goals. Some are aggressive,
    reducing words to the smallest root possible.
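The five processes above can be chained into a single preprocessing pass. The following is only a minimal sketch under stated assumptions: the word lists `BRITISH_TO_AMERICAN` and `STOP_WORDS` are small hypothetical samples (the real lists would come from a spelling dictionary and the OCR directives), `naive_stem` is a crude plural stripper standing in for a real stemmer, and the order of the steps is simplified so that tokenization happens right after downcasting.

```python
import re
from typing import List

# Hypothetical sample data, for illustration only.
BRITISH_TO_AMERICAN = {"colour": "color", "centre": "center", "jewellery": "jewelry"}
STOP_WORDS = {"private", "pvt", "p", "limited", "ltd", "company", "co", "the", "new"}


def naive_stem(token: str) -> str:
    """Strip a simple plural ending; a crude stand-in for a real stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token


def preprocess(name: str) -> List[str]:
    """Return the root tokens of a proposed company name."""
    lowered = name.lower()                                    # downcasting
    tokens = re.findall(r"[a-z0-9]+", lowered)                # tokenization (drops punctuation)
    tokens = [BRITISH_TO_AMERICAN.get(t, t) for t in tokens]  # transformation
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop word removal
    return [naive_stem(t) for t in tokens]                    # stemming


print(preprocess("New Colour Centre Travels Pvt. Ltd."))
```

Working on tokens rather than on the raw string keeps each step a simple per-token lookup, at the cost of ignoring Nepali-script input, which this sketch does not handle.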

    2.3 Existing Name checking Systems

    To develop an effective name checking system, it is important to study similar existing
    systems so that the new system addresses some of their deficiencies. We mainly focused
    on the existing system used at OCR Nepal. A name checking system takes the name
    proposed by the customer and compares it with similar existing names. Based on the
    results, it determines whether the name may be registered.

    1. Office of Company Registrar, Nepal

    This system uses phonetic algorithms to check the names. The customer visits the
    homepage of the OCR [1] and enters the proposed name. The system checks this name
    against the existing names and determines whether the name is valid. The existing
    system, however, suffers from a lack of accuracy.

    2. Companies House, United Kingdom

    This system is used by the government of the United Kingdom to check proposed names.
    The client can visit the website [2] and check the intended name. The system returns a
    list of existing similar names.

    3. CIPC

    CIPC stands for Companies and Intellectual Property Commission. It provides a system
    that checks the availability of the name proposed by the customer. The client can visit
    the website [3], register by paying the fee, and then check the intended company name.
    The CIPC checks the name against existing registered businesses and rejects names that
    are too similar. The system also checks whether the name is reserved.

  • 7

    2.4 Criteria defined by OCR

    In approving a proposed name of company, the following shall not be considered different or

    distinguishable:

    1. The words Private, Pvt., (P), Limited, Ltd, Ltd., Limited Liability.

    2. The words appearing at the end of the names: company, and company, co., co.

    3. The plural version of any of the words appearing in the name.

    4. The type and case of letters, spacing between letters and punctuation marks.

    5. Joining words together or separating the words, as this does not make a name
    distinguishable from a name that uses the similar, separated or joined words. For
    example, Him Shikhar Travels Pvt. Ltd. will be considered similar to Himshikhar
    Travel.

    6. The use of number of the same word (and the use of tense in English), as this does not
    distinguish one name from another. For example, Three Six Five Tours and Travels Pvt.
    Ltd. will be considered similar to 365 Tours and Travels Pvt. Ltd.

    7. Using different phonetic spellings or spelling variations, as this does not distinguish
    one name from another. For example, if S.D. Enterprises Limited exists, then S and
    D Enterprises or Satya Darshan Enterprises will not be allowed.

    8. Similarly, if a name contains numeric characters like 3, 6 and 7, resemblance shall be
    checked with Three, Six and Seven.

    9. The use of an internet-related designation, such as .COM, .NET, .EDU, .GOV, .ORG,
    .IN, as this does not make a name distinguishable from another.

    10. The addition of words like New, Modern, Nav, Shri, Sri, Shree, Sree, Om, Jai, Sai,
    The, etc., as this does not make a name distinguishable from an existing name, such
    as New Kantipur Publication Pvt., Shree Sai Enterprises.

    11. Adding the name of a place like Kathmandu or Janakpur, as this does not make a
    name different or distinguishable. For example, Kathmandu Sugam Pharmaceuticals
    Private Ltd. cannot be allowed if Sugam Pharmaceuticals Private Ltd. already
    exists. Such names may be allowed only if a no-objection from the existing company,
    by way of Board resolution, is produced or submitted.

    12. Different combination of the same words, as this does not make a name

    distinguishable from an existing name, e.g., if there is a company in existence by the

  • 8

    name of Builders and Contractors Limited, the name Contractors and Builders

    Limited should not be allowed.

    13. Exact Nepali translation of the name of an existing company in English or other

    language. For example, Kathmandu Dairy Industry Limited will not be allowed if

    there exists a company with name Kathmandu Dugdh Udyog Limited.
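Several of the criteria above (designations, added words, case and punctuation, plurals, numbers, joined words, and word order) can be approximated by normalizing both names before comparing them. The sketch below is illustrative and is not the report's implementation: the sets `IGNORED` and `NUMBER_WORDS` are hypothetical subsets, treating "and" as ignorable is a simplification, and criteria that need outside knowledge (place names, abbreviations such as S.D., Nepali translation) are left out.

```python
import re
from typing import List

# Illustrative subsets only; the full lists would follow the OCR criteria.
IGNORED = {
    "private", "pvt", "p", "limited", "ltd", "liability", "company", "co", "and",  # criteria 1, 2
    "new", "modern", "nav", "shri", "sri", "shree", "sree", "om", "jai", "sai", "the",  # criterion 10
}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}


def normalize(name: str) -> List[str]:
    """Reduce a name to comparable tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())      # criterion 4: case, punctuation
    tokens = [t for t in tokens if t not in IGNORED]     # criteria 1, 2, 10
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]    # criteria 6, 8: number words to digits
    return [t[:-1] if t.endswith("s") and not t.endswith("ss") and len(t) > 3 else t
            for t in tokens]                             # criterion 3: simple plurals


def conflicts(a: str, b: str) -> bool:
    """True if the two names are indistinguishable under the rules sketched here."""
    ta, tb = normalize(a), normalize(b)
    same_when_joined = "".join(ta) == "".join(tb)        # criterion 5: joined or separated words
    same_word_set = sorted(ta) == sorted(tb)             # criterion 12: different word order
    return same_when_joined or same_word_set


print(conflicts("Him Shikhar Travels Pvt. Ltd.", "Himshikhar Travel"))
print(conflicts("Builders and Contractors Limited", "Contractors and Builders Limited"))
```

An exact match on normalized tokens only catches names that the criteria treat as identical; names that are merely close still need a graded similarity measure.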

    2.5 Matching Techniques

    Name matching can be defined as the process of determining whether two name strings are
    instances of the same name [18]. As name variations and errors are quite common [17], exact
    name comparison will not result in good matching quality. Rather, an approximate measure
    of how similar two names are is desired. Generally, a normalized similarity measure between
    1.0 (the two names are identical) and 0.0 (the two names are totally different) is used.
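As a small illustration of a normalized similarity measure, the sketch below uses `difflib.SequenceMatcher` from the Python standard library. This is only a stand-in chosen because it is readily available and returns a value in the range 0.0 to 1.0; it is not one of the measures used in this report.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a score in [0.0, 1.0]: 1.0 for identical names, 0.0 for no common characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


print(similarity("Kantipur", "Kantipur"))
print(similarity("Kantipur", "Kantipoor"))
```

A threshold on such a score can then separate names that are too similar from names that are acceptable.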

    The two main approaches for matching names are phonetic encoding and pattern matching.

    Different techniques have been developed for both approaches, and several techniques

    combine the two with the aim to improve the matching quality.

    2.5.1 Phonetic encoding

    Common to all phonetic encoding techniques is that they attempt to convert a string into a

    code according to how a string is pronounced (i.e. the way a string is spoken).

    Naturally, this process is language dependent. Most techniques have been developed mainly

    with English in mind.

    2.5.1.1 Soundex

    Soundex, based on English language pronunciation, is the oldest and best known phonetic encoding
    algorithm. It keeps the first letter in a string and converts the rest into numbers according to
    the following encoding table.

    Letters            Code
    a,e,h,i,o,u,w,y    0
    b,f,p,v            1
    c,g,j,k,q,s,x,z    2
    d,t                3
    l                  4
    m,n                5
    r                  6

    All zeros (vowels and h, w and y) are then removed and sequences of the same number

    are reduced to one only (e.g. 333 is replaced with 3). The final code is the original first

    letter and three numbers (longer codes are cut-off, and shorter codes are extended with

    zeros). As examples, the Soundex code for peter is p360, while the code for christen is

    c623. A major drawback of Soundex is that it keeps the first letter, thus any error or

    variation at the beginning of a name will result in a different Soundex code.
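
The encoding steps above can be sketched in Java as follows. This is a minimal illustration of the simplified procedure as described here (zeros dropped first, then repeated digits collapsed), not the project's own code. The class name SoundexSketch and the method encode are hypothetical, and the first letter is kept in lower case to match the examples in the text.

```java
import java.util.Locale;

final class SoundexSketch {
    // Digit for each letter a..z, taken from the encoding table above.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        String s = name.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        if (s.isEmpty()) {
            return "";
        }
        StringBuilder code = new StringBuilder().append(s.charAt(0)); // first letter is kept
        char previous = 0;
        for (int i = 1; i < s.length() && code.length() < 4; i++) {
            char digit = CODES.charAt(s.charAt(i) - 'a');
            if (digit == '0') {
                continue; // zeros (vowels, h, w, y) are dropped
            }
            if (digit != previous) {
                code.append(digit); // a run of the same digit collapses to one
            }
            previous = digit;
        }
        while (code.length() < 4) {
            code.append('0'); // short codes are padded with zeros
        }
        return code.toString();
    }
}
```

With this sketch, encode("peter") gives p360 and encode("christen") gives c623, while "kathmandu" and "cathmandu" receive different codes (k353 and c353) purely because of the first letter, which is the drawback noted above.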

    2.5.1.2 Metaphone

    Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing

    words by their English pronunciation. It fundamentally improves on the Soundex algorithm

    by using information about variations and inconsistencies in English spelling and

    pronunciation to produce a more accurate encoding, which does a better job of matching

    words and names which sound similar. As with Soundex, similar sounding words should

    share the same keys.

    The original author later produced a new version of the algorithm, which he named Double

    Metaphone. Contrary to the original algorithm whose application is limited to English only,

    this version takes into account spelling peculiarities of a number of other languages. In 2009

    Lawrence Philips released a third version, called Metaphone 3, which achieves an accuracy

    of approximately 99% for English words, non-English words familiar to Americans, and first

    names and family names commonly found in the United States, having been developed

    according to modern engineering standards against a test harness of prepared correct

    encodings.

    2.5.2 Pattern matching

    Pattern matching techniques are commonly used in approximate string matching [24, 25],

    which has widespread applications, from data linkage [22, 23] and duplicate detection [20,

    21], information retrieval [26], correction of spelling errors [27], approximate database joins,

    to bio- and health informatics [25]. These techniques can broadly be classified into edit

    distance and q-gram based techniques, plus several techniques specifically developed for

    name matching.

    A normalized similarity measure between 1.0 (strings are the same) and 0.0 (strings are

    totally different) is usually calculated. We will denote the length of a string s with |s|.


    2.5.2.1 Levenshtein or Edit Distance

    The Levenshtein distance [28] is defined to be the smallest number of edit operations
    (insertions, deletions and substitutions) required to change one string into another. In its basic
    form, each edit has cost 1. Using a dynamic programming algorithm [17], the distance
    (number of edits) between two strings s1 and s2 can be calculated in time O(|s1| × |s2|) using
    O(min(|s1|, |s2|)) space. The distance can be converted into a similarity measure (between
    0.0 and 1.0) using

    $$\mathrm{sim}_{ld}(s_1, s_2) = 1 - \frac{\mathrm{dist}_{ld}(s_1, s_2)}{\max(|s_1|, |s_2|)} \qquad (1)$$

    with dist_ld(s1, s2) being the actual Levenshtein distance function, which returns a value of 0
    if the strings are the same or a positive number of edits if they are different. The distance is
    never smaller than the difference between the two string lengths, which allows quick filtering
    of string pairs that have a large difference in their lengths.

    The distance between "Bob" and "Bob" is zero (0), because no edits are required to convert
    a string into itself. The edit distance between strings is only zero if the strings are identical.
    The distance between "Brett" and "Brent" is one (1), because it requires a substitution of an
    'n' for a 't'. The distance between "Brett" and "Bret" is one, requiring the deletion of one of
    the two 't' characters in "Brett". The sequence of edits must be minimal, but need not be
    unique. Further note that "Bret" can be converted to "Brett" with a single insertion of a 't'
    character.

    The distance between "Bob" and "bob" is also 1, as it requires the substitution of a lowercase
    'b' for its uppercase equivalent 'B'.

    Levenshtein distance is used to calculate the similarity of two strings. A standard Levenshtein
    distance is reported to be about 40% accurate [19]; standardizing the orthography of the strings
    can improve this to at most about 65% [3].
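
The dynamic programming computation and the conversion to a similarity score can be sketched as below. This is an illustrative version with unit edit costs that keeps only two rows of the table, so the space used is proportional to the shorter string. The class and method names are hypothetical and are not those of the project's Hybrid class.

```java
final class LevenshteinSketch {
    // Edit distance with unit costs for insertion, deletion and substitution.
    static int distance(String s1, String s2) {
        if (s1.length() < s2.length()) { // keep the shorter string as the row dimension
            String t = s1;
            s1 = s2;
            s2 = t;
        }
        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            prev[j] = j; // turning an empty prefix into s2[0..j) takes j insertions
        }
        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }

    // Normalized similarity: 1 - distance / max(|s1|, |s2|).
    static double similarity(String s1, String s2) {
        int longest = Math.max(s1.length(), s2.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(s1, s2) / longest;
    }
}
```

For the examples in the text, distance("Brett", "Brent") and distance("Bob", "bob") are both 1, and similarity("Brett", "Brent") is 1 - 1/5 = 0.8.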

    2.5.2.2 Sorensen similarity

    The Sorensen index, also known as Sorensen's similarity coefficient, is a statistic used for
    comparing the similarity of two samples. It was developed by the botanist Thorvald Sørensen
    and published in 1948. Sørensen's original formula was intended to be applied to
    presence/absence data, and is

    $$QS = \frac{2C}{A + B} = \frac{2|A \cap B|}{|A| + |B|} \qquad (2)$$

    where A and B are the number of species in samples A and B, respectively, and C is the
    number of species shared by the two samples; QS is the quotient of similarity and ranges
    from 0 to 1. This expression is easily extended to abundance instead of presence/absence of
    species. The Sorensen index is identical to Dice's coefficient, which always lies in the range [0, 1].
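
As a small illustration of equation (2), the sketch below treats the tokens of two names as the two samples. The class name SorensenSketch and the choice of name tokens as set elements are assumptions made for this example only.

```java
import java.util.HashSet;
import java.util.Set;

final class SorensenSketch {
    // QS = 2 |A ∩ B| / (|A| + |B|)
    static double index(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty samples are treated as identical
        }
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b); // keep only the elements present in both samples
        return 2.0 * shared.size() / (a.size() + b.size());
    }
}
```

For the token sets {nepal, metal} and {nepal, metal, trading}, two tokens are shared, so QS = 2 × 2 / (2 + 3) = 0.8.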

    2.5.2.3 Cosine Similarity

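    The cosine of the angle θ between two vectors can be derived from the Euclidean dot product formula:

    $$\mathbf{A} \cdot \mathbf{B} = \|\mathbf{A}\| \, \|\mathbf{B}\| \cos\theta \qquad (3)$$

    Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot
    product and magnitude as

    $$\text{similarity} = \cos\theta = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}} \qquad (4)$$

    The resulting similarity ranges from -1, meaning exactly opposite, to 1, meaning exactly the
    same, with 0 usually indicating independence, and in-between values indicating intermediate
    similarity or dissimilarity.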

    For text matching, the attribute vectors A and B are usually the term frequency vectors of the

    documents. The cosine similarity can be seen as a method of normalizing document length

    during comparison.
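
Equation (4) can be sketched in Java as below. For illustration the attribute vectors are letter-count vectors of two strings rather than term frequency vectors of documents; that choice, and the class and method names, are assumptions of this example.

```java
import java.util.Locale;

final class CosineSketch {
    // Counts of the letters a..z in the string; other characters are ignored.
    static int[] letterCounts(String s) {
        int[] counts = new int[26];
        for (char c : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    // cos(theta) = (A . B) / (||A|| ||B||); both vectors must have the same length.
    static double cosine(int[] a, int[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // similarity with an empty vector is defined as 0 here
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Because only counts enter the vectors, two anagrams such as "steel" and "sleet" have identical vectors and a cosine similarity of 1 (up to rounding), regardless of letter order.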

    2.6 Summary

    1. The background study focused on the uses of name checking systems, their

    effectiveness and usefulness.

    2. It helped us choose the design, methodologies, and programming tools that should be
    used to develop this system.

    3. It also examined the existing systems, their merits and their flaws.

    Chapter 3

    REQUIREMENT ANALYSIS

    3.1 Functional Requirements

    1. A true reflection of lexical similarity

    Strings with small differences should be recognized as being similar. In particular, a

    significant substring overlap should point to a high level of similarity between the strings.

    2. A robustness to changes of word order

    Two strings which contain the same words, but in a different order, should be recognized

    as being similar. On the other hand, if one string is just a random anagram of the

    characters contained in the other, then it should (usually) be recognized as dissimilar.

    3. Language Independence

    The system should work not only for English words, but also for Nepali words.

    4. Output file format

    The result should be stored in a file in comma-separated values (CSV) format.

    5. Easy integration

    The system should be easy to integrate with the existing system. The system should be

    easy to maintain by the maintenance personnel.
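
Requirement 4 can be illustrated with a small helper that formats one result line. This is only a sketch: the two-column layout (company name, percentage score) and the names CsvRowSketch, quote and row are hypothetical, since the report does not specify the columns of the output file.

```java
import java.util.Locale;

final class CsvRowSketch {
    // Quote a field if it contains a comma, a double quote or a line break.
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // One CSV line: company name, then the score with two decimals.
    static String row(String name, double scorePercent) {
        return quote(name) + "," + String.format(Locale.ROOT, "%.2f", scorePercent);
    }
}
```

Quoting matters here because company names may themselves contain commas, which would otherwise shift the score into the wrong column.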

    3.2 Non-Functional Requirements

    3.2.1 Reliability

    The system is required to be available at all times. This can be achieved by hosting
    it on a reliable server. The system is also built in Java, whose built-in memory
    management adds confidence in its reliability.

    3.2.2 Performance

    The system would be used by numerous customers throughout the country, so it was required
    to take minimum time to produce output. The main concern was the time
    taken to query the database to extract the relevant names and calculate the similarity
    scores. This time depends upon the type of processor used. The overall time from the
    customer's submission of a name to the output came to about 1 minute, but
    this time also depends upon the number of tokens generated.

    3.2.3 Accuracy

    The system is intended to be real time, so high accuracy must be maintained.
    This is ensured by using a morphological analyser and the Levenshtein algorithm in conjunction
    with the Kuhn-Munkres (Hungarian) algorithm and the Sorensen coefficient.

    Chapter 4

    METHODOLOGY

    4.1 Introduction

    Methodology is the analysis of the tasks to be done in order to obtain the desired output. An
    appropriate methodology largely determines whether a project succeeds. For
    this system, a number of methodologies were considered and the most efficient ones were
    used. No single method was used on its own; the
    most appropriate ones for the system were used in combination.

    The model used here is an iterative model, i.e. in the beginning a small subset of the software
    requirements is developed, and further versions are then enhanced through redesign and
    redevelopment. This process continues until the system
    produces the results specified in the system requirements.

    The methodology, once decided, was changed during the project whenever
    flaws emerged in the design, so appropriate methodologies
    were adopted as the situation required. In our scenario, the methodology comprises five steps.

    1. Building Base Dictionary

    2. Possible Keyword Generation

    3. Finding Possible Matches

    4. Finding Duplicates

    5. Finding Ranks

    1. Building Base Dictionary

    A base dictionary can be generated manually from the existing name database provided by OCR.
    The base dictionary used in our project consists of a file
    containing English words, a dictionary for transliteration, and a Nepali to English dictionary
    (provided by Madan Puraskar Pustakalaya). These dictionaries help in tokenization and possible
    keyword generation.

    2. Possible Keyword Generation:

    After tokenizing the given name, possible combinations of keywords are generated using
    both the English and Nepali words similar to them. Once the base keywords are obtained, they
    are transliterated and combined in every possible manner to form the probable
    similar keywords. These keywords are matched against the names in the OCR Names database.

    3. Finding Possible Matches:

    Possible names generated using base keywords are matched against the OCR Names database.
    For this, the names containing any of the keywords are extracted from the database.
    Each name is checked against the proposed name. The aim is to collect as many records
    as possible for better results. These records can contain duplicates too.

    4. Finding Duplicates:

    A name may be extracted from the Names database more than once; this happens when a name
    in the database contains two or more of the probable keywords. Repeated occurrences are
    removed so that each name appears only once.
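
A minimal sketch of this step is shown below. It assumes the extracted names are held in a list of strings and keeps the first occurrence of each; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

final class DuplicateFilterSketch {
    // Drops repeated names while preserving the order of first occurrence.
    static List<String> removeDuplicates(List<String> candidates) {
        return new ArrayList<>(new LinkedHashSet<>(candidates));
    }
}
```

A LinkedHashSet is used rather than a plain HashSet so that the retrieval order of the names is not disturbed.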

    5. Finding Ranks:

    The proposed name is assigned a value against each name extracted from the Names
    database. The value signifies the extent of matching. For calculating the match, we used:

    The Levenshtein algorithm, to calculate the similarity between tokens of the proposed name
    and of the name extracted from the database.

    The Kuhn-Munkres algorithm (also known as the Hungarian method), to find the optimal
    assignment of similarity weights between the tokens of the two strings being compared, i.e.
    the assignment that maximizes the sum of similarity weights.

    Sorensen's similarity coefficient, to derive a single similarity score (between
    0 and 1) from the result obtained through the Hungarian method.

    4.2 System Design

    This section gives a detailed review of the design on which the system is
    implemented. It includes

    1. Flow diagram

    2. Deployment diagram

    3. System architecture

    4. Detailed class diagram

    4.2.1 Flow Diagram

    Figure 1 Flow Chart


    4.2.2 Deployment Diagram

    The application is built around a client/server architecture. Multiple client machines can
    interact with the server simultaneously. Clients interact with the system through OCR's
    interactive website, while the server serves the clients' requests and does the
    processing in the backend.

    Figure 2 Deployment Diagram


    4.2.3 System Architecture

    Figure 3 System Architecture
    (Block diagram components: User Input, Query Processing, Preprocessing Engine, Translator +
    Transliterator, English-Nepali Dictionary, Keywords Generator, Database, Index Processor,
    Indexed Record, Preprocessing Engine, Comparator, Ranking Engine, Result Visualization)

    4.2.3.1 Preprocessing Engine

    The Preprocessing Engine applies five processes to the user input.

    1. Downcasting

    Downcasting here means converting the text from uppercase letters to lowercase (case
    folding, not the object oriented type cast). It ensures that two company names are not
    treated as distinct merely because they differ in letter case.

    2. Transformation

    Transformation is the conversion of British English words to their American
    English equivalents. It is done to avoid generating unwanted or
    conflicting keywords. Our dictionary consists of around 130 commonly used words that
    are converted from British to American English when found.

    3. Stopword Removal

    Stop word removal is the process of removing predefined stop words from the
    string. We used this process to remove the words that are considered
    similar/unimportant according to the Office of the Company Registrar. Words such as
    Shree, New, Modern, Industry, Udyog, Company, etc. are removed.

    Figure 4 Preprocessing Engine
    (Stages shown: Downcasting, Transformation, Stopword removal with a pool of stopwords,
    Tokenization, Stemming)

    4. Tokenization

    Tokenization is the process of breaking up a string into tokens to be indexed, using
    predefined dictionaries or by analyzing the whitespace. These dictionaries
    can be a pool of predefined words or a bilingual English-Nepali dictionary. Proper
    handling of strings, numbers and symbols is also important. For instance, tokenizing
    "nepal metals" outputs "nepal" and "metals".

    5. Stemming

    Stemming is the process of reducing a word, such as a plural form, to its root or a
    simpler form. Stemming is often used in text processing applications. There are many
    different approaches to stemming, each with its own design goals. Some are
    aggressive, reducing words to the smallest root possible. Here, stemming is done with
    the help of a morphological analyzer, so that the result is an
    English dictionary word. For example, words like "services" and "metals" are
    reduced to the simpler singular forms "service" and "metal".

    We used stemming to obtain the dictionary based root words. Using root words
    simplified the matching process.
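
The preprocessing steps above can be sketched as a single pass over a name. This is a simplified illustration, not the project's engine: the stop word set is only the handful of examples named above, the British to American transformation is omitted, tokens are split on whitespace and hyphens before stop words are filtered, and the stem method is a naive plural-stripping stand-in for the morphological analyzer. All names in the code are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PreprocessingSketch {
    // Illustrative subset only; the real pool follows the OCR directives.
    private static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("shree", "new", "modern", "industry", "udyog", "company"));

    static List<String> preprocess(String name) {
        List<String> tokens = new ArrayList<>();
        String lower = name.toLowerCase(Locale.ROOT);              // downcasting
        for (String token : lower.split("[\\s-]+")) {              // tokenization
            if (token.isEmpty() || STOPWORDS.contains(token)) {    // stop word removal
                continue;
            }
            tokens.add(stem(token));                               // stemming
        }
        return tokens;
    }

    // Naive stand-in for morphological analysis: strips a single plural "s".
    static String stem(String token) {
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }
}
```

Under these assumptions, preprocess("New Nepal Metals") yields the tokens nepal and metal.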

    4.2.3.2 Translation and Transliteration

    Translation is the conversion of the meaning of a source-language text by means of
    an equivalent target-language text. In this process, each keyword is first matched against
    the English dictionary. The matched words are then mapped through the English-Nepali
    dictionary provided by Madan Puraskar Pustakalaya to obtain the equivalent Nepali text.
    Unmatched words are simply kept alongside the translated
    tokens. For example, the words "nepal" and "metal" are mapped through the dictionary to get
    the words नेपाल and धातु.

    Transliteration is the conversion of a text from one script to another. To transliterate a
    Nepali word into English, we used dictionary mapping to map each Nepali syllable
    to English letters. In the translation example above, the words नेपाल and धातु are
    transliterated to "nepal" and "dhatu" and then added to the pool of keywords for further
    processing.
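
A toy version of such a mapping is sketched below. It is an assumption-laden illustration rather than the project's transliteration dictionary: the tables cover only the few Devanagari characters needed for the two example words, each consonant carries an inherent "a" unless a vowel sign follows, and the inherent vowel of a word-final consonant is dropped. Characters are written as Unicode escapes so the source stays ASCII.

```java
import java.util.HashMap;
import java.util.Map;

final class TransliterationSketch {
    private static final Map<Character, String> CONSONANTS = new HashMap<>();
    private static final Map<Character, String> VOWEL_SIGNS = new HashMap<>();

    static {
        CONSONANTS.put('\u0924', "t");   // Devanagari letter ta
        CONSONANTS.put('\u0927', "dh");  // Devanagari letter dha
        CONSONANTS.put('\u0928', "n");   // Devanagari letter na
        CONSONANTS.put('\u092A', "p");   // Devanagari letter pa
        CONSONANTS.put('\u0932', "l");   // Devanagari letter la
        VOWEL_SIGNS.put('\u093E', "a");  // vowel sign aa
        VOWEL_SIGNS.put('\u0941', "u");  // vowel sign u
        VOWEL_SIGNS.put('\u0947', "e");  // vowel sign e
    }

    static String transliterate(String word) {
        StringBuilder out = new StringBuilder();
        boolean pendingInherentA = false; // previous consonant still owes its inherent vowel
        for (char c : word.toCharArray()) {
            if (CONSONANTS.containsKey(c)) {
                if (pendingInherentA) {
                    out.append('a');
                }
                out.append(CONSONANTS.get(c));
                pendingInherentA = true;
            } else if (VOWEL_SIGNS.containsKey(c)) {
                out.append(VOWEL_SIGNS.get(c)); // the vowel sign replaces the inherent vowel
                pendingInherentA = false;
            } else {
                if (pendingInherentA) {
                    out.append('a');
                }
                out.append(c); // unmapped characters are passed through unchanged
                pendingInherentA = false;
            }
        }
        return out.toString(); // a trailing inherent vowel is deliberately dropped
    }
}
```

With these tables the two example words come out as "nepal" and "dhatu". A realistic mapping would also need conjuncts, independent vowels and nasalization marks, which this sketch ignores.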

    4.2.3.3 Possible Keyword Generation

    Keywords are generated by combining the keywords obtained from stemming with those obtained
    after transliteration. The generated keywords are then used to list the company names in the
    database that contain any of them. Each company name in the list is
    in turn processed by the preprocessing engine, and its stemmed keywords are extracted and
    kept as an indexed record for that company name, to be used in the comparison.
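
Retrieving the names that contain any of the keywords could be expressed as a parameterized SQL query, sketched below. The table name company_names and column name are hypothetical, as the report does not give the schema; the sketch only builds the statement text and the LIKE patterns and does not open a database connection.

```java
import java.util.Collections;
import java.util.List;

final class CandidateQuerySketch {
    // Builds "SELECT ... WHERE name LIKE ? OR name LIKE ? ..." with one placeholder per keyword.
    static String buildQuery(List<String> keywords) {
        if (keywords.isEmpty()) {
            throw new IllegalArgumentException("at least one keyword is required");
        }
        String conditions = String.join(" OR ",
                Collections.nCopies(keywords.size(), "name LIKE ?"));
        return "SELECT name FROM company_names WHERE " + conditions;
    }

    // Value to bind to a placeholder so that the keyword may occur anywhere in the name.
    static String likePattern(String keyword) {
        return "%" + keyword + "%";
    }
}
```

Binding the patterns through placeholders, for example with a JDBC PreparedStatement, keeps user-supplied keywords out of the SQL text itself.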

    4.2.3.4 Comparison

    Comparison is done between the tokens obtained from the company name entered by the user and
    the tokens generated from the company names extracted from the database on the basis of the
    user's keywords.

    The Levenshtein algorithm and the Kuhn-Munkres algorithm (Hungarian method) were used
    to compare the strings. The similarity is calculated in three steps:

    1. Partition each name into a list of tokens.

    2. Eliminate the common tokens.

    3. Compute the similarity between dissimilar tokens by using a string edit-distance
    algorithm.

    The first method uses an edit-distance string matching algorithm: Levenshtein. The string
    edit distance is the total cost of transforming one string into another using a set of edit rules,
    each of which has an associated cost. The Levenshtein distance is obtained by finding the
    cheapest way to transform one string into another. Transformations are the one-step
    operations of (single-phone) insertion, deletion and substitution. In the simplest version,
    substitutions cost about two units except when the source and target are identical, in which
    case the cost is zero. Insertions and deletions cost half as much as substitutions.

    Application of Hungarian Algorithm for Optimization

    The similarity values produced by the Levenshtein method are used as edge weights in a
    bipartite graph, on which the Hungarian algorithm is run.
    A related classical problem on matching in bipartite graphs is the assignment problem, which
    is the quest to find the optimal assignment of workers to jobs that maximizes the sum of
    ratings, given all non-negative ratings Cost[i,j] of each worker i to each job j.

    All relation scores are in the [0, 1] range; a score that reaches the maximum
    value (equal to 1) means that the two strings are completely similar.

    Application of Sorensen's Similarity Coefficient

    The result of the Hungarian method, which is the sum of similarity weights, is then applied to
    the Sorensen index to find the final single-value similarity score between the strings being
    compared. This final score (whose value lies between 0 and 1) is then converted into a
    percentage by multiplying it by 100.
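
The comparison pipeline can be sketched end to end as follows. Two simplifications are assumed and should be read as such: the optimal assignment is found by brute force over all pairings instead of the Kuhn-Munkres algorithm (adequate only for the handful of tokens in a company name), and the Sorensen step is taken to be 2S / (|A| + |B|), where S is the best sum of token similarities and |A|, |B| are the token counts. Common tokens are not eliminated beforehand; they simply pair with weight 1. All names are hypothetical.

```java
final class TokenMatchSketch {
    // Levenshtein-based similarity in [0, 1] with unit edit costs.
    static double similarity(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= s2.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
            }
        }
        int longest = Math.max(s1.length(), s2.length());
        return longest == 0 ? 1.0 : 1.0 - (double) d[s1.length()][s2.length()] / longest;
    }

    // Largest total weight of a one-to-one pairing of rows (from 'row' on) with unused columns.
    static double bestSum(double[][] weights, int row, boolean[] used) {
        if (row == weights.length) {
            return 0.0;
        }
        double best = bestSum(weights, row + 1, used); // option: leave this row unmatched
        for (int col = 0; col < used.length; col++) {
            if (!used[col]) {
                used[col] = true;
                best = Math.max(best, weights[row][col] + bestSum(weights, row + 1, used));
                used[col] = false;
            }
        }
        return best;
    }

    // Sorensen-style score 2S / (|A| + |B|) over the two token lists.
    static double score(String[] a, String[] b) {
        if (a.length + b.length == 0) {
            return 1.0;
        }
        double[][] weights = new double[a.length][b.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < b.length; j++) {
                weights[i][j] = similarity(a[i], b[j]);
            }
        }
        return 2.0 * bestSum(weights, 0, new boolean[b.length]) / (a.length + b.length);
    }
}
```

Under these assumptions the token lists {builders, contractors} and {contractors, builders} score 1.0, so a change of word order alone does not lower the score, while {nepal, metal} against {nepal} scores 2 × 1 / 3, about 0.67.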

    4.2.3.5 Ranking

    The result of each and every permutation is taken into consideration and the maximum
    matched percentage score is chosen. A list of company names is then generated, ordered by
    percentage similarity score.
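
The ordering step can be sketched as below. It assumes that each candidate name already carries its maximum percentage score and that the list is presented from the highest score to the lowest; the descending order and the names used are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class RankingSketch {
    // Returns the company names ordered from the highest score to the lowest.
    static List<String> rank(Map<String, Double> scoreByName) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scoreByName.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
            ranked.add(entry.getKey());
        }
        return ranked;
    }
}
```

Names with equal scores keep whatever relative order the input map yields, so a secondary key would be needed if a deterministic tie-break matters.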


    4.2.4 Detailed Class Diagram

    Figure 5 Detailed Class Diagram


    The system is implemented using the object oriented methodology. We have not used
    a framework of any kind. Some of the core classes of the system, along with their
    associations, are shown.

    Comparison System

    This system compares the result of running the user input through the preprocessing engine
    with the list obtained from the database.

    1. HungarianAlgorithmEdu Class

    This class uses the Hungarian algorithm to compute the highest possible matching score
    between the tokens of the two inputs. Its input is the
    weight matrix obtained from the Hybrid class and its output is the similarity score.
    The hgAlgorithm() method performs the Hungarian algorithm and the final similarity score
    is returned by the getScore() method.

    2. Hybrid Class

    This class uses the Levenshtein distance algorithm to calculate the edit
    distance between two tokens and gives the similarity score between them.
    The ComputeDistance() method computes the
    edit distance and GetSimilarity() returns the similarity between the tokens.

    3. Permutation Class

    This class permutes the user input tokens, and the tokens obtained by transliterating
    them, each among themselves. The permute()
    method performs the permutation.

    4. MatchsMaker Class

    This is the main class of the comparison system. It calls each of the components to
    perform the comparison and returns the output as a similarity percentage. GetScore() returns
    the similarity percentage and Initialize() initializes the necessary components.

    Database System

    1. DatabaseCredentials Class

    This class is used to store database credentials: the username,
    password and connection path. It can also be used as a JavaBean with
    set/get methods.

    2. DatabaseHandler

    DatabaseHandler class is used to initiate the database connection and also declaring

    the database type.

    3. CookSQL Class

    This class is used to prepare SQL statements.

    4. CompanyNameEnglish Class

    This class is the core of the package. This class contains the methods for individual

    record manipulation and resultset retrieval.

    5. ConnectDatabase

    This class is the bridge between database and the main interface and other class. This

    class is used to hide the details of the underlying database implementations.

    Preprocessing Engine

    This engine contains the components used to downcast, clean, transform, remove stop
    words, stem and tokenize.

    1. SpaceProcessor Class

    This class is used to tokenize a company name on spaces and hyphens (-) and to
    rejoin the individual tokens if necessary.

    getSplittedText(): This method splits the company name into tokens.

    joinSplittedText(): This method joins the tokens with spaces to regenerate the company
    name.

    2. StopwordRemover Class

    This class is used to remove the stop words as defined by the OCR directives.

    3. Stemmer Class

    The Stemmer class contains methods to generate root words. Stemming is achieved using
    the Snowball stemmer and morphological analysis.

    4. SymbolProcessor

    This class is used to clean the illegal symbols from names.


    4.3 Project Tools

    Programming Language: Java SE 7

    Database: MySQL Server Version 5.1.41

    Testing: JUnit testing

    Drawings: MS Paint, MS Visio, ArgoUML, Adobe Photoshop

    Documentation: MS Word/Excel/PowerPoint

    Platform: Windows

    IDE: Eclipse Indigo

    4.4 Eclipse as Programming IDE

    Eclipse was used as the IDE for project development. Eclipse is a multi-language software
    development platform comprising an IDE and a plug-in system to extend it. It is written
    primarily in Java and is used to develop applications in this language and, by means of the
    various plug-ins, in other languages as well: C/C++, COBOL, Python, Perl, PHP and more.
    The initial codebase originated from VisualAge. In its default form it is meant for Java

    developers, consisting of the Java Development Tools (JDT). Users can extend its

    capabilities by installing plug-ins written for the Eclipse software framework, such as

    development toolkits for other programming languages, and can write and contribute their

    own plug-in modules. Language packs provide translations into over a dozen natural

    languages. Released under the terms of the Eclipse Public License, Eclipse is free and open

    source software.

    4.5 MySQL as Database System

    MySQL was used as database server. It is a relational database management system

    (RDBMS) which has more than 11 million installations. The program runs as a server

    providing multi-user access to a number of databases. The project's source code is available

    under terms of the GNU General Public License, as well as under a variety of proprietary

    agreements.


    Chapter 5

    EXPERIMENTAL SETUP

    Hardware Configuration used for Testing

    Hardware Configuration:

    Computer Model: DELL 5110

    Physical Memory (RAM): 4.00 GB, DDR2

    Processor: Intel(R) Core(TM) i5-2450M CPU, 2.5 GHz

    System Type: 64-bit Operating System, x64-based processor

    Cache Size: 4096 KB

    OS: Windows 8 Enterprise

    Database: MySQL Server Version 5.5.24

    Database with 111,161 records of company names.

    Computer Model: Acer Aspire E1-531

    Physical Memory (RAM): 4.00 GB, DDR2

    Processor: Intel B960 Dual Core processor (2.2 Ghz, 2MB L3 cache)

    System Type: 64-bit Operating System, x64-based processor

    Cache Size: 4096 KB

    OS: Windows 8 Enterprise

    Database: MySQL Server Version 5.5.24

    Database with 111,161 records of company names.


    Chapter 6

    OUTPUT

    1. Output obtained by using input "durga enterprises"

    Figure 6 Example - I

    2. Output obtained by using input "hamro lagani"

    Figure 7 Example - II

    3. Output obtained by using input "jagadamba steels"

    Figure 8 Example - III

    4. Output obtained by using input "nawayug vidhya niketan kanchanpur"

    Figure 9 Example - IV

    5. Output obtained by using input "nepal investment company"

    Figure 10 Example - V

    6. Output obtained by using input "nepal one travels and tour"

    Figure 11 Example - VI

    7. Output obtained by using input "new age business consultant"

    Figure 12 Example - VII

    Chapter 7

    RESULT AND ANALYSIS

    To obtain the similarity scores, we tried various similarity measuring algorithms. The
    Levenshtein algorithm and the Hungarian algorithm together with the Sorensen coefficient seemed
    to fit our need best. The preprocessing steps applied before these algorithms proved to
    be fruitful. The scores obtained are saved in a file with the .csv extension. Stemming was used to
    obtain dictionary based root words. Tokenization and transliteration were used to obtain the
    tokens later used in the comparison process. We used translation and transliteration to cope
    with Nepali words. The accuracy was assessed by trying different names that could be used
    in reality.

    The computation time depends upon the number of tokens to be compared and, for now, the
    system is single threaded.

    Figure 13 Computation Time without Transformation

    Figure 13 shows the relation between the number of tokens and the time to compute similarity
    scores on two generations of Intel processors. The computation time is higher on the older
    processor and lower on the newer one. Furthermore, the more
    tokens there are, the greater the computation time. This result was obtained without the
    transformation process.

    Number of Tokens VS Computation Time (time to compute, in sec)

    Number of tokens (input)                           I5 CPU     Dual Core CPU
    1 token (Durga Enterprises)                        1.179      5.384
    2 tokens (jagadamba steels pvt. ltd)               1.434      7.316
    3 tokens (New Age Business Consultant Limited)     2.395      22.743
    4 tokens (Nepal One travels and tours Ltd.)        5.939      53.785

    Figure 14 Computation Time with Transformation

    Figure 14 shows the result obtained with the transformation process. Computation takes more
    time with transformation, but it yields better results. With appropriate hardware resources,
    this time can be brought within the constraint.

    For the comparison process, we initially used the cosine similarity algorithm, but it did not
    yield promising results. Cosine similarity does not consider the relative position of
    letters in the string; it only considers how often each letter occurs. Thus a string with a
    different spelling but the same letter counts is considered similar. This severely
    limited its use.

    The Levenshtein algorithm proved useful in our project. It considers the position of letters in
    a string, which is necessary for our system. This algorithm, along with the Hungarian algorithm,
    gave satisfactory results. To obtain the final score we used the Sorensen coefficient, whose
    value lies in the range [0, 1]. Multiplying this coefficient by 100 gave us the final
    percentage score.

    Number of Tokens VS Computation Time (time to compute, in sec)

    Number of tokens (input)                           I5 CPU     Dual Core CPU
    1 token (Durga Enterprises)                        1.664      8.959
    2 tokens (jagadamba steels pvt. ltd)               2.204      13.315
    3 tokens (New Age Business Consultant Limited)     11.952     39.994
    4 tokens (Nepal One travels and tours Ltd.)        37.743     107.498

    Chapter 8

    CONCLUSION AND FURTHER ENHANCEMENT

    7.1 Conclusion

    With all the accumulated effort invested in this project, there are reasons to believe that at

    the end of this semester this project will find itself in a much better shape and quite closer to

    actual acceptance than it was. We summarize the progress with respect to the main objectives

    of the project, namely, accuracy and speed.

    Accuracy: This is the main obstacle for the project. We have been constantly using

    and testing many different algorithms for similarity comparison. However we have

    been able to get satisfactory results using Levenshtein distance and Hungarian

    Method in conjunction with Sorensen Coefficient. We are further trying to improve

    the results by employing many other algorithms Phonetic (Double Metaphone) and

    using transformation function.

    Speed: Speed is also a challenging factor for this project. The requirement for shorter

    processing time has made it difficult to balance between accuracy and speed.

    However by using the processing capability of MySQL, we have been able to

    improve the speed resulting in shorter waiting time for the users. The use of adequate

    data structures have been of prominent advantage.

    We note that one apparent major obstacle to gaining acceptance for this project lies in the standards of the Office of Company Registrar.

    7.2 Limitations

    Our system has the following limitations.

    The system cannot process names that have numbers as a prefix or suffix.

    The Preprocessing Engine has several limitations. Stemming sometimes produces incorrect results if the input is a Nepali word. E.g. Spat in Nepali (steel in English) may result in spit due to morphology-based stemming. In such cases, the similarity score is reduced.

    The English-Nepali dictionary does not contain enough words. There are many English words for which no Nepali word is available.

    The transformation process increases computation time.

    Synonyms are not considered in the system.

    Strings such as "papermill" and "paper mill", though similar, are treated as different because of the space, which splits the second string into two tokens. Although both strings have the same meaning, our system does not consider them similar.

    7.3 Further Enhancement

    There is considerable scope to enhance this project in the future. The similarity checking algorithm offers the greatest potential for improvement. If phonetic similarity measures are incorporated, accuracy can be greatly improved. Implementing faster searching methods can greatly enhance the performance of the system.

    Using a taxonomy to classify the tokens, together with similarity measures, can help validate proposed names more accurately. A taxonomy can classify the context of names and thus improve the validation process.

    Furthermore, using a weighting measure to assign weights to the most common words might help increase the accuracy of the similarity score.
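    One possible form of such a weighting, shown purely as a sketch, is an inverse-document-frequency style weight: tokens that occur in many registered names count for less in the overlap. The functions `token_weights` and `weighted_sorensen`, the formula 1 + ln(N / df), and the default weight for unseen tokens are all assumptions for illustration; the report does not specify a weighting scheme.

```python
import math
from collections import Counter


def token_weights(registered_names):
    """Map each token to 1 + ln(N / df), so frequent tokens weigh less.

    registered_names is a list of token lists; df is the number of
    names in which a token appears at least once.
    """
    n = len(registered_names)
    df = Counter(tok for tokens in registered_names for tok in set(tokens))
    return {tok: 1.0 + math.log(n / count) for tok, count in df.items()}


def weighted_sorensen(tokens_a, tokens_b, weights, default=1.0):
    """Sorensen coefficient in which each token counts by its weight."""
    a, b = set(tokens_a), set(tokens_b)

    def total(tokens):
        # Tokens missing from the weight table get the assumed default.
        return sum(weights.get(t, default) for t in tokens)

    denominator = total(a) + total(b)
    return 2.0 * total(a & b) / denominator if denominator else 1.0


if __name__ == "__main__":
    names = [["nepal", "metal"], ["nepal", "dhatu"], ["nepal", "royal", "metal"]]
    weights = token_weights(names)
    print(weights)
    print(weighted_sorensen(["nepal", "metal"], ["nepal", "dhatu"], weights))
```

    In this toy data "nepal" appears in every name, so two names that share only "nepal" score lower than they would under the unweighted coefficient.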

    REFERENCES

    [1] Office of Company Registrar, Nepal. Retrieved from: www.ocr.gov.np. Date Retrieved:

    07/04/2013

    [2] Companies House. Retrieved from: http://wck2.companieshouse.gov.uk//wcframe?name=accessCompanyInfo. Date Retrieved: 04/07/2013

    [3] Companies and Intellectual Property Commission. Retrieved from:

    http://www.cipc.co.za/.

    Date Retrieved: 04/07/2013

    [4] Anne Kao and Stephen R. Poteet (Eds). Natural Language Processing and Text Mining.

    Springer 2006

    [5] Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online Applications. In Prof. Ruslan Mitkov, editor. John Benjamins Publishing Company, 2002

    [6] Ronan Collobert, Jason Weston, Leon Bottou, et al. Natural Language Processing (Almost) from Scratch. Editor: Michael Collins. NEC Laboratories America, 4 Independence Way, Princeton, NJ 08540

    [7] Prakash M Nadkarni, Lucila Ohno-Machado, Wendy W Chapman. Natural language

    processing: an introduction. Available from: group.bmj.com

    [8] Chris Manning, Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, May 1999. Available from: http://nlp.stanford.edu/fsnlp/

    [9] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI

    2001. Available from http://www.ebooksdirectory.com/details.php?ebook=6774

    [10] Daniël de Kok, Harm Brouwer. Natural Language Processing for the Working Programmer. 2011. Available from: http://nlpwp.org/book/

    [11] Aliseda, R. van Glabbeek, D. Westerstahl. Computing Natural Language. CSLI

    1998. Available from: http://www.e-booksdirectory.com/details.php?ebook=3940

    [12] Steven Bird, Ewan Klein, Edward Loper. Natural Language Processing with Python. O'Reilly Media 2009. Available from: http://www.ebooksdirectory.com/details.php?ebook=7184

    [13] Rob Malouf, Miles Osborne. An Introduction to Stochastic Attribute-Value Grammars. ESSLLI 2001. Available from: http://www.e-booksdirectory.com/details.php?ebook=6860

    [14] Shuly Wintner. Formal Language Theory for Natural Language Processing. ESSLLI 2001. Available from: http://www.e-booksdirectory.com/details.php?ebook=6774

    [15] Grosz, B.J., Jones, K.S., Webber, B.L. Readings in Natural Language Processing. Kaufman Publishers Inc., Los Altos, CA. Available from: http://www.osti.gov/energycitations/product.biblio.jsp?osti_id=6537037

    [16] Reilly, Ronan G. (Ed); Sharkey, Noel E. (Ed). Connectionist approaches to natural

    language processing. Hillsdale, NJ, England: Lawrence Erlbaum Associates, Inc. 1992.

    Available from: http://psycnet.apa.org/psycinfo/1992-98664-000

    [17] C. Friedman and R. Sideli. Tolerating spelling errors during patient validation. Computers and Biomedical Research, 25:486-509, 1992.

    [18] F. Patman and P. Thompson. Names: A new frontier in text mining. In ISI-2003, Springer LNCS 2665, pages 27-38.

    [19] Simon J. Greenhill. Computational Linguistics Volume 37 Issue 4, December 2011,

    pages 689-698.

    [20] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of ACM SIGKDD, pages 39-48, Washington DC, 2003.

    [21] C. L. Borgman and S. L. Siegfried. Getty's Synoname™ and its cousins: A survey of applications of personal name matching algorithms. Journal of the American Society for Information Science, 43(7):459-476, 1992.

    [22] P. Christen, T. Churches, and M. Hegland. Febrl - a parallel open source data linkage system. In PAKDD, Springer LNAI 3056, pages 638-647, Sydney, 2004.

    [23] P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In F. Guillet and H. Hamilton, editors, Quality Measures in Data Mining, Studies in Computational Intelligence. Springer, 2006.

    [24] P. A. Hall and G. R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381-402, 1980.

    [25] P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software - Practice and Experience, 26(12):1439-1458, 1996.

    [26] R. Gong and T. K. Chan. Syllable alignment: A novel model for phonetic string search. IEICE Transactions on Information and Systems, E89-D(1):332-339, 2006.

    [27] F. J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171-176, 1964.

    [28] G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31-88, 2001.

    APPENDIX A: Gantt chart

    Figure 15 Gantt Chart

    APPENDIX B: Use Case

    Figure 16 Use Case Diagram

    APPENDIX C: Preprocessing Detail Example

    For User Input (Preprocessing Engine)

    Methodology | Example (Nepal Metals Industries) | Process
    Downcasting | nepal metals industries | Conversion of input to lowercase.
    Transformation | Not applied in this example | Conversion of British English words to American English.
    Stopword Removal | nepal metals | Removal of stop words (Company, Industry, and Pvt.Ltd.) as mentioned in the draft.
    Tokenization | [nepal, metals] | Extraction of tokens.
    Stemming | [nepal, metal] | Reduction to root words.
    Translation | [nepal, metal] to [ , ] | Conversion of tokens from English to Nepali.
    Transliteration | [ , ] to [nepal, dhatu] | Conversion of Nepali Unicode to Roman script.

    Generated Keywords (using transliterated tokens + stemmed tokens): [nepal, metal, dhatu]

    Query to MySQL database, resulting in a list of company names.

    Company Name Extraction from Query (Preprocessing Engine)

    Methodology | Example, randomly chosen (Royal Metal Nepal Pvt.Ltd.) | Process
    Downcasting | royal metal nepal pvt.ltd. | Conversion of input to lowercase.
    Transformation | Not applied in this example | Conversion of British English words to American English.
    Stopword Removal | royal metal nepal | Removal of stop words (Company, Industry, and Pvt.Ltd.) as mentioned in the draft.
    Tokenization | [royal, metal, nepal] | Extraction of tokens.
    Stemming | [royal, metal, nepal] | Reduction to root words.

    Database Generated Keywords: [royal, metal, nepal]

    Comparison-1 (User Input Generated Keywords & Database Generated Keywords.)

    Company Name Extraction from Query (Preprocessing Engine)

    Methodology | Example, randomly chosen (Nepal Dhatu Industries) | Process
    Downcasting | nepal dhatu industries | Conversion of input to lowercase.
    Transformation | Not applied in this example | Conversion of British English words to American English.
    Stopword Removal | nepal dhatu | Removal of stop words (Company, Industry, and Pvt.Ltd.) as mentioned in the draft.
    Tokenization | [nepal, dhatu] | Extraction of tokens.
    Stemming | [nepal, dhatu] | Reduction to root words.

    Database Generated Keywords: [nepal, dhatu]

    Comparison-2 (User Input Generated Keywords & Database Generated Keywords.)
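    The English-side steps of the examples in this appendix (downcasting, stopword removal, tokenization, stemming) can be sketched roughly as follows. This is an illustrative approximation, not the project's code: the stop-word set, the `toy_stem` plural stripper and the function names are assumptions, tokenization is done before stop-word removal for simplicity, and the transformation, translation and transliteration steps are omitted.

```python
import re

# Assumed stop-word list, based on the words named in the tables above.
STOPWORDS = {"company", "industry", "industries", "pvt", "ltd"}


def toy_stem(token):
    """Very crude stand-in for a real stemmer: strip one plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(name):
    """Return the stemmed keyword list for a company name."""
    lowered = name.lower()                      # downcasting
    tokens = re.findall(r"[a-z]+", lowered)     # tokenization on letters only
    kept = [t for t in tokens if t not in STOPWORDS]  # stop-word removal
    return [toy_stem(t) for t in kept]          # stemming


if __name__ == "__main__":
    for name in ("Nepal Metals Industries",
                 "Royal Metal Nepal Pvt.Ltd.",
                 "Nepal Dhatu Industries"):
        print(name, "->", preprocess(name))
```

    On the three names used in this appendix the sketch yields the same stemmed tokens as the tables; a production stemmer would of course handle far more than plural endings.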

    APPENDIX D: Comparison Detail

    Figure 17 Comparison I (Part A)

    Figure 18 Comparison I (Part B)

    Figure 19 Comparison II (Part A)

    Figure 20 Comparison II (Part B)

    APPENDIX E: Output Screenshot

    Figure 21 Output Screenshot

    APPENDIX F: Data Flow Diagram

    Figure 22 Data Flow Diagram

    APPENDIX G: Theory

    Hungarian Algorithm

    The Hungarian Method assigns jobs to machines by one-to-one matching so as to identify the lowest-cost solution. Each job must be assigned to only one machine. It is assumed that every machine is capable of handling every job, and that the costs or values associated with each assignment combination are known and fixed. The number of rows and columns must be the same. The algorithm is as follows.

    1. Arrange the information in matrix form, with String 1 down the left and String 2 along the top, and the Levenshtein distance for each pair in the corresponding cell.

    2. Ensure that the matrix is a square by addition of the dummy rows/columns if necessary.

    Conventionally, each element in the dummy row/column is the same as the largest

    number in the matrix.

    3. Reduce the rows by subtracting the minimum value of each row from that row.

    4. Reduce the columns by subtracting the minimum value of each column from that column.

    5. Cover the zero elements with the minimum possible number of lines. If the number of lines is equal to the number of rows, go to step 9.

    6. Add the minimum uncovered element to every covered element. If an element is covered twice, add the minimum element to it twice.

    7. Subtract that same minimum element from every element in the matrix.

    8. Cover the zero elements again. If the number of lines covering the zero elements is not equal to the number of rows, return to step 6.

    9. Select a matching by choosing a set of zeros such that each row and each column has exactly one selected zero.

    10. Apply the matching to the original matrix, disregarding dummy rows.
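    To make the inputs and outputs of the procedure concrete, the sketch below builds the Levenshtein cost matrix (step 1), pads it with dummy rows or columns (step 2), and drops the dummy pairs at the end (step 10). It deliberately replaces the reduction and line-covering steps 3 to 9 with an exhaustive search over permutations, which reaches the same minimum-cost assignment but is only practical for the handful of tokens in a company name. The function names are illustrative assumptions.

```python
from itertools import permutations


def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def match_tokens(left, right):
    """Return (pairs, total_cost) of a minimum-cost one-to-one matching."""
    n = max(len(left), len(right))
    # Step 1: pairwise Levenshtein distances.
    cost = [[levenshtein(a, b) for b in right] for a in left]
    # Step 2: pad to a square matrix; dummy cells hold the largest entry.
    dummy = max((c for row in cost for c in row), default=0)
    cost = [row + [dummy] * (n - len(row)) for row in cost]
    cost += [[dummy] * n for _ in range(n - len(cost))]
    # Steps 3-9 are replaced here by brute force over column orders.
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    # Step 10: keep only pairs made of two real tokens.
    pairs = [(left[i], right[j]) for i, j in enumerate(best)
             if i < len(left) and j < len(right)]
    return pairs, sum(levenshtein(a, b) for a, b in pairs)


if __name__ == "__main__":
    print(match_tokens(["nepal", "metal", "dhatu"], ["royal", "metal", "nepal"]))
```

    For the keyword lists from Appendix C, the matching pairs "nepal" with "nepal" and "metal" with "metal" at zero cost, leaving "dhatu" matched to "royal".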

    Procedure of Metaphone Phonetic Algorithm

    Original Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY.[2] The '0' represents "th" (as an ASCII approximation of Θ), 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code.[3] The following list summarizes most of the rules in the original implementation:

    1. Drop duplicate adjacent letters, except for C.

    2. If the word begins with 'KN', 'GN', 'PN', 'AE', 'WR', drop the first letter.

    3. Drop 'B' if after 'M' at the end of the word.

    4. 'C' transforms to 'X' if followed by 'IA' or 'H' (unless in latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.

    5. 'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.

    6. Drop 'G' if followed by 'H' and 'H' is not at the end or before a vowel. Drop 'G' if

    followed by 'N' or 'NED' and is at the end.

    7. 'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G'

    transforms to 'K'.

    8. Drop 'H' if after vowel and not before a vowel.

    9. 'CK' transforms to 'K'.

    10. 'PH' transforms to 'F'.

    11. 'Q' transforms to 'K'.

    12. 'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.

    13. 'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if

    followed by 'CH'.

    14. 'V' transforms to 'F'.

    15. 'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.

    16. 'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.

    17. Drop 'Y' if not followed by a vowel.

    18. 'Z' transforms to 'S'.

    19. Drop all vowels unless it is the beginning.
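    To show how such rules turn spelling variants into the same code, the following partial sketch applies only the simpler rules from the list above (1, 2, 3, 9, 10, 11, 14, 18 and 19) as successive substitutions. It omits the context-sensitive rules for C, D, G, H, S, T, W, X and Y, so its output is not a full Metaphone code; the function name `partial_metaphone` and the rule ordering are assumptions made for illustration.

```python
import re


def partial_metaphone(word):
    """Apply a subset of the Metaphone rules listed above (not the full set)."""
    w = re.sub(r"[^A-Z]", "", word.upper())
    w = re.sub(r"([^C])\1+", r"\1", w)            # rule 1: drop duplicates, except C
    w = re.sub(r"^(KN|GN|PN|AE|WR)",
               lambda m: m.group(1)[1:], w)       # rule 2: drop first letter
    w = re.sub(r"MB$", "M", w)                    # rule 3: drop B after M at the end
    w = w.replace("CK", "K")                      # rule 9
    w = w.replace("PH", "F")                      # rule 10
    w = w.replace("Q", "K")                       # rule 11
    w = w.replace("V", "F")                       # rule 14
    w = w.replace("Z", "S")                       # rule 18
    # Rule 19: drop all vowels unless at the beginning of the word.
    return w[:1] + re.sub(r"[AEIOU]", "", w[1:])


if __name__ == "__main__":
    for word in ("phone", "fone", "knock", "quick", "quik"):
        print(word, "->", partial_metaphone(word))
```

    Even with this reduced rule set, "phone" and "fone" receive the same code, as do "quick" and "quik", which is the behaviour that makes phonetic codes useful for catching similar-sounding names.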