Shallow Parser for Hindi Language with Input from a Transliterator

Shashank 10503883

Harshit Goel 10103559

B-Tech Project

Project Mentor : Ms. Parmeet Kaur

Shallow Parser With Input From A Transliterator

• Introduction

• Literature Review

• Problem Statement

• Plan of Action

• System Architecture

• Flow Chart

• Conclusion & Findings

• References

Content

Shallow Parser

Morphological Analyzer

Transliteration

Introduction

Shallow parsing (also chunking, "light parsing") is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.), but does not specify their internal structure, nor their role in the main sentence.

It is a technique widely used in natural language processing. It is similar to the concept of lexical analysis for computer languages.

Shallow Parser

A "parser" is a system that transforms sentences (strings of characters) into a representation that describes the groupings of words (phrases) and their relations (e.g. subject and object). The representation of choice for such information is a syntactic tree in which nodes refer to phrases, word categories, or words, and links refer to relations between these objects.

Why Shallow Parser?

A full parser turns the sentence into a tree whose leaves hold POS tags (which correspond to the words in the sentence), while the rest of the tree shows exactly how these words join together to form the overall sentence.

For example, an adjective and a noun might combine into a 'Noun Phrase', which might combine with another adjective to form a larger Noun Phrase (e.g. quick brown fox); the exact way the pieces combine depends on the parser in question.

A plain POS tagger is really fast but does not give you enough information, while a full-blown parser is slow and gives you too much. A shallow parser or 'chunker' comes somewhere in between these two. A POS tagger can be thought of as a parser which only returns the bottom-most tier of the parse tree to you.

A chunker might be thought of as a parser that returns some other tier of the parse tree to you instead. Sometimes you just need to know that a bunch of words together form a Noun Phrase but don't care about the sub-structure of the tree within those words (i.e. which words are adjectives, determiners, nouns, etc., and how they combine). In such cases you can use a chunker to get exactly the information you need instead of wasting time generating the full parse tree for the sentence.
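The idea of returning only one tier of the tree can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it assumes the words have already been POS-tagged, and the tag sets `NP_TAGS` and `VP_TAGS` and the function name `chunk` are hypothetical choices made for this example.

```python
# Hypothetical tag sets (assumptions for this sketch, not a standard).
NP_TAGS = {"DT", "JJ", "NN"}   # determiner, adjective, noun
VP_TAGS = {"VB", "VBZ"}        # verb forms

def chunk(tagged):
    """Group consecutive (word, tag) pairs into flat NP / VP chunks.

    The internal structure of each chunk is deliberately not analysed.
    """
    chunks = []
    for word, tag in tagged:
        if tag in NP_TAGS:
            label = "NP"
        elif tag in VP_TAGS:
            label = "VP"
        else:
            label = "O"  # outside any chunk
        # Extend the previous chunk when the label repeats, else start a new one.
        if chunks and chunks[-1][0] == label and label != "O":
            chunks[-1][1].append(word)
        else:
            chunks.append((label, [word]))
    return chunks

tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
          ("fox", "NN"), ("jumps", "VBZ")]
print(chunk(tagged))
```

The chunker reports that "the quick brown fox" is a Noun Phrase without saying how the adjectives attach to the noun. Note that this naive grouping merges two adjacent noun phrases into one, which a real chunker would need to handle.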

Difference between Shallow Parser and POS Tagger

Morphology

Morphology is the part of linguistics that deals with the study of words, their internal structure and, partially, their meanings. It includes the identification of a word stem from a full word form. A morpheme is the smallest unit that carries meaning and fulfills some grammatical function.


Morphological analysis

Morphological Analysis is the process of providing grammatical information of a word given its suffix.

Models

There are three principal approaches to morphology, each of which tries to capture the distinctions above in a different way. These are:

• Morpheme-based morphology, also known as the Item-and-Arrangement approach.

• Lexeme-based morphology, also known as the Item-and-Process approach.

• Word-based morphology, also known as the Word-and-Paradigm approach.

Morphological Analysis and Models


A morphological analyzer is a program that analyzes the morphology of an input word; it detects the morphemes in any text.

Presently we are referring to two types of morph analyzers for Indian languages:

1. Phrase level Morph Analyzer

2. Word level Morph Analyzer


Transliteration is the conversion of a text from one script to another.

For instance:

kaay kam karato = काय काम करतो

kyaa chal rahaa hai = क्या चल रहा है

Transliteration can form an essential part of transcription, which converts text from one writing system into another. Transliteration is not concerned with representing the phonemics of the original.
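A Roman-to-Devanagari conversion of this kind can be sketched as a greedy longest-match lookup. The code below is a minimal sketch under stated assumptions, not the project's transliterator: the mapping tables are deliberately incomplete, the romanization scheme (doubled letters for long vowels) is an assumption, and the names `CONSONANTS`, `MATRAS`, `INDEPENDENT_VOWELS` and `transliterate` are hypothetical.

```python
# Hypothetical, incomplete mapping tables (assumptions for illustration only).
CONSONANTS = {
    "kh": "ख", "gh": "घ", "ch": "च", "jh": "झ", "th": "थ", "dh": "ध",
    "ph": "फ", "bh": "भ", "sh": "श",
    "k": "क", "g": "ग", "j": "ज", "t": "त", "d": "द", "n": "न",
    "p": "प", "b": "ब", "m": "म", "y": "य", "r": "र", "l": "ल",
    "v": "व", "s": "स", "h": "ह",
}
# Vowel signs used after a consonant; the inherent "a" needs no sign.
MATRAS = {
    "aa": "\u093e", "ai": "\u0948", "au": "\u094c", "ee": "\u0940",
    "ii": "\u0940", "oo": "\u0942", "uu": "\u0942",
    "a": "", "i": "\u093f", "u": "\u0941", "e": "\u0947", "o": "\u094b",
}
# Full vowel letters used at the start of a word or after another vowel.
INDEPENDENT_VOWELS = {
    "aa": "आ", "ai": "ऐ", "au": "औ", "ee": "ई", "ii": "ई", "oo": "ऊ",
    "uu": "ऊ", "a": "अ", "i": "इ", "u": "उ", "e": "ए", "o": "ओ",
}
VIRAMA = "\u094d"  # joins two consonants into a conjunct

def transliterate(text):
    """Convert Roman-script text to Devanagari using longest-match lookup."""
    text = text.lower()
    out = []
    after_consonant = False
    i = 0
    while i < len(text):
        for size in (2, 1):  # try two-letter units before single letters
            chunk = text[i:i + size]
            if chunk in CONSONANTS:
                if after_consonant:
                    out.append(VIRAMA)  # consonant cluster, e.g. "ky"
                out.append(CONSONANTS[chunk])
                after_consonant = True
                break
            if chunk in MATRAS:
                table = MATRAS if after_consonant else INDEPENDENT_VOWELS
                out.append(table[chunk])
                after_consonant = False
                break
        else:
            chunk = text[i]  # spaces, punctuation, unmapped letters pass through
            out.append(chunk)
            after_consonant = False
        i += len(chunk)
    return "".join(out)

print(transliterate("kyaa chal rahaa hai"))
```

A word-final consonant is left without a virama, which matches the usual Hindi spelling of words such as राम. The sketch depends on the user marking long vowels explicitly: "raam" gives राम, whereas "ram" would give रम.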

Transliteration

We researched the project in detail through research papers, blogs and other internet sources. There are various approaches to developing morphological analyzers, such as the Finite State Automata (FSA) approach, the Two Level Morphology approach, the Finite State Transducers (FST) approach, stemmer algorithms, the Corpus Based Approach, DAWG (Directed Acyclic Word Graph) and the Paradigm Based Approach. Of these, we found the FST based approach to be the most efficient for developing a morphological analyzer for Hindi, which is a highly inflectional language.

There are several approaches to constructing a shallow parser, such as the chunker based shallow parser, the HMM based shallow parser, the memory based shallow parser, the shallow parser based on conditional random fields and the shallow parser based on the Winnow algorithm. Among these, the shallow parser based on conditional random fields is reported in the surveyed literature to be the most efficient and flexible approach. Shallow parsers are essential tools for various NLP applications, as they provide a useful partial analysis of natural language text while avoiding the complexity inherent in a complete parser. Thus, shallow parsers are important for applications that require only a syntactic analysis of the sentence and do not require the relationships between its chunks. These include auto-text summarization, speech-to-speech translation systems and text-mining applications.

Literature Survey-Summary

Many cultures around the world use different scripts to represent their languages. By transliterating, people can make their languages more accessible to people who do not understand their scripts. For example, to someone who knows only the Roman alphabet, the name محمد is incomprehensible. However, when it is transliterated as Muhammad, readers of the Roman alphabet understand that it means the Muslim prophet Muhammad.

So the transliterator helps non-native speakers type a Hindi phrase in Roman script using any keyboard, and thus provides the input for the Shallow Parser.

Literature Survey-Summary

We intend to develop a ‘Shallow Parser for Hindi Language’ and an FST based Morphological Analyzer which can be used as tools in building more application-specific tools like auto-text summarizers, speech-to-speech translators etc. The key objective of the project is to provide the shallow parser and morphological analyzer as open source software.

We also want to develop a simple tool to convert Roman script to Indic (Devanagari) script. As most keyboards are English, writing in Indic script is difficult, whereas it is easy to write Hindi in Roman script. This inspired us to make a tool for Linux for writing Hindi text easily.

Problem Statement

Plan of Action

1. Transliteration

2. Lexicon Generator

3. Morphological Analyzer

4. Shallow Parsing

1. Transliterator

Figure: Block Diagram of transliteration process

It is a simple tool to convert Roman script to Indic (Devanagari) script. As most keyboards are English, writing in Indic script is difficult, whereas it is easy to write Hindi in Roman script. This inspired us to make a tool for Linux for writing Hindi text easily.

2. Lexicon Generator

Figure: Block Diagram of Lexicon Generation

There are three steps in processing the corpus to extract the words. The first step is to extract the words from the sentences of the given corpus. In the next step, duplicate words are removed to obtain the unique words. After that, the words are sorted, which makes manual processing of the words, such as classifying them, easier. The lexicon files for each word class are organised according to inflection and derivation types.
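The three steps (extract, de-duplicate, sort) can be sketched as follows. This is a minimal illustration only: the function name `build_lexicon`, the punctuation set and the sample sentences are assumptions, and the later manual classification into word classes is not modelled.

```python
# Punctuation stripped from word edges; includes the Devanagari danda (assumed set).
PUNCTUATION = ".,;:!?|\u0964\"'()"

def build_lexicon(sentences):
    """Return the sorted list of unique words found in the corpus sentences."""
    words = []
    # Step 1: extract the words from each sentence.
    for sentence in sentences:
        for token in sentence.split():
            word = token.strip(PUNCTUATION).lower()
            if word:
                words.append(word)
    # Step 2: remove duplicates. Step 3: sort for easier manual review.
    return sorted(set(words))

corpus = ["Ram school jaata hai.", "Ram ghar jaata hai."]
print(build_lexicon(corpus))
```

Splitting on whitespace is used here instead of a `\w+` regular expression because Python's `\w` does not match Devanagari vowel signs, so a regex-based tokenizer would break Devanagari words apart.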

3. Morphological Analyzer

Figure: Architecture of the Morphological Processor

The analyzer takes a word in its surface form as input and produces the grammatical structure of the word, i.e. its lexical form, as output. The generator works in the opposite direction: it takes the grammatical structure of the word (the lexical form) as input and produces the corresponding surface form.
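The two directions can be illustrated with a small rule table that is read one way for analysis and the other way for generation. This is a toy sketch, not an FST and not the project's analyzer: the three suffix rules, the feature strings and the names `SUFFIX_RULES`, `analyze` and `generate` are assumptions chosen for illustration, and they cover only a few Hindi noun plurals.

```python
# Hypothetical rules: (surface suffix, lexical ending, grammatical features).
# Longer suffixes are listed first so that they are tried first.
SUFFIX_RULES = [
    ("ियाँ", "ी", "noun+fem+pl+direct"),   # नदियाँ -> नदी
    ("ों", "", "noun+pl+oblique"),         # घरों -> घर
    ("ें", "", "noun+fem+pl+direct"),      # किताबें -> किताब
]

def analyze(surface):
    """Surface form -> (lexical form, features), or None if no rule applies."""
    for suffix, ending, features in SUFFIX_RULES:
        if surface.endswith(suffix) and len(surface) > len(suffix):
            stem = surface[:-len(suffix)]
            return stem + ending, features
    return None

def generate(lemma, features):
    """(Lexical form, features) -> surface form, or None if no rule applies."""
    for suffix, ending, rule_features in SUFFIX_RULES:
        if rule_features == features and lemma.endswith(ending):
            stem = lemma[:len(lemma) - len(ending)]
            return stem + suffix
    return None

print(analyze("नदियाँ"))
print(generate("घर", "noun+pl+oblique"))
```

A real analyzer needs far more rules and a lexicon to check that the recovered stem exists; for instance, the second rule here would wrongly analyse the oblique plural of a masculine noun ending in ा.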

4. Shallow Parsing by CFG

A CFG is a 4-tuple <N, E, R, S>:

• A set of non-terminals N (e.g. N = {S, NP, VP, PP, Noun, Verb, ...})

• A set of terminals E (e.g. E = {In, the, popular, mythology, the, computer, is, a, mathematics, machine})

• A set of rules R

• A start symbol S (sentence)
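The 4-tuple can be written down directly as data and checked with a standard CYK recognizer. The sketch below is an illustration under assumptions: the rule set `R` is a hypothetical toy grammar in Chomsky normal form built from a few of the example terminals above, not the hand-crafted Hindi rules used in the project.

```python
# A toy CFG as the 4-tuple <N, E, R, S>; the rules are assumptions for illustration.
N = {"S", "NP", "VP", "Det", "Noun", "Verb"}
E = {"the", "a", "computer", "machine", "is"}
R = {
    ("S", ("NP", "VP")),
    ("NP", ("Det", "Noun")),
    ("VP", ("Verb", "NP")),
    ("Det", ("the",)),
    ("Det", ("a",)),
    ("Noun", ("computer",)),
    ("Noun", ("machine",)),
    ("Verb", ("is",)),
}
S = "S"

def recognize(words):
    """CYK recognition: True if the start symbol S derives the word list."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = {lhs for lhs, rhs in R if rhs == (word,)}
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for lhs, rhs in R:
                    if (len(rhs) == 2 and rhs[0] in table[i][k]
                            and rhs[1] in table[k + 1][j]):
                        table[i][j].add(lhs)
    return S in table[0][n - 1]

print(recognize("the computer is a machine".split()))
```

For shallow parsing, the useful part of the table is the set of phrase labels found for each span (for example NP over "the computer"), rather than the yes/no answer for the whole sentence.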

System Architecture

Flow Chart

Input: Ram School Jaata Hai.

Transliterator

Output1: राम स्कूल जाता है ।

Shallow Parser

Output2: NP NP VP

NP – Noun Phrase

VP – Verb Phrase

Findings and Conclusion

It is challenging to translate names and technical terms across languages with different alphabets and sound inventories. These items are commonly transliterated, i.e., replaced with approximate phonetic equivalents. An efficient shallow parser for Hindi is needed to build a full-blown parser.

Since proper nouns and technical terms, which need phonetic translation, are part of most text documents, transliteration is an important problem to study.

• Found only a few shallow parsers for Hindi.

• Analysed different approaches for creating a shallow parser.

• Parsing by CFG is the approach used.

• The approach is labour-intensive, as rules are crafted manually.

References

• ‘Transliterated Search using Syllabification Approach’ by Hardik Joshi, Apurva Bhatt, Honey Patel

• ‘Transliteration Systems Across Indian Languages Using Parallel Corpora’ by Rishabh Srivastava and Riyaz Ahmad Bhat

• ‘Semi-Supervised Learning of Hindi Morphology’ by Teena Bajaj and Parteek Bhatia

• ‘Phonetically Rich Hindi Sentence Corpus for Creation of Speech Database’ by Vishal Chourasia, Samudravijaya K, Manohar Chandwani
