Sanskrit Lexical Analyser (A project module of the Universal Digital Library, IIIT-Allahabad)


Report by:

Rajeev Kumar 20031032, IIIT-Allahabad, India.

Email: [email protected]

Guidance: Dr. Ratna Sanyal, Dr. Sudip Sanyal, Dr. U. S. Tiwary

Indian Institute of Information Technology, Allahabad, Devghat, Jhalwa, Allahabad, U.P., India

Abstract:

Sanskrit is one of the earliest attested members of the Indo-European language family. It is not only a classical language but also an official language of India. By classical we mean ancient, independently originated and rich in literature. It holds a position in India similar to that of Latin and Greek in Europe, and is a central part of the Hindu/Vedic traditions. Many Indian languages, and some foreign ones, trace their origin to Sanskrit. Paninian grammar is still regarded as the mother of all grammars. Sanskrit has thus retained its position and charm in its original form.

The present project concerns the morphological and lexical analysis of Sanskrit words. I have attempted "sandhi-wichched" (सन्धि विग्रह) and the identification of "vibhaktis" (विभक्ति) from "shabdroop". I have designed my own rule format (by extending the basic rules).

Motivation & Objective

Motivation

People both in India and abroad are slowly and steadily realizing the importance of ancient scripts in the diverse fields of science, commerce and the arts. Also, as spiritual awareness sweeps the world, great efforts are being made to present the Vedic scriptures to the common people in different languages. Swami Dayanand Saraswati once rightly said, "Back to the Vedas".

Researchers in Germany have already developed a machine translator from Sanskrit to German, and people at MIT are also working on Sanskrit on a large scale. In India, work on Sanskrit is being carried out at IIIT-Allahabad, the IITs, IIIT-Hyderabad, etc.

Objective

The present project concerns the morphological and lexical analysis of Sanskrit words. The lexical analyser may be applied in the machine translation domain of Natural Language Processing (by mapping the parse tree of Sanskrit onto that of another language). It may also be used for information retrieval.

Lexicon analysis & NLP

Sentences pass through the following stages of analysis:

• Lexical analysis (sandhi, vibhakti): root words, gender, form, marker

• Morphological & syntactic analysis (grammar and parse tree)

• Semantic analysis (understanding the meaning)

• Context analysis

How to proceed with morphological analysis?

Here we basically deal with finding the root word. Since a given word may be composed of two words, one may separate these words, e.g. रवीन्द्र = रवि + इन्द्र.

Also, a marker may be attached to the word, e.g.

मुनिना (muninaa): muni -- naa

मुनिभ्याम् (munibhyaam): muni -- bhyaam

मुनिभिः (munibhiH): muni -- bhiH

Here naa, bhyaam and bhiH are the markers. So, if we remove the marker, we get the root word.

Sandhi (सन्धि विग्रह)

• What is sandhi?

Sandhi refers to the combination of words when they are spoken together without a gap. The modification that arises when two sounds come into very close contact is called sandhi.

• Altogether there are 5 + 14 + 7 = 26 crisp rules on which sandhi vigrah is carried out:

5 for swar sandhi (स्वर सन्धि)

14 for wyanjan sandhi (व्यंजन सन्धि)

7 for wisarg sandhi (विसर्ग सन्धि)

• During the formation of sandhi words, there are 3 cases to deal with:

1) Effect on the first word only (i.e. the last letter of the first word changes):

सत् + जनः = सज्जनः

2) Effect on the second word only (the first letter of the second word changes):

वृक्ष + छाया = वृक्षच्छाया

3) Effect on both words (the last letter of the first word and the first letter of the second word change):

रवि + इन्द्र = रवीन्द्र

• I expanded those sandhi rules (from 26 to about 100).

e.g. अकः सवर्णे दीर्घः: if an ak vowel is followed by a savarna (homogeneous) vowel, the two vowels are replaced by the corresponding long vowel.

ak: अ, आ, इ, उ, ऋ, लृ

savarna vowels: अ, आ, इ, ई, उ, ऊ, ऋ, ॠ, लृ

I expanded them in the form:

अ + अ = आ (a + a = aa)

इ + ई = ई (i + ii = ii)

... and so on, i.e. each letter is dealt with independently.

• Rule format, i.e. the format in which the rules are stored in the machine:

effectOn--result--left--right

e.g. इ + इ = ई (i + i = ii), as in रवि + इन्द्र = रवीन्द्र

effectOn takes the values f, s, b:

f: first

s: second

b: both

e.g. रवि + इन्द्र = रवीन्द्र (ravi + indra = raviindra)

Here both words (left and right) underwent changes. So the rule for this case is:

effectOn--result--left--right

b--ii--i--i

i.e. rav-ii-ndra: if you find "ii", then add "i" on the left (rav + "i") and add "i" on the right ("i" + ndra). So you get ravi + indra.

Rule format:

Consider: अकः सवर्णे दीर्घः

ak: अ, आ, इ, उ, ऋ, लृ

savarna vowels: अ, आ, इ, ई, उ, ऊ, ऋ, ॠ, लृ

Rule format in the text rule file: effectOn--result--left--right

effectOn: character

f : the result is obtained by making changes in the first word

s : the result is obtained by making changes in the second word

b : the result is obtained by making changes in both words

result: string

left: OR-separated strings (e.g. for ak: अ|आ|इ|उ|ऋ|लृ|), in the form string|string|......|string|

right: OR-separated strings (e.g. for the savarna vowels: अ|आ|इ|ई|उ|ऊ|ऋ|ॠ|लृ|), in the form string|string|......|string|
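As an illustration of the rule format described above, the following is a minimal sketch of how one rule line could be parsed. The class name SandhiRule and its field names are hypothetical and are not taken from the project's source; the sketch only assumes that fields are separated by "--" and alternatives by "|".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical holder for one rule line of the form effectOn--result--left--right. */
final class SandhiRule {
    final char effectOn;       // 'f' = first word, 's' = second word, 'b' = both
    final String result;       // what appears in the joined word
    final List<String> left;   // alternatives for the end of the left word
    final List<String> right;  // alternatives for the start of the right word

    private SandhiRule(char effectOn, String result, List<String> left, List<String> right) {
        this.effectOn = effectOn;
        this.result = result;
        this.left = left;
        this.right = right;
    }

    /** Parses one line such as "b--ii--i--i" or "f--s--H|--t|th|". */
    static SandhiRule parse(String line) {
        String[] parts = line.trim().split("--");
        if (parts.length != 4 || parts[0].length() != 1
                || "fsb".indexOf(parts[0].charAt(0)) < 0) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        return new SandhiRule(parts[0].charAt(0), parts[1],
                alternatives(parts[2]), alternatives(parts[3]));
    }

    /** Splits "t|th|" into [t, th]; String.split drops the trailing empty piece. */
    private static List<String> alternatives(String orSeparated) {
        return Arrays.asList(orSeparated.split("\\|"));
    }
}
```

For example, SandhiRule.parse("f--s--H|--t|th|") yields effectOn 'f', result "s", left [H] and right [t, th].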

Storage used:

I have stored the rules in files indexed on their starting letter (the first letter of the result part), i.e. for the above case, b--ii--i--i, the rule is in the file "i.txt". Similarly, words/root words are stored in files indexed on their starting letter, i.e. raviindra is stored in "r.txt". Only one word per line is stored in the corresponding file.

Sandhi algorithm (my approach)

Step1: receive a word for doing sandhi wichched.

Step2: try iteratively to break the word into two parts (left and right), e.g. if the word is kastu then:

kast u

kas tu

ka stu

k astu

and try sandhi wichched on each split. (Note: sandhi wigrah is word = leftWord + rightWord, i.e. the ending of leftWord and the start of rightWord are joined to form the sandhi. To detect the effect of this joining, I assumed that the affected part is contained in the right part, e.g. kastu = kaH + tu. So when I receive "kastu" for sandhi wichched, I take left as "ka" and right as "stu", and pass "ka" (left) and "stu" (right) on for sandhi wichched.)
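The iterative splitting in Step2 can be sketched as follows. The class and method names are hypothetical; the sketch assumes that the word is split at every character position of its roman representation, which means that multi-character roman letters such as "th" or "aa" can also be split in the middle.

```java
import java.util.ArrayList;
import java.util.List;

final class WordSplitter {
    /** Returns every (left, right) split of the word, longest left part first. */
    static List<String[]> splits(String word) {
        List<String[]> result = new ArrayList<String[]>();
        for (int k = word.length() - 1; k >= 1; k--) {
            result.add(new String[] { word.substring(0, k), word.substring(k) });
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints the four splits listed in Step2: kast u, kas tu, ka stu, k astu.
        for (String[] pair : splits("kastu")) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```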

Step3: we now have two parts: left, right (e.g. ka, stu).

1) Set a variable flag to false.

2) Read the rules from the rule file named after the starting letter of right (here it opens s.txt, since "stu" starts with the letter s).

3) Try applying each of the sandhi rules present in that rule file (here, the sandhi rules in s.txt) on left and right (i.e. on ka and stu).

4) If one or more of the rules listed in the rule file are applicable, set flag to true.

Step4: how to apply a rule?

a) Extract effectOn, result, left and right from the rule.

b) If the received secondString starts with result, try to apply the rule result = left + right. Here result is contained in secondString, so extract the right part by removing result from the start of secondString. E.g. here secondString (stu) starts with result (s), so the rule f--s--H|--t|th| has a chance of performing sandhi wichched.

c) Separate left-result-right, e.g. ka-s-tu for kastu:

leftGuess : ka

result : s

rightGuess : tu

d) Check the certainty of sandhi formation by applying the rules.

e) Check the other chances/probability of sandhi formation:

• if nothing is present in rightGuess, there is no sandhi, so return false;

• if rightGuess, leftGuess and result are all present in the database, there is a chance of sandhi vigrah;

• if leftWord is present, there is still some chance of sandhi vigrah, since markers may not be present in the database.
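Steps b) and c) above, together with the "nothing in rightGuess" check of step e), can be sketched as a single helper. The names RuleMatch and decompose are hypothetical, and returning null for "rule not applicable" is a simplification chosen for this sketch.

```java
final class RuleMatch {
    /**
     * If rightPart starts with the rule's result, returns {leftGuess, result, rightGuess};
     * otherwise returns null, meaning the rule cannot apply at this split.
     */
    static String[] decompose(String leftPart, String rightPart, String result) {
        if (!rightPart.startsWith(result)) {
            return null;
        }
        String rightGuess = rightPart.substring(result.length());
        if (rightGuess.isEmpty()) {
            return null; // nothing is present in rightGuess: no sandhi
        }
        return new String[] { leftPart, result, rightGuess };
    }
}
```

With left "ka", right "stu" and result "s" (from the rule f--s--H|--t|th|), this gives leftGuess "ka", result "s" and rightGuess "tu".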

How to check the certainty of sandhi (step 4.d)?

As stated earlier, during the formation of sandhi the effect may be on the first word, on the second word, or on both words.

a) Effect on the first word:

rule example: f--y--i|ii|I|--a|aa|A|u|uu|U|e|ai|o|au|aM|aH|RRi|R^i|R^I|L^i|L^I|

If forming the sandhi affects only the first word and the second word remains as it is, then append something (depending on the rule) to leftGuess and make no changes to rightGuess.

Check left + right = result: if the rightGuess word exists in the database, also check the right part of the rule to see whether f--result--left--right can be applied (both sandhi wigrah and sandhi formation under the given rule should be checked). Then, depending on the presence of the left-hand side, one may declare the result:

left + right = sandhi

Even if the left word is not present, rightGuess is present in the database, so there is a chance of sandhi vigrah; record it in tryResult.txt. Since effectOn == 'f', something has to be appended to leftGuess before checking its presence in the database.

If rightGuess is already present in the database and leftGuess+append is also present in the database, one can be sure of the sandhi vigrah. Even if leftGuess+append is NOT present in the database, one can still be sure of the sandhi vigrah if sandhi wichched is possible on leftGuess+append itself, e.g. for gaNitakaavyayormadhuramelanam + avalokyate: gaNitakaavyayormadhuramelanam is not present in the database, but gaNitakaavyayoH + madhuramelanam is possible.

b) Effect on the second word:

rule example: s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|

If forming the sandhi affects only the second word and the first word remains as it is, then prefix something (depending on the rule) to rightGuess and make no changes to leftGuess.

Check left + right = result: if the right word (prefix+rightGuess) exists in the database, check whether s--result--left--right can be applied; then, depending on the presence of the left-hand side, one may declare the result. Since effectOn == 's', something has to be prefixed to rightGuess before checking its presence in the database.

Even if the leftGuess word is not present, the right word is present in the database, so there is a chance of sandhi vigrah; record it in tryResult.txt.

Also check that leftGuess ends with one of the strings present in the left part of the rule, since sandhi vigrah and sandhi formation should both be checked.

rule: effectOn--result--left--right

left, right: orSeparatedString (string|string|......|string|)

e.g. for the rule s--Dh--Sh|shh|T|Th|D|Dh|N|--dh|

left: Sh|shh|T|Th|D|Dh|N|

e.g. leftGuess sheshh ends with shh (which is present in Sh|shh|T|Th|D|Dh|N|).
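The check against an OR-separated string described above can be sketched as two small helpers. The names OrSeparated, endsWithAny and startsWithAny are hypothetical; the comparison is case-sensitive because the roman representation distinguishes, for example, "d" from "D".

```java
final class OrSeparated {
    /** True if word ends with one of the alternatives in "string|string|...|string|". */
    static boolean endsWithAny(String word, String orSeparated) {
        for (String alt : orSeparated.split("\\|")) {
            if (!alt.isEmpty() && word.endsWith(alt)) {
                return true;
            }
        }
        return false;
    }

    /** True if word starts with one of the alternatives in "string|string|...|string|". */
    static boolean startsWithAny(String word, String orSeparated) {
        for (String alt : orSeparated.split("\\|")) {
            if (!alt.isEmpty() && word.startsWith(alt)) {
                return true;
            }
        }
        return false;
    }
}
```

For instance, endsWithAny("sheshh", "Sh|shh|T|Th|D|Dh|N|") is true because "sheshh" ends with "shh".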

c) Effect on both words:

rule example: b--e--a|aa|A|--i|ii|I|

If forming the sandhi affects both the first and the second word, then (depending on the rule) append something to leftGuess and prefix something to rightGuess.

For each potential prefix for rightGuess, append a postfix (taken from the rule) to the left part and check the presence of (leftGuess+append) and (add+rightGuess) in the database.

Extract the prefixes for rightGuess from the rule's right part:

rule : b--e--a|aa|A|--i|ii|I|

right: i|ii|I|

potential prefix: i, ii, I

Then, for each prefix for rightGuess, extract the postfixes for leftGuess from the rule's left part:

rule : b--e--a|aa|A|--i|ii|I|

left : a|aa|A|

potential postfix/append: a, aa, A

If (add+rightGuess) is already present in the database and (leftGuess+append) is also present in the database, one can be sure of the sandhi vigrah; declare the result:

left + right = sandhi

Even if the left word is not present, (add+rightGuess) is present in the database, so there is a chance of sandhi vigrah; record it in tryResult.txt.

Sandhi algorithm in a nutshell:

1. Traverse the string in reverse order.

2. For k = n to 1, break the string at the kth position into two parts: leftBreakStr, rightBreakStr.

i) From the starting character of rightBreakStr, get the rule file.

ii) Try to apply each rule contained in this rule file:

-- extract the result part of the rule from rightBreakStr;

-- find the affected substrings: leftStr = leftBreakStr with something (left) appended, rightStr = rightBreakStr with something (right) added in front;

-- if RootDatabase(rightStr): if RootDatabase(leftStr) then set sandhiPossible to true; if sandhiWichched(leftStr) then set sandhiPossible to true.
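The nutshell algorithm above can be sketched end to end as follows. This is only an illustration under stated assumptions: the rule files and the root database are replaced by small in-memory stand-ins (RULES and ROOTS are hypothetical and contain just the example rules and words quoted in this report), and the guard on the recursive call is an addition of this sketch to guarantee termination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SandhiSketch {
    // Hypothetical in-memory stand-ins for the rule files and the root-word database.
    static final String[] RULES = { "b--ii--i--i", "f--s--H|--t|th|", "b--e--a|aa|A|--i|ii|I|" };
    static final Set<String> ROOTS =
            new HashSet<String>(Arrays.asList("ravi", "indra", "kaH", "tu"));

    /** Returns every "left + right" split of word supported by the rules and the database. */
    static List<String> vigraha(String word) {
        List<String> found = new ArrayList<String>();
        for (int k = word.length() - 1; k >= 1; k--) {        // break from the end
            String leftBreak = word.substring(0, k);
            String rightBreak = word.substring(k);
            for (String line : RULES) {
                String[] r = line.split("--");                // effectOn, result, left, right
                if (!rightBreak.startsWith(r[1])) continue;
                String rightGuess = rightBreak.substring(r[1].length());
                if (rightGuess.isEmpty()) continue;           // nothing on the right: no sandhi
                char effectOn = r[0].charAt(0);
                // The unchanged side must still fit the rule's context.
                if (effectOn == 'f' && !matchesAny(rightGuess, r[3], false)) continue;
                if (effectOn == 's' && !matchesAny(leftBreak, r[2], true)) continue;
                // A word that stays unchanged gets an empty append or prefix.
                String[] appends = effectOn == 's' ? new String[] { "" } : r[2].split("\\|");
                String[] prefixes = effectOn == 'f' ? new String[] { "" } : r[3].split("\\|");
                for (String prefix : prefixes) {
                    String rightWord = prefix + rightGuess;
                    if (!ROOTS.contains(rightWord)) continue;
                    for (String append : appends) {
                        String leftWord = leftBreak + append;
                        boolean leftOk = ROOTS.contains(leftWord)
                                || (leftWord.length() < word.length()   // termination guard
                                    && !vigraha(leftWord).isEmpty());
                        if (leftOk) found.add(leftWord + " + " + rightWord);
                    }
                }
            }
        }
        return found;
    }

    static boolean matchesAny(String word, String orSeparated, boolean atEnd) {
        for (String alt : orSeparated.split("\\|")) {
            if (atEnd ? word.endsWith(alt) : word.startsWith(alt)) return true;
        }
        return false;
    }
}
```

With these stand-ins, vigraha("raviindra") finds "ravi + indra" and vigraha("kastu") finds "kaH + tu". The sketch does not model the tryResult.txt bookkeeping for merely probable results.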


Sample Output :

Vibhakti (विभक्ति)

Vibhaktis are helpful in finding gender, form and markers.

Vibhakti

Although Sanskrit has only six karaks, as compared to 8 in Hindi, we consider all 8 vibhaktis (numbered 1 to 8).

vachan (वचन)

In Sanskrit there are 3 forms/vachans (वचन): एकवचन (singular), द्विवचन (dual) and बहुवचन (plural), as compared to 2 in Hindi.

Genders (लिंग)

Again, in Sanskrit there are 3 genders (लिंग): स्त्रीलिंग (feminine), पुल्लिंग (masculine) and नपुंसकलिंग (neuter), as compared to two in Hindi.

Approach for vibhakti (विभक्ति)

The first thing is to detect the markers; once the marker is detected, the work is done. By marker I mean the following.

Consider "goes". Here the root word is "go" and the marker is "es". Similarly, in the word "going" the root word is "go" and the marker is "ing". In the same way, markers are attached to the root word in Sanskrit, e.g. in the word "munibhyaam" (मुनिभ्याम्) the root word is "muni" (मुनि) and the marker is "bhyaam" (भ्याम्).

e.g. consider the following forms for तृतीया (third vibhakti):

एकवचन: मुनिना (muninaa), muni -- naa

द्विवचन: मुनिभ्याम् (munibhyaam), muni -- bhyaam

बहुवचन: मुनिभिः (munibhiH), muni -- bhiH

Here naa, bhyaam and bhiH are the markers.

So if I encounter "bhyaam" in the given word, I can say that it is तृतीया विभक्ति, द्विवचन (third vibhakti, dual).

Note: मुनिभ्याम् (munibhyaam) may occur at three places, i.e. for muni (मुनि), मुनिभ्याम् occurs at तृतीया, चतुर्थी and पञ्चमी (third, fourth and fifth vibhakti). So these cases can't be resolved at word level (they can be sorted out at sentence level only).

Vibhakti Algorithm

The algorithm to handle vibhakti is almost the same as that for sandhi, but here we have to consider the following representation:

vachan : e d b (एकवचन, द्विवचन, बहुवचन)

vibhakti : 1 2 3 4 5 6 7 8

ling : m f n (पुल्लिंग, स्त्रीलिंग, नपुंसकलिंग)

rule : extract--add--vibhakti--vachan--ling--ending swar

e.g. for munibhyaam: ibhyaam--i--3|4|5--d--m--i

i.e. if you encounter the word munibhyaam, extract "ibhyaam", then add "i" to get the root word.
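To make the vibhakti rule format concrete, here is a minimal sketch that parses one rule line and applies it to a word. The class VibhaktiRule and the method rootOf are hypothetical names; treating "$" in the add field as "add nothing" follows the convention this report describes for the add part of a rule.

```java
/** Hypothetical holder for one rule: extract--add--vibhakti--vachan--ling--ending. */
final class VibhaktiRule {
    final String extract;   // marker to remove, e.g. "ibhyaam"
    final String add;       // what to put back to obtain the root, e.g. "i"
    final String vibhakti;  // OR-separated case numbers, e.g. "3|4|5"
    final char vachan;      // e, d or b
    final char ling;        // m, f or n
    final String ending;    // ending swar of the root

    private VibhaktiRule(String[] p) {
        extract = p[0];
        add = "$".equals(p[1]) ? "" : p[1];   // "$" stands for an empty add part
        vibhakti = p[2];
        vachan = p[3].charAt(0);
        ling = p[4].charAt(0);
        ending = p[5];
    }

    static VibhaktiRule parse(String line) {
        String[] p = line.trim().split("--");
        if (p.length != 6 || p[3].isEmpty() || p[4].isEmpty()) {
            throw new IllegalArgumentException("Malformed rule: " + line);
        }
        return new VibhaktiRule(p);
    }

    /** Returns the candidate root if the word ends with the marker, else null. */
    String rootOf(String word) {
        if (!word.endsWith(extract) || word.length() == extract.length()) {
            return null;
        }
        return word.substring(0, word.length() - extract.length()) + add;
    }
}
```

For example, VibhaktiRule.parse("ibhyaam--i--3|4|5--d--m--i").rootOf("munibhyaam") gives "muni". The candidate root would still have to be looked up in the word database before the result is declared.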

Information gain:

karak : तृतीया, चतुर्थी, पञ्चमी (3, 4, 5)

vachan : द्विवचन (dual)

gender : masculine, with ending letter i, i.e. इकारान्त पुल्लिंग शब्द (a masculine word ending in i)

Storage used:

I have stored the rules in files indexed on their starting letter, i.e. for the above case, ibhyaam--i--3|4|5--d--m--i, the rule is in the file "i.txt". Similarly, words/root words are stored in files indexed on their starting letter, i.e. raviindra is stored in "r.txt". Only one word per line is stored in the corresponding file.

Steps:

Step1: receive a word for detecting karak, ling and vachan.

Step2: try to extract the marker from the word. Since markers are appended at the end, process each letter of the word in reverse order and try to separate the root from the marker.

a) Split the word into two parts at the jth index, e.g. for ramAH:

ramA H

ram AH

ra mAH

r amAH

b) Obtain all the rules that can be applied to the given word. How to obtain the rules? E.g. for ramA H, open the file H.txt, get all the possible rules from it and store them in the array rule[].

c) Iteratively apply all the rules stored in the array rule[].

rule format : result--add--vibhakti--vachan--ling--ending word

e.g. AH--a--1|8--b--m--a

vachan : e d b

vibhakti : 1 2 3 4 5 6 7 8

ling : m f n

d) Get the starting index and ending index of the part to be added (the add part of the rule format) and extract the add part from the rule.

rule: AH--a--1|8--b--m--a

add: a

e) Check whether the part to be added is $ (empty); if so, use a blank instead of $.

f) Add that part to the left part of the query word.

first: ram

second: AH

rule: AH--a--1|8--b--m--a

add: a

firstRes: rama

g) Check whether firstRes is in the DATABASE. If the word is found, project the result, i.e. ling, vachan and karak: vibhakti is possible, so break the corresponding rule into the result.

rule: AH--a--1|8--b--m--a

e.g.

root: rama

karak: 1|8

vachan: bahuvachan

ling: male

Algorithmic steps:

Step1: fetch the word.

Step2: extract the vibhakti, e.g. for raamaabhyaam the possible markers are am, aam, ..., aabhyaam. Extract using the last index, i.e. m (so it is better to store the rules based on the last letter).

if you extract am and add a, the left part is raamaabhyaa: not found in db

if you extract aam and add a, the left part is raamaabhya: not found in db

if you extract aabhyaam and add a, the left part is raama: found in db

==> declare its ling, vachan and karak based on its ending, i.e. based on its roop: akaaraant pulling shabda, aakaaraant pulling shabda, ikaaraant pulling shabda, etc.
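The marker search in these steps can be sketched as a loop over ever longer suffixes of the word. The rule table and root set below are hypothetical in-memory stand-ins, and the rule for "aabhyaam" is an assumption made for this illustration, modelled on the ibhyaam rule quoted in this report.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class VibhaktiScan {
    // Hypothetical rules keyed by marker; value = {add, vibhakti, vachan, ling}.
    static final Map<String, String[]> RULES = new HashMap<String, String[]>();
    // Hypothetical stand-in for the root-word database.
    static final Set<String> ROOTS = new HashSet<String>(Arrays.asList("raama", "muni"));

    static {
        RULES.put("aabhyaam", new String[] { "a", "3|4|5", "d", "m" }); // assumed rule
        RULES.put("ibhyaam", new String[] { "i", "3|4|5", "d", "m" });
    }

    /** Tries ever longer suffixes as the marker; returns "root karak vachan ling" or null. */
    static String analyse(String word) {
        for (int j = word.length() - 1; j >= 1; j--) {
            String marker = word.substring(j);
            String[] rule = RULES.get(marker);
            if (rule == null) {
                continue;                       // no rule for this suffix
            }
            String root = word.substring(0, j) + rule[0];
            if (ROOTS.contains(root)) {         // root found in db: declare the result
                return root + " " + rule[1] + " " + rule[2] + " " + rule[3];
            }
        }
        return null;
    }
}
```

Here analyse("raamaabhyaam") skips the shorter suffixes m, am, aam and so on because no rule or root matches, and succeeds once the marker "aabhyaam" is reached, giving the root "raama".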

Requirement Analysis:

Hardware:

CPU with a minimum of 2.40 GHz

256 MB of RAM

Software:

JDK version 1.4.2 or above

Itranslator.exe (freely available for download)

Link: http://www.omkarananda-ashram.org/Sanskrit/itranslator99.htm

How to run:

Package : sandhi.sandhi

Src/sandhi/sandhi : contains all the Java files

1) frame.java

2) Sanskrit.java

3) Sandhi.java

4) Vibhakti.java

5) Utils.java

Data: the data directory contains

1) Database (root + others)

2) sandhiRules

3) vibhaktiRules

Main function() : in sandhi.sandhi.frame.java. Load and run frame.java, i.e. sandhi.sandhi.frame.java.

Input / Output:

Input format: roman representation of Devanagari, e.g. raama for राम.

Step1:

Single word input: either enter the text into the text box, or select a word from the list (on the left-hand side).

File input: browse to a file in which the Sanskrit words are kept. There shouldn't be any blank line in the file.

Step2: press sandhi or vibhakti as desired.

NOTE: processing a word may take a second or two, so please wait. If the processing is done on a file, please wait for around 2-3 minutes.

Output:

The output is displayed in the output window (right side of the frame). The result of the query is actually recorded in result.txt or in tryResult.txt:

result.txt: if the desired operation could be performed with 100% certainty.

tryResult.txt: if there is only a chance/probability of the operation being performed.

To view the result in Devanagari form, follow these steps:

Step1: open Itranslator.exe.

Step2: browse to the file result.txt / tryResult.txt (both present in the current directory of the project).

Use-Case diagram:

USE-CASE : 1

Use-case description:

getHelp

action performed: guides the user in using the software.

browseFile

action performed: browses to a file for sandhi wigrah/vibhakti.

Sandhi

pre condition: a word is selected or entered for getting the desired work done, OR a file is selected.

action performed: calls sandhiOrVibhakti(1) //triggerSandhi

post condition: gets the result in the output window.

vibhakti

pre condition: a word is selected or entered for getting the desired work done, OR a file is selected.

action performed: calls sandhiOrVibhakti(2) //triggerVibhakti

post condition: gets the result in the output window.

uploadWordDB

pre condition: a word has been entered in the corresponding box.

action performed: uploads a word into the list database.

post condition: a word is appended to words.txt.

uploadRootWord

pre condition: a word has been entered in the corresponding box.

action performed: uploads a word into dictfile.

post condition: a word is appended to dictfile.txt.

JList1

action performed: gets the list of words in the list (left-side large window).

output

action performed: shows the result in the output window (right-side large window).

USE-CASE : 2

Use-Case 2 Description:

Note (linguistics):

sandhi is possible: the word can be broken up into two or more distinct words.

vibhakti is possible: given the word, one can get its karak, vachan and ling.

vibhakti is probable: given the word, there is some chance that one may detect its karak, vachan and ling.

trigger

pre conditions: 1) a word (NOT NULL) has been received; 2) either sandhi or vibhakti has been selected (through the GUI).

action performed: triggers either "triggerSandhi" or "triggerVibhakti".

post condition: either sandhi wichched or vibhakti analysis is performed.

triggerSandhi

pre condition: a word has been received from the function "trigger".

action performed:

1) creates a log file ("vigrahNotPossible.txt") for keeping track of those subwords on which sandhi couldn't be performed (to increase the efficiency of performing sandhi wigrah) [see details of increasing efficiency];

2) calls sandhiWichched(word) to perform sandhi wigrah on the given word;

3) records the result of the action in result.txt (if sandhi wigrah has been performed successfully, it retrieves the result from result.txt; else it writes word + " couldn't be performed" to result.txt).

post condition: the result of the action has been recorded.

triggerVibhakti

pre condition: a word has been received from the function "trigger".

action performed:

1) calls getRoop(word) to get karak, vachan and ling;

2) records the result of the action in result.txt (if vibhakti is possible, OR if there is only some chance (probability) of vibhakti).

post condition: the result of the action has been recorded.

getResult

pre condition: the name of the file from which the result is to be extracted has been received.

action performed:

1) checks whether "couldn't be formed" is written in "result.txt";

2) if result.txt doesn't contain "couldn't be formed", the desired operation has been performed, so the result is retrieved. Since sandhi wichched is a recursive process, the result is retrieved in reverse order, e.g. kastvam = kastu + am and kastu = kaH + tu. While writing the result to result.txt, kaH + tu is written first and then kastu + am. But in order to present the result to the user in an easily readable manner, kastu + am should be shown first, then kaH + tu;

3) if result.txt contains "couldn't be formed", the desired operation couldn't be performed, so the probable results recorded in "tryResult.txt" are shown.

post condition: the user (GUI/frame) gets the desired result.

USE-CASE : 3

USE-CASE:3 Description

sandhiWichhed

pre condition: receives a word for doing sandhi wichched.

action performed:

1) tries iteratively to break the word into two parts (left and right), e.g. if the word is kastu then: kast u, kas tu, ka stu, k astu;

2) tries sandhi wichched on each split. (Note: sandhi wigrah is word = leftWord + rightWord, i.e. the ending of leftWord and the start of rightWord are joined to form the sandhi. To detect the effect of this joining, I assumed that the affected part is in the right part, e.g. kastu = kaH + tu. So when I receive "kastu" for sandhi wichched, I take left as "ka" and right as "stu", and pass "ka" (left) and "stu" (right) on for sandhi wichched.)

post condition: returns true if sandhi vigrah is possible, else returns false.

trySandhiwichhed

/* sandhi rule format: effectOn--result--left--right e.g. f--s--H|--t|th| */

pre condition: receives two words: left, right (e.g. ka, stu).

action performed:

1) sets a variable flag to false;

2) reads the rules from the rule file named after the starting letter of right (here it opens s.txt, since "stu" starts with the letter s);

3) calls the function applyRule() to apply each of the sandhi rules present in the rule file (here, the sandhi rules in s.txt) on left and right (i.e. on ka and stu);

4) if one or more of the rules listed in the rule file are applicable, sets flag to true.

post condition: returns flag (true if sandhi wigrah is possible by applying one of the rules present in the rule file, else false).

applyRule

pre condition: receives one rule, firstString and secondString, e.g.

rule : f--s--H|--t|th|

first : ka

second: stu

action performed: tries to apply the rule on firstString and secondString.

post condition: returns true if the rule is applicable, else returns false.

checkWordInDB

pre condition: receives a word.

action performed: checks whether the word is present in the database.

post condition: returns true if the word is present in the database, else returns false.

USE-CASE : 4

USE-CASE:4 Description

getRoop

pre condition: the boolean variables probable and possible are set to false; receives a word for detecting karak, ling and vachan.

action performed: applies the rules to extract the root word and its karak, ling and vachan.

post condition: sets the status by turning on either or both of the boolean variables (possible, probable).

breakRuleIntoResult

pre condition: receives the rule, the first word, the second word and the status (probableOrPossible).

action performed: breaks (i.e. processes) the rule and builds up the result.

/* rule : AH--a--1|8--b--m--a first : ram second: AH e.g. root: rama karak: 1|8 vachan: bahuvachan ling: male */

post condition: the result is recorded in result.txt (if vibhakti is possible); otherwise the result is recorded in tryResult.txt (vibhakti is probable).

getProbable

action performed: returns the status of chances of vibhakti root.

getPossible

action performed: returns the status of chances of vibhakti root.

USE-CASE : 5

USE-CASE:5 Description

attchPackagePath

pre condition: receives a filename to which the pathname of the package is to be attached.

action performed: attaches the relative path of the package to the filename.

post condition: the relative path of the file is returned.

createFile

pre conditions: 1) receives a file name for creating the file; 2) receives a string which is to be written to the file (may be "", i.e. blank).

action performed: creates a file with the name provided and appends the string to the file.

post condition: the file has been created and the string has been appended.

breakIntoFile (for administrator)

pre conditions:

1) receives a file which is to be broken into subfiles based on some parameter (parameter: read each line and, based on the starting letter of the line, put that line in the file named startingLetter.txt);

2) receives the folder name in which the subfiles are to be stored.

action performed: reads the file fileName line by line and appends each line to the subfile chosen by the starting letter of the line (used mainly for organising the database/rules based on starting letters).

post condition: the file has been broken into subfiles.
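The behaviour described for breakIntoFile can be sketched with the standard java.nio.file API. The class and method names below are hypothetical, and the code targets a newer JDK than the 1.4.2 mentioned in the requirements. One caveat, stated as an assumption: on case-insensitive file systems, subfiles such as "d.txt" and "D.txt" would collide, which matters because the roman representation distinguishes upper and lower case.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Collections;

final class FileBreaker {
    /** Appends each non-blank line of source to folder/startingLetter.txt. */
    static void breakIntoFiles(Path source, Path folder) throws IOException {
        Files.createDirectories(folder);
        for (String line : Files.readAllLines(source, StandardCharsets.UTF_8)) {
            String entry = line.trim();
            if (entry.isEmpty()) {
                continue;                        // skip blank lines
            }
            Path target = folder.resolve(entry.substring(0, 1) + ".txt");
            // One word (or rule) per line, appended to the matching subfile.
            Files.write(target, Collections.singletonList(entry), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

A word such as raviindra would end up in r.txt, and a rule such as b--ii--i--i would need to be keyed on its result part rather than on the first character of the line, so a rule file would need a slightly different key than this line-based sketch uses.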

getContentWholeFile

pre condition: receives the fileName from which the content is to be retrieved.

action performed: reads the file line by line and concatenates the lines into one string.

post condition: a string containing the content of the whole file is returned.

appendToFile

pre condition: receives the filename and the string to be appended to the file.

action performed: appends the string to the file.

post condition: the string is appended to the file.

checkWordInFile

pre condition: receives the filename and the word to be checked in the file (the file contains only one word per line).

action performed: reads each word from the file and compares it with the word received.

post condition: returns true if the word is present in the file, else returns false.

checkAndFetchWordInRuleFile (used for vibhakti only)

Note: rule format for vibhakti: result--add--vibhakti--vachan--ling--endingWord

pre condition: receives the filename (rule file) and the word (the rule's result part) to be checked in the file (the file contains only one rule per line).

action performed: reads each rule from the file and compares its result part with the word (rule's result) received.

post condition: returns true if the rule (corresponding to the result part) is present in the rule file.

checkWordInRightOrSeparatedWord (used for sandhi only)

Note: rule format for sandhi in the text rule file: effectOn--result--left--right

effectOn: character (f: result obtained by making changes in the first word; s: in the second word; b: in both)

result: string

left: OR-separated strings string|string|......|string|

right: OR-separated strings string|string|......|string|

pre condition: receives a word and an orSeparatedString (string|string|......|string|).

action performed: checks whether the word starts with one of the strings present in orSeparatedString.

post condition: returns true (if it starts with one of them) or false (if it doesn't).

checkWordInRightOrSeparatedWord (used for sandhi only)

Note: the rule format for sandhi is the same as above.

pre condition: receives a word and an orSeparatedString (string|string|......|string|).

action performed: checks whether the word ends with one of the strings present in orSeparatedString.

post condition: returns true (if it ends with one of them) or false (if it doesn't).

CLASS-DIAGRAM :

Description


Architecture Overview

[Architecture diagram: Query and Result pass through the Sanskrit Server, which dispatches to the Vibhakti and Sandhi modules. Each module uses its own utilities (Vibhakti Utils, Sandhi Utils) and accesses the Database (root words and non-root words, indexed on starting letter) and the Vibhakti/Sandhi Rules (indexed on starting letter).]

User queries (through the GUI) are handled by the Sanskrit Server. Depending on whether sandhi or vibhakti is to be performed, it passes the query to the respective server. The respective server, with the help of its utilities, interacts with the rules database and the word database, does some processing and gives the result back to the Sanskrit Server, which passes it on to the result server.

NOTE: these servers (right now a Java program) may later be implemented in a distributed manner.

Risk Involved:

The analyser is unable to handle Sanskrit words like अत एवास्मि, i.e. Sanskrit words with blanks. If the input is through a file, the file shouldn't contain any blank line.

References

Books referred:

• Sanskrit path pradarshak [Shanti Swaroop Dixit & Dr. Milaap Singh]

• Sugham sanskrit vyakaran [Chandrakant Jhaa]

• Natural language processing [Akshar Bharati, Vineet Chaitanya, Rajeev Sangal]

Web references:

• http://sanskritdeepika.org

• http://www.omkarananda-ashram.org

• http://www.gitasupersite.iitk.ac.in