IndoWordNet Database Design

Presented By: Konkani NLP Team

Goa University


Brief Outline

• Objectives

• Background

• Requirements

• Proposed database design
  – Database design details
  – Issues to be resolved
  – Tools and Scripts

• APIs
  – IndoWordNet API
  – Layers of API
  – Class Diagram
  – Sample API code

Objectives

• To finalize the database design.

• To finalize the tools and scripts necessary for distributing the database.

• API design.

• API demonstration.


Background

• IndoWordNet is a multilingual WordNet that links WordNets of different Indian languages.

• A WordNet is a crucial resource for a language which aids in NLP tasks such as Machine Translation, Information Retrieval, etc.

• A database is necessary to maintain the data for one or more WordNets.

• The database needs to support the development of online and offline applications.

Requirements

• Database design should accommodate multiple languages.

• Store synsets of different languages.

• Store semantic relations.

• Store lexical relationships.

• Store ontological details.

• Allow any additional information to be stored for each synset.

• Avoid duplication of data.

• Open, scalable, modular design.

• Independent of storage technology.

Proposed Database Design

• Software platform: the reference implementation uses MySQL.
  – MySQL is freely available.
  – It is supported on Windows and Linux.

Database design details

– wordnet_master
  • Contains language-independent data.

– wordnet_<respective_language>
  • Contains language-dependent data.
  • Contains the synset data for a language.

– wordnet_admin
  • Contains data necessary for administrative purposes.

wordnet_master

• The wordnet_master maintains the data shared by all the languages.

• The wordnet_master includes tables for semantic relations.

• It will include all ontology related tables in English.

• The language specific data will be available in the wordnet_<respective_language> database.

List of tables in wordnet_master

– wn_master_category
  • To maintain the grammatical categories such as noun, verb, etc.

– wn_master_language
  • To maintain the language information in a database.

– wn_master_language_lss_range
  • To maintain the language-specific synset range for the given language.

– wn_master_synset_file
  • To associate a file with a synset.

Tables for maintaining semantic relations

• wn_rel_hypernymy_hyponymy
  – To maintain the hypernymy and hyponymy relations, an IS-A-KIND-OF type of semantic relationship between synsets.

• wn_rel_meronymy_holonymy
  – To maintain the meronymy and holonymy relations, a PART-WHOLE type of semantic relationship between synsets.

• wn_rel_troponymy
  – To maintain the troponymy type of semantic relationship between synsets.

• wn_rel_causative
  – To maintain the causative semantic relation between synsets.

• wn_rel_entailment
  – To maintain the entailment type of semantic relationship between synsets.

• wn_rel_similar
  – To maintain the relation between similar synsets.

• wn_rel_also_see
  – To maintain relations between synsets other than the regular semantic relations.

• wn_rel_noun_verb_link
  – To maintain the semantic relation between a noun synset and its associated verb synset.

• wn_rel_noun_adjective_attribute_link
  – To maintain the semantic relation between a noun synset and the associated adjective attribute that go together.

• wn_rel_adjective_modifies_noun
  – To maintain the semantic relation between an adjective synset and the noun synset it modifies.

• wn_rel_adverb_modifies_verb
  – To maintain the semantic relation between an adverb synset and the verb synset it modifies.

• wn_rel_near_synsets
  – To maintain the near-synset relation between synsets.

• wn_property_antonymy_gradation
  – To maintain the properties of a relation; for example, an antonymy relation can have a property such as colour, gender, etc.

• wn_property_meronymy_holonymy
  – To maintain the properties of relations such as meronymy and holonymy, for example component-object, feature-activity, etc.

• wn_relation_types
  – To maintain the relation information of all the relation tables.

• wn_semantic_relations
  – To maintain the semantic relations w.r.t. the synsets.
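The paired table names (for example wn_rel_hypernymy_hyponymy) suggest that one table can serve both directions of a relation. The sketch below illustrates that idea in memory, assuming each row stores a (hyponym, hypernym) pair of synset IDs. The class name, method names and row layout are hypothetical and are not taken from the actual schema.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a table such as wn_rel_hypernymy_hyponymy.
final class HypernymyTable {
    // Each row is {hyponymId, hypernymId}: the first synset IS-A-KIND-OF the second.
    private final List<long[]> rows = new ArrayList<>();

    // Records one IS-A-KIND-OF link between two synsets.
    void link(long hyponymId, long hypernymId) {
        rows.add(new long[] {hyponymId, hypernymId});
    }

    // Hypernyms: read the rows from the hyponym side.
    List<Long> hypernymsOf(long synsetId) {
        List<Long> result = new ArrayList<>();
        for (long[] row : rows) {
            if (row[0] == synsetId) {
                result.add(row[1]);
            }
        }
        return result;
    }

    // Hyponyms: read the same rows from the hypernym side.
    List<Long> hyponymsOf(long synsetId) {
        List<Long> result = new ArrayList<>();
        for (long[] row : rows) {
            if (row[1] == synsetId) {
                result.add(row[0]);
            }
        }
        return result;
    }
}
```

Storing each link once and querying it from either side avoids duplicated rows, which matches the "avoid duplication of data" requirement.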

Tables for maintaining ontology relation

• wn_ontology_nodes
  – To maintain the different ontology types or positions (common information, in English).

• wn_ontology_tree
  – To maintain the hierarchical relationship of the ontology types.
  – The root node in the ontology hierarchy has id value = 1.

• wn_ontology_synset_map
  – To link a synset/concept to a particular position in the ontology.
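The hierarchy kept in wn_ontology_tree can be pictured as a child-to-parent mapping that ends at the root node with id 1. The sketch below walks from any node up to the root. Only the root id comes from the slides; the class name, method names and node ids are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory view of an ontology hierarchy such as wn_ontology_tree.
final class OntologyTree {
    static final int ROOT_ID = 1; // the slides state that the root node has id 1

    // child node id -> parent node id
    private final Map<Integer, Integer> parentOf = new HashMap<>();

    // Adds a new node under an existing parent; rejecting re-adds keeps the tree acyclic.
    void addChild(int parentId, int childId) {
        if (childId == ROOT_ID || parentOf.containsKey(childId)) {
            throw new IllegalArgumentException("Node already exists: " + childId);
        }
        if (parentId != ROOT_ID && !parentOf.containsKey(parentId)) {
            throw new IllegalArgumentException("Unknown parent: " + parentId);
        }
        parentOf.put(childId, parentId);
    }

    // Returns the node ids from the given node up to and including the root.
    List<Integer> pathToRoot(int nodeId) {
        List<Integer> path = new ArrayList<>();
        int current = nodeId;
        path.add(current);
        while (current != ROOT_ID) {
            Integer parent = parentOf.get(current);
            if (parent == null) {
                throw new IllegalArgumentException("Unknown node: " + current);
            }
            current = parent;
            path.add(current);
        }
        return path;
    }
}
```

A synset mapped to a node through wn_ontology_synset_map would inherit every ontology type on that path.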


wordnet_<respective_language>

• The wordnet_<respective_language> database will keep tables which will have information related to the particular language.

• It will include tables to keep synset details, words in the language, examples, etc.

• <respective_language> is to be replaced by the name of a language in the IndoWordNet group, e.g. Assamese, Hindi, Konkani, Oriya, Punjabi or Urdu, as applicable.

• Example: wordnet_bodo
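The naming convention can be captured in a small helper. This is a minimal sketch that assumes the database name is the lower-cased language name prefixed with "wordnet_", as the wordnet_bodo example suggests. The class and method names are hypothetical and not part of the IndoWordNet API.

```java
import java.util.Locale;

// Hypothetical helper for the database naming convention described on the slides.
final class WordNetDatabaseNames {
    static final String MASTER = "wordnet_master";
    static final String ADMIN = "wordnet_admin";

    private WordNetDatabaseNames() {}

    // Builds "wordnet_<language>" from a language name such as "Bodo".
    static String forLanguage(String language) {
        String cleaned = language.trim().toLowerCase(Locale.ROOT);
        // Assumption: language names consist of plain ASCII letters only.
        if (!cleaned.matches("[a-z]+")) {
            throw new IllegalArgumentException("Unexpected language name: " + language);
        }
        return "wordnet_" + cleaned;
    }
}
```

Rejecting anything other than letters keeps the derived name safe to use as a database identifier.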


wordnet_admin

• This database is used to keep other related tables, such as:
  – Feedback table
  – FAQ table
  – Website administration tables
  – User + password table
  – …

Fig 1: Some of the important tables that are part of the WordNet, colour-coded to distinguish language-independent data (shared by all languages) from language-dependent data (different for each language).

Issues to be Resolved

• Should the tables below be stored as language-independent data or as language-dependent data, in view of the changes in POS category reported by the language groups?
  – wn_rel_adjective_modifies_noun
  – wn_rel_adverb_modifies_verb
  – wn_rel_noun_adjective_attribute_link
  – wn_rel_noun_verb_link

• In the table ‘wn_ontology_nodes’ the data should be in English only; the data in other languages can be kept in the respective language databases.

• To be done now: approve the master tables and the <respective_language> tables of each language.

Tools & Scripts

• Tool to populate data into the various tables of the database.

• Population of data into tables such as:
  – wn_synset
  – wn_word
  – wn_synset_words
  – wn_synset_example

• Scripts to create language-specific data tables.

• Scripts to dump and restore data.

• Scripts to manage/update incremental changes made to tables in wordnet_master.
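A dump-and-restore script can be a thin wrapper around MySQL's mysqldump and mysql command-line clients. The sketch below only builds the two command lines and does not run them. The class name, method names, user name and file name are assumptions and not part of the IndoWordNet tooling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical builder for dump and restore command lines.
final class DumpCommands {
    private DumpCommands() {}

    // Command line that dumps one database into a file with mysqldump.
    static List<String> dump(String user, String database, String outputFile) {
        return new ArrayList<>(Arrays.asList(
                "mysqldump",
                "--user=" + user,
                "--password",                   // no value given, so the client prompts for it
                "--result-file=" + outputFile,  // write the dump to this file
                "--databases", database));      // also emits CREATE DATABASE and USE statements
    }

    // Command line that restores a dump; the dump file is supplied on standard input.
    static List<String> restore(String user) {
        return new ArrayList<>(Arrays.asList(
                "mysql",
                "--user=" + user,
                "--password"));
    }
}
```

For example, DumpCommands.dump("wnuser", "wordnet_master", "master.sql") yields a command that could be pasted into a shell, and the restore command would then read master.sql on standard input.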


Graphical User Interface to Populate data into the database Tables


Questions?

APIs

• An Application Programming Interface (API) is a set of commands, functions and protocols that programmers can use when building software.

• It allows programmers to use predefined functions to interact with a system, instead of writing them from scratch.

• Characteristics of a good API:
  – Easy to learn and use, hard to misuse.
  – Code that uses it is easy to read and maintain.
  – Programming-language neutral.
  – Sufficiently powerful to support all computational requirements.

IndoWordNet API

• It allows a user to use the API without knowledge of the database design.

• The API follows an object-oriented design.

• The API is designed to support single or multiple languages.

• The API design consists of two layers:
  – Application layer
  – Database layer

• The Database layer will change depending on the DBMS, but the Application layer will mostly remain unchanged.
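The two-layer split can be illustrated with an interface that stands for the Database layer and a class that stands for the Application layer. This is a sketch of the general pattern only: ConceptStore, InMemoryConceptStore and SynsetView are hypothetical names and are not classes of the IndoWordNet API.

```java
import java.util.HashMap;
import java.util.Map;

// Database layer contract (hypothetical): the only thing the application layer sees.
interface ConceptStore {
    String loadConcept(long synsetId);
    boolean saveConcept(long synsetId, String concept);
}

// One possible backend; a DBMS-backed class could implement the same interface.
final class InMemoryConceptStore implements ConceptStore {
    private final Map<Long, String> concepts = new HashMap<>();

    @Override
    public String loadConcept(long synsetId) {
        return concepts.get(synsetId); // null when the synset has no concept yet
    }

    @Override
    public boolean saveConcept(long synsetId, String concept) {
        if (concept == null || concept.trim().isEmpty()) {
            return false; // reject empty definitions
        }
        concepts.put(synsetId, concept);
        return true;
    }
}

// Application layer (hypothetical): depends on the interface, not on a DBMS.
final class SynsetView {
    private final long synsetId;
    private final ConceptStore store;

    SynsetView(long synsetId, ConceptStore store) {
        this.synsetId = synsetId;
        this.store = store;
    }

    String getConcept() {
        return store.loadConcept(synsetId);
    }

    boolean setConcept(String concept) {
        return store.saveConcept(synsetId, concept);
    }
}
```

Swapping the DBMS then means writing another ConceptStore implementation, while SynsetView and the code that uses it stay unchanged.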


Application layer

• The Application layer implements the logic of the IndoWordNet requirements. It provides classes and objects for all the operations to be performed on synsets, relations, ontology, other master data, etc.

• The reference implementation is being done in Java and PHP.

The Application layer consists of the following classes:

– IWAPIClass
  • A class that initialises the API library for use.
  • Maintains master tables.
  • Manages connectivity to language-specific databases.

– IWSynset
  • A class that represents a synset.

– IWWord
  • A class that represents a word.

– IWSynsetCollection
  • Collection of synsets.

– IWWordCollection
  • Collection of words for a synset.

– IWOntology
  • A class that represents the ontology.
  • Each synset is mapped to some place in the ontology tree.

– IWOntologyCollection
  • Collection of child nodes for a given ontology node.

– IWExampleCollection
  • Collection of examples.

– IWFile
  • A class that represents a file.

– IWDataFile
  • A class that represents a data file.

– IWPictureFile
  • A class that represents a picture file.

– IWFileCollection
  • Collection of files.

• The Application layer allows us to perform operations such as:
  – get all the synsets
  – get various relations for a given synset/word
  – get words for a given synset
  – add a new source or domain
  – add a new relation
  – update the records in the table
  – delete a synset/source/domain
  – modify ontology information

Database layer

• The Database layer deals with encapsulation of the database design.

• It provides a standard interface to the application layer.

• The Database layer supports all the operations needed to be performed on the database.

The Database layer consists of the following classes:

– IWDb
  • A class that connects to a language-dependent database.

– IWCon
  • A class that sets up a connection to a database.

– IWStatement
  • A class that contains all the queries pertaining to the application layer.
  • It also provides the basic functions such as update, delete, insert, select, etc.

– IWResult
  • A class that returns the results of executed queries to the application layer.

– IWField
  • A class that returns the proper data type to the application, irrespective of the database data type, or vice versa.

Class Diagram

Sample API code

• Set up the database connection
  – IWDb dbobject = new IWDb(IWAPIClass.Language_Name);

• Create an object for a synset
  – IWSynset synsetobject = new IWSynset(synsetID, dbobject);

• Get the concept for a synset
  – String concept = synsetobject.getConcept();

• Set the concept of a synset
  – boolean flag = synsetobject.setConcept("conceptDefinition");

• Get the word collection for a synset
  – IWWordCollection words = synsetobject.getWords();

Questions?


THANK YOU