Speech Converter Project Report
1. INTRODUCTION
Machine Translation is a great example of how cutting edge research and world class
infrastructure come together at Google. We focus our research efforts towards developing
statistical translation techniques that improve with more data and generalize well to new
languages. Our large scale computing infrastructure allows us to rapidly experiment with new
models trained on web-scale data to significantly improve translation quality. This cutting-
edge research backs the translations served at translate.google.com, allowing our users to
translate text, web pages and even speech. Deployed within a wide range of Google services
like GMail, Books, Android and web search, Google Translate is a high impact, research
driven product that bridges the language barrier and makes it possible to explore the
multilingual web in 63 languages. Exciting research challenges abound as we pursue human-
quality translation and develop machine translation systems for new languages.
In this work we present two extensions to the well-known dynamic programming beam
search in phrase-based statistical machine translation (SMT), aiming at increased efficiency
of decoding by minimizing the number of language model computations and hypothesis
expansions. Our results show that language model based pre-sorting yields a small
improvement in translation quality and a speedup by a factor of 2. Two look-ahead methods
are shown to further increase translation speed by a factor of 2 without changing the search
space and a factor of 4 with the side-effect of some additional search errors. We compare our
approach with Moses and observe the same performance, but a substantially better trade-off
between translation quality and speed. At a speed of roughly 70 words per second, Moses
reaches 17.2% BLEU, whereas our approach yields 20.0% with identical models.
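The pre-sorting idea can be sketched in a few lines. The phrase table, the log-probabilities and the unigram "language model" below are invented stand-ins for the real models, so this is an illustration of the principle rather than the authors' implementation:

```python
import heapq

# Toy phrase table (source phrase -> [(target, translation log-prob)]) and a
# unigram "language model" -- both invented stand-ins for the real models.
PHRASE_TABLE = {
    "das haus": [("the house", -0.1), ("the home", -0.7), ("the building", -1.2)],
}
LM_LOGPROB = {"the": -1.0, "house": -2.0, "home": -2.5, "building": -3.0}

def lm_score(phrase):
    # Unknown words get a large penalty.
    return sum(LM_LOGPROB.get(w, -10.0) for w in phrase.split())

def presorted_candidates(source_phrase, beam_size=2):
    """Score every translation option with translation + language-model
    log-probability, then keep only the top beam_size.  Options that fall
    outside the beam are never expanded, which is where the speedup comes
    from: the language model is consulted once per option, up front."""
    scored = [(t_logp + lm_score(tgt), tgt)
              for tgt, t_logp in PHRASE_TABLE[source_phrase]]
    return heapq.nlargest(beam_size, scored)
```

With a beam of 2, "the building" is pruned before any hypothesis expansion takes place.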
When trained on very large parallel corpora, the phrase table component of a machine
translation system grows to consume vast computational resources. In this paper, we
introduce a novel pruning criterion that places phrase table pruning on a sound theoretical
foundation. Systematic experiments on four language pairs under various data conditions
show that our principled approach is superior to existing ad hoc pruning methods. We
propose an unsupervised method for clustering the translations of a word, such that the
translations in each cluster share a common semantic sense. Words are assigned to clusters
based on their usage distribution in large monolingual and parallel corpora using the soft K-
Means algorithm. In addition to describing our approach, we formalize the task of translation
sense clustering and describe a procedure that leverages WordNet for evaluation. By
comparing our induced clusters to reference clusters generated from WordNet, we
demonstrate that our method effectively identifies sense-based translation clusters and
benefits from both monolingual and parallel corpora. Finally, we describe a method for
annotating clusters with usage examples.
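The clustering step might be sketched as follows. The two-dimensional "usage vectors", the starting centers and the separation into two senses are all invented for illustration; real usage distributions would be high-dimensional counts drawn from the monolingual and parallel corpora:

```python
import math

def soft_kmeans(points, centers, beta=2.0, iters=20):
    """Minimal soft K-means: each point gets a soft cluster membership
    proportional to exp(-beta * squared distance).  In the setting above,
    the points would be usage-distribution vectors, one per translation."""
    resp = []
    for _ in range(iters):
        # E-step: soft responsibility of each cluster for each point
        resp = []
        for p in points:
            w = [math.exp(-beta * sum((a - b) ** 2 for a, b in zip(p, c)))
                 for c in centers]
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: move each center to the responsibility-weighted mean
        for j in range(len(centers)):
            tot = sum(r[j] for r in resp)
            centers[j] = tuple(sum(r[j] * p[d] for r, p in zip(resp, points)) / tot
                               for d in range(len(points[0])))
    return centers, resp

# Two well-separated "senses": translations whose usage vectors sit near
# (0, 0) versus near (5, 5).  K-means is sensitive to initialization, so the
# starting centers are passed in explicitly here.
pts = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
centers, resp = soft_kmeans(pts, centers=[pts[0], pts[3]])
```

After a few iterations, the first two translations share one cluster and the last two share the other, with near-hard memberships.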
Our Contributions
We present a simple and effective infrastructure for domain adaptation for statistical
machine translation (MT). To build MT systems for different domains, it trains, tunes and
deploys a single translation system that is capable of producing adapted domain translations
and preserving the original generic accuracy at the same time. The approach unites automatic
domain detection and domain model parameterization into one system. Experimental results on
20 language pairs demonstrate its viability. Translating compounds is an important problem
in machine translation. Since many compounds have not been observed during training, they
pose a challenge for translation systems. Previous decompounding methods have often been
restricted to a small set of languages as they cannot deal with more complex compound
forming processes. We present a novel and unsupervised method to learn the compound parts
and morphological operations needed to split compounds into their compound parts. The
method uses a bilingual corpus to learn the morphological operations required to split a
compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set
of compound part candidates. We evaluate our method within a machine translation task and
show significant improvements for various languages to show the versatility of the approach.
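A greatly simplified sketch of the splitting step, using a fixed monolingual word list in place of the learned compound parts and morphological operations (the German words below are illustrative):

```python
# Hypothetical set of known compound parts, standing in for the part
# candidates learned and filtered from monolingual corpora.
PARTS = {"haus", "tür", "schlüssel"}

def split_compound(word, parts=PARTS):
    """Greedy recursive splitter: return the parts of `word`, or None if it
    cannot be built from known parts.  The real method would also apply
    learned morphological operations (e.g. dropping linking elements)."""
    if word in parts:
        return [word]
    for i in range(1, len(word)):
        left, rest = word[:i], word[i:]
        if left in parts:
            tail = split_compound(rest, parts)
            if tail is not None:
                return [left] + tail
    return None

print(split_compound("haustür"))  # ['haus', 'tür']
```

A word outside the part vocabulary simply fails to split, which is the case the bilingual learning step is designed to reduce.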
1.1 Overview
Machine Translation is an important technology for localization, and is particularly
relevant in a linguistically diverse country like India. In this document, we provide a brief
survey of Machine Translation in India. Human translation in India is a rich and ancient
tradition. Works of philosophy, arts, mythology, religion, science and folklore have been
translated among the ancient and modern Indian languages. Numerous classic works of art,
ancient, medieval and modern, have also been translated between European and Indian
languages since the 18th century.
Problem Statement
In the current era, human translation finds application mainly in the administration,
media and education, and to a lesser extent, in business, arts and science and technology.
India is linguistically rich: it has 18 constitutional languages, which are written in
10 different scripts. Hindi is the official language of the Union. English is very widely used in
the media, commerce, science and technology and education. Many of the states have their
own regional language, which is either Hindi or one of the other constitutional languages.
Only about 5% of the population speaks English. In such a situation, there is a big market for
translation between English and the various Indian languages. Currently, this translation is
essentially manual. Use of automation is largely restricted to word processing. Two specific
examples of high volume manual translation are the translation of news from English into local
languages, and the translation of annual reports of government departments and public sector
units among English, Hindi and the local language.
As is clear from above, the market is largest for translation from English into Indian
languages, primarily Hindi. Hence, it is no surprise that a majority of the Indian Machine
Translation (MT) systems are for English-Hindi translation. As is well known, natural
language processing presents many challenges, of which the biggest is the inherent ambiguity
of natural language. MT systems have to deal with ambiguity, and various other NL
phenomena. In addition, the linguistic diversity between the source and target language
makes MT a bigger challenge. This is particularly true of widely divergent languages such as
English and Indian languages.
The major structural difference between English and Indian languages can be
summarized as follows. English is a highly positional language with rudimentary
morphology, and default sentence structure as SVO. Indian languages are highly inflectional,
with a rich morphology, relatively free word order, and default sentence structure as SOV. In
addition, there are many stylistic differences. For example, it is common to see very long
sentences in English, using abstract concepts as the subjects of sentences, and stringing
several clauses together (as in this sentence!). Such constructions are not natural in Indian
languages, and present major difficulties in producing good translations. As is recognized the
world over, with the current state of the art in MT, it is not possible to have Fully Automatic,
High Quality, and General-Purpose Machine Translation. Practical systems need to handle
ambiguity and the other complexities of natural language processing, by relaxing one or more
of the above dimensions.
1.2 LITERATURE SURVEY
Natural Language Processing (NLP) is an area of research and application that explores how
computers can be used to understand and manipulate natural language text or speech to do
useful things. NLP researchers aim to gather knowledge on how human beings understand
and use language so that appropriate tools and techniques can be developed to make
computer systems understand and manipulate natural languages to perform the desired tasks.
The foundations of NLP lie in a number of disciplines, viz. computer and information
sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence
and robotics, psychology, etc. Applications of NLP include a number of fields of studies,
such as machine translation, natural language text processing and summarization, user
interfaces, multilingual and cross language information retrieval (CLIR), speech recognition,
artificial intelligence and expert systems, and so on.
One important area of application of NLP that is relatively new and has not been covered in
the previous ARIST chapters on NLP has become quite prominent due to the proliferation of
the world wide web and digital libraries. Several researchers have pointed out the need for
appropriate research in facilitating multi- or cross-lingual information retrieval, including
multilingual text processing and multilingual user interface systems, in order to exploit the
full benefit of the www and digital libraries.
Scope
Several ARIST chapters have reviewed the field of NLP. The most recent ones
include that by Warner in 1987, and Haas in 1996. Reviews of literature on large-scale NLP
systems, as well as the various theoretical issues have also appeared in a number of
publications (see for example, Jurafsky & Martin, 2000; Manning & Schutze, 1999; Mani &
Maybury, 1999; Sparck Jones, 1999; Wilks, 1996). Smeaton (1999) provides a good
overview of the past research on the applications of NLP in various information retrieval
tasks. Several ARIST chapters have appeared on areas related to NLP, such as on machine-
readable dictionaries (Amsler, 1984; Evans, 1989), speech synthesis and recognition (Lange,
1993), and cross-language information retrieval (Oard & Diekema, 1998). Research on NLP
is regularly published in a number of conferences such as the annual proceedings of ACL
(Association of Computational Linguistics) and its European counterpart EACL, biennial
proceedings of the International Conference on Computational Linguistics (COLING), annual
proceedings of the Message Understanding Conferences (MUCs), Text Retrieval Conferences
(TRECs) and ACM-SIGIR (Association of Computing Machinery – Special Interest Group
on Information Retrieval) conferences. The most prominent journals reporting NLP research
are Computational Linguistics and Natural Language Engineering. Articles reporting NLP
research also appear in a number of information science journals such as Information
Processing and Management, Journal of the American Society for Information Science and
Technology, and Journal of Documentation. Several researchers have also conducted domain-
specific NLP studies and have reported them in journals specifically dealing with the domain
concerned, such as the International Journal of Medical Informatics and Journal of Chemical
Information and Computer Science.
Beginning with the basic issues of NLP, this chapter aims to chart the major research
activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including:
i. natural language text processing systems – text summarization, information
extraction, information retrieval, etc., including domain-specific applications;
ii. natural language interfaces;
iii. NLP in the context of www and digital libraries; and
iv. evaluation of NLP systems.
Linguistic research in information retrieval has not been covered in this review, since
this is a huge area and has been dealt with separately in this volume by David Blair.
Similarly, NLP issues related to the information retrieval tools (search engines, etc.) for web
search are not covered in this chapter since a separate chapter on indexing and retrieval for
the Web has been written in this volume by Edie Rasmussen. Tools and techniques developed
for building NLP systems have been discussed in this chapter along with the specific areas of
applications for which they have been built. Although machine translation (MT) is an
important part, and in fact the origin, of NLP research, this paper does not cover this topic
with sufficient detail since this is a huge area and demands a separate chapter on its own.
Similarly, cross-language information retrieval (CLIR), although is a very important and big
area in NLP research, is not covered in great detail in this chapter. A separate chapter on
CLIR research appeared in ARIST (Oard & Diekema, 1998). However, MT and CLIR have
become two important areas of research in the context of the www and digital libraries. This
chapter reviews some works on MT and CLIR in the context of NLP and IR in digital
libraries and www. Artificial Intelligence techniques, including neural networks etc., used in
NLP have not been included in this chapter.
Some Theoretical Developments
Previous ARIST chapters (Haas, 1996; Warner, 1987) described a number of
theoretical developments that have influenced research in NLP. The most recent theoretical
developments can be grouped into four classes: (i) statistical and corpus-based methods in
NLP, (ii) recent efforts to use WordNet for NLP research, (iii) the resurgence of interest in
finite-state and other computationally lean approaches to NLP, and (iv) the initiation of
collaborative projects to create large grammar and NLP tools. Statistical methods are used in
NLP for a number of purposes, e.g., for word sense disambiguation, for generating grammars
and parsing, for determining stylistic evidences of authors and speakers, and so on. Charniak
(1995) points out that 90% accuracy can be obtained in assigning part-of-speech tag to a
word by applying simple statistical measures. Jelinek (1999) is a widely cited source on the
use of statistical methods in NLP, especially in speech processing. Rosenfeld (2000) reviews
statistical language models for speech processing and argues for a Bayesian approach to the
integration of linguistic theories and data. Mihalcea & Moldovan (1999) mention that although
thus far statistical approaches have been considered the best for word sense disambiguation,
they are useful only in a small set of texts.
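Charniak's observation can be illustrated with the simplest such statistical measure, a most-frequent-tag baseline. The tiny tagged corpus below is invented for the example; a real experiment would train on a corpus such as the Brown corpus:

```python
from collections import Counter, defaultdict

# A tiny hypothetical tagged corpus of (word, part-of-speech) pairs.
TAGGED = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("run", "NOUN"), ("dogs", "NOUN"),
          ("run", "VERB"), ("run", "VERB")]

counts = defaultdict(Counter)
for word, tag in TAGGED:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    """Emit each word's most frequent training tag; on real corpora this
    simple statistic already reaches roughly 90% token accuracy."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default
```

Ambiguous words such as "run" (seen twice as a verb, once as a noun) are resolved purely by the counts.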
Figure 1 Language Translation Using NLP
They propose the use of WordNet to improve the results of statistical analyses of
natural language texts. WordNet is an online lexical reference system developed at Princeton
University. This is an excellent NLP tool containing English nouns, verbs, adjectives and
adverbs organized into synonym sets, each representing one underlying lexical concept.
Details of WordNet are available in Fellbaum (1998) and on the web
(http://www.cogsci.princeton.edu/~wn/). WordNet is now used in a number of NLP research
and applications. One of the major applications of WordNet in NLP has been in Europe with
the formation of EuroWordNet in 1996. EuroWordNet is a multilingual database with WordNets
for several European languages including Dutch, Italian, Spanish, German, French, Czech
and Estonian, structured in the same way as the WordNet for English
(http://www.hum.uva.nl/~ewn/). The finite-state automaton is the mathematical device used
to implement regular expressions – the standard notation for characterizing text sequences.
Variations of automata such as finite-state transducers, Hidden Markov Models, and n-gram
grammars are important components of speech recognition and speech synthesis,
spell-checking, and information extraction, which are important applications of NLP. Different
applications of the Finite State methods in NLP have been discussed by Jurafsky & Martin
(2000), Kornai (1999) and Roche & Schabes (1997). The work of NLP researchers has been
greatly facilitated by the availability of large-scale grammar for parsing and generation.
Researchers can get access to large-scale grammars and tools through several websites, for
example Lingo (http://lingo.stanford.edu), Computational Linguistics & Phonetics
(http://www.coli.uni-sb.de/software.phtml), and Parallel grammar project
(http://www.parc.xerox.com/istl/groups/nltt/pargram/).
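As a minimal illustration of the finite-state machinery mentioned above, the sketch below hand-codes a deterministic finite automaton for the regular language (ab)+ and checks it against the equivalent regular expression; the language itself is chosen arbitrarily for the example:

```python
import re

# Transition table of a DFA recognizing (ab)+ -- the kind of finite-state
# device that regex engines compile their patterns into.
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
ACCEPTING = {2}

def accepts(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

# The hand-built DFA and the regular expression agree on every input:
for s in ["ab", "abab", "aba", "", "ba"]:
    assert accepts(s) == bool(re.fullmatch(r"(ab)+", s))
```

The same table-driven pattern, extended with output symbols or probabilities, gives finite-state transducers and Hidden Markov Models respectively.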
Another significant development in recent years is the formation of various national
and international consortia and research groups that facilitate research in NLP and help
share expertise. LDC (Linguistic Data Consortium) (http://www.ldc.upenn.edu/) at the
University of Pennsylvania is a typical example that creates, collects and distributes speech
and text databases, lexicons, and other resources for research and development among
universities, companies and government research laboratories. The Parallel Grammar project
is another example of international cooperation. This is a collaborative effort involving
researchers from Xerox PARC in California, the University of Stuttgart and the University of
Konstanz in Germany, the University of Bergen in Norway, and Fuji Xerox in Japan. The aim of
this project is to produce wide coverage grammars for English, French, German, Norwegian,
Japanese, and Urdu which are written collaboratively with a commonly-agreed-upon set of
grammatical features (http://www.parc.xerox.com/istl/groups/nltt/pargram/). The recently
formed Global WordNet Association is yet another example of cooperation. It is a non-
commercial organization that provides a platform for discussing, sharing and connecting
WordNets for all languages in the world. The first international WordNet conference to be
held in India in early 2002 is expected to address various problems of NLP by researchers
from different parts of the world.
2. SYSTEM ANALYSIS
2.1 Existing System
At the core of any NLP task there is the important issue of natural language
understanding. The process of building computer programs that understand natural language
involves three major problems: the first one relates to the thought process, the second one to
the representation and meaning of the linguistic input, and the third one to the world
knowledge. Thus, an NLP system may begin at the word level – to determine the
morphological structure, nature (such as part-of-speech, meaning) etc. of the word – and then
may move on to the sentence level – to determine the word order, grammar, meaning of the
entire sentence, etc.— and then to the context and the overall environment or domain. A
given word or a sentence may have a specific meaning or connotation in a given context or
domain, and may be related to many other words and/or sentences in the given context.
Liddy (2011) and Feldman (2012) noted that, in order to understand natural
languages, it is important to be able to distinguish among the following seven interdependent
levels that people use to extract meaning from text or spoken language:
the phonetic or phonological level, which deals with pronunciation;
the morphological level, which deals with the smallest meaning-bearing parts of words,
including suffixes and prefixes;
the lexical level, which deals with the lexical meaning of words and part-of-speech analysis;
the syntactic level, which deals with the grammar and structure of sentences;
the semantic level, which deals with the meaning of words and sentences;
the discourse level, which deals with the structure of different kinds of text, using document
structures; and
the pragmatic level, which deals with knowledge that comes from the outside world, i.e.,
from outside the contents of the document.
A natural language processing system may involve all or some of these levels of analysis.
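As a toy illustration, the lower levels of analysis can be sketched in a few lines of Python; the lexicon, the tag names and the single suffix rule below are invented for the example and stand in for real morphological and lexical resources:

```python
# A hypothetical miniature lexicon mapping stems to parts of speech.
LEXICON = {"dog": "NOUN", "cat": "NOUN", "chase": "VERB", "the": "DET"}

def morph(word):
    # Morphological level: peel off a final "s" if the stem is known.
    if word.endswith("s") and word[:-1] in LEXICON:
        return word[:-1]
    return word

def lexical(tokens):
    # Lexical level: assign each stem its part of speech; unknown words
    # fall through to "UNK" and would need the higher levels to resolve.
    return [(t, LEXICON.get(morph(t), "UNK")) for t in tokens]

print(lexical("the dogs chase the cats".split()))
```

A full system would continue upward: a syntactic level parsing the tag sequence, and semantic, discourse and pragmatic levels interpreting the result in context.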
2.1.1 Disadvantages
When translation is required from one language to another, for example from French
to English, there are three basic methods that can be employed:
translating each phrase on a word-for-word basis,
hiring someone who speaks both languages, or
using translation software.
Using simple dictionaries for a word by word translation is very time consuming and can
often result in errors. Many words have different meanings in various contexts. And if the
reader of the translated material finds the wording funny, that can be a bad reflection on your
business. Allowing the gist of your material to be lost in translation can therefore mean the
loss of clients.
Hiring someone who speaks a couple of languages generally leads to much better
results. Therefore this option can be fine for small projects with an occasional need for
translations. When you need your information translated to several different languages
however, things get more complicated. In that situation you will probably need to find more
than one translator. Moreover, the sentences produced by existing translation tools often
suffer from:
sentences that are not meaningful
garbled handling of special characters
unreliable document conversion
long loading times over low-bandwidth connections
minimal support in older web browsers
a dependence on third-party scripting languages for processing
2.2 Proposed System
An apparatus for translating a series of source words in a first language to a series of
target words in a second language works as follows. For an input series of source words, at
least two target hypotheses, each including a series of target words, are generated. Each
target word has a context comprising at least one other word in the target hypothesis. For
each target hypothesis, a language model match score is computed, comprising an estimate of
the probability of occurrence of the series of words in the target hypothesis. At least one
alignment connecting each source word with at least one target word in the target hypothesis
is identified. For each source word and each target hypothesis, a word match score is
computed, comprising an estimate of the conditional probability of occurrence of the source
word, given the target word in the target hypothesis which is connected to the source word
and given the context in the target hypothesis of the target word which is connected to the
source word. For each target hypothesis, a translation match score is computed as a
combination of the word match scores for the target hypothesis and the source words in the
input series of source words. A target hypothesis match score is then computed as a
combination of the language model match score for the target hypothesis and the translation
match score for the target hypothesis. The target hypothesis having the best target hypothesis
match score is output.
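In log space, the combination of scores described above reduces to a sum. The sketch below uses invented numbers for two competing hypotheses; real scores would come from the language model and the word-alignment model:

```python
def hypothesis_match_score(lm_logprob, word_match_logprobs):
    """Combine the scores described above in log space: the language-model
    score for the target word sequence plus the per-source-word match
    scores, so the overall combination is a simple sum."""
    return lm_logprob + sum(word_match_logprobs)

# Two toy hypotheses for the same source sentence (all numbers invented):
h1 = hypothesis_match_score(-4.0, [-0.5, -0.8])   # fluent, weaker alignment
h2 = hypothesis_match_score(-3.0, [-1.5, -2.0])   # fluent, poor word matches
best = max([("hypothesis 1", h1), ("hypothesis 2", h2)],
           key=lambda kv: kv[1])[0]
```

The hypothesis with the best combined score is the one output, exactly as in the final step above.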
The technique of creating interpolated language models for different contexts has
been used with success in a number of conversational interfaces [1, 2, 3]. In this case, the
pertinent context is the system’s “dialogue state”, and it’s typical to group transcribed
utterances by dialogue state and build one language model per state. Typically, states with
little data are merged, and the state-specific language models are interpolated, or otherwise
merged. Language models corresponding to multiple states may also be interpolated, to share
information across similar states. The technique we develop here differs in two key respects.
First, we derive interpolation weights for thousands of recognition contexts, rather than a
handful of dialogue states. This makes it impractical to create each interpolated language
model offline and swap in the desired one at runtime. Our language models are large, and we
only learn the recognition context for a particular utterance when the audio starts to arrive.
Second, rather than relying on transcribed utterances from each recognition context to train
state-specific language models, we instead interpolate a small number of language models
trained from large corpora.
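A minimal sketch of such interpolation, with two invented unigram "models" standing in for the large n-gram models described above; the context-specific weights are likewise made up for the example:

```python
def interpolate(models, weights):
    """Return a language model that mixes the component models with
    context-specific weights.  Here each 'model' is just a dict mapping a
    word to its probability -- a stand-in for a full n-gram model."""
    def p(word):
        return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))
    return p

# Hypothetical component models trained on large general and media corpora:
general = {"play": 0.02, "music": 0.01, "weather": 0.005}
media   = {"play": 0.10, "music": 0.08, "weather": 0.001}

# A "media" recognition context leans on the media corpus model:
lm = interpolate([general, media], weights=[0.3, 0.7])
```

Because only the weights differ per context, thousands of recognition contexts can share the same few component models at runtime, rather than swapping in a fully built model per context.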
2.2.1 Advantages
1. Understandable. For instance, if we translate an English text into our mother tongue,
such as Malay, it becomes much easier for us to understand.
2. Gain knowledge. Translating a literary text takes effort, but literature is one of the
branches of learning a language, and through translation we come to know more of
literary texts such as Shakespeare's poems.
3. Widen vocabulary. Literary texts use rich classic English words, and translating them
into other languages may draw on equally rich words, thereby increasing our
vocabulary indirectly.
4. Discipline the mind. Those in the literary field can discipline their minds by studying,
researching and discovering new words and even cultures in the texts they translate.
As a result, we can develop our own experts in translating literary texts, so that we do
not have to import them.
5. Knowing history. We can learn the history contained in the literary text itself. For
example, foreigners can learn more about the history of Malaysia by reading the
Hikayat Tun Sri Lanang, and, conversely, we can learn about other countries'
cultures by reading their literary texts. This also leads to knowledge of cultures,
politics and customs.
2.3 Future Work
Register, or context of situation, is a set of vocabularies and their meanings, a
configuration of semantic patterns, along with words and structures such as the double
negative (as used by some Black American speakers) employed in the realization of these
meanings. It relates variation of language use to variations of social context. Every context
has its distinctive vocabulary. You can see a great difference between the vocabulary used by
mechanics in a garage and that of doctors. The selection of meanings constitutes the variety to
which a text belongs.
Halliday discusses the term Register in detail. This term refers to the relationship between
language (and other semiotic forms) and the features of the context. All along, we have been
characterizing this relationship (which we can now call register) by using the descriptive
categories of Field, Tenor, and Mode. Registers vary. There are clues or indices in the
language that help us predict what the register of a given text (spoken or written) is. Halliday
uses the example of the phrase "Once upon a time" as an indexical feature that signals to us
that we're about to read or hear a folk tale. Halliday also distinguishes between register and
another kind of language variety, dialect. For Halliday, dialect variety is a variety according
to the user. Dialects can be regional or social. Register is a variety according to use, or the
social activity in which you are engaged. Halliday says, "dialects are saying the same thing in
different ways, whereas registers are saying different things."
Register variables delineate the relationship between language function and language
form. To get a clear understanding of language form and function, consider the words cats
and dogs. The final s in both has the same written form. In cats it is pronounced /s/, but in
dogs it is pronounced /z/, so they have different spoken forms. It functions the same in both
because it turns them into plural forms. Language functions are also of great importance.
Some language functions are vocative, aesthetic, phatic, metalingual, informative, descriptive,
expressive and social. Among them, the last four are the most important here, so let's take a
brief look at them.
Descriptive function gives actual information. You can test this information, then accept or
reject it. (It's -10° outside. If it is winter, this can be accepted; but in summer it would be
rejected in a normal situation.)
Expressive function supplies information about the speaker and his or her feelings. (I won't
invite her again. It is implied that the speaker did not like her at the first meeting.) Newmark
believes the core of the expressive function is the mind of the speaker, the writer or the
originator of the utterance, who uses the utterance to express his feelings irrespective of any
response.
Social function shows a particular relationship between speaker and listener. (Will that be all,
sir? The sentence implies the context of a restaurant.)
Informative function: Newmark believes the core of the informative function of language is
the external situation, the facts of a topic, reality outside language, including reported ideas or
theories. The format of an informative text is often standard: a textbook, a technical report, an
article in a newspaper or a periodical, a scientific paper, a thesis, or the minutes or agenda of a
meeting.
2.4 Feasibility Study
A feasibility study, also known as feasibility analysis, is an analysis of the viability of
an idea. It describes a preliminary study undertaken to determine and document a project’s
viability. The results of this analysis are used in making the decision whether to proceed with
the project or not. This analytical tool, used during the project planning phase, shows how a
business would operate under a set of assumptions, such as the technology used, the facilities
and equipment, the capital needs, and other financial aspects. The study is the first point in a
project development process at which it becomes clear whether the project is a technically and
economically feasible concept. As the study requires a strong financial and technical
background, outside consultants conduct most studies.
A feasible project is one where the project can generate an adequate amount of cash
flow and profit, withstand the risks it will encounter, remain viable in the long term and
meet the goals of the business. The venture can be a start-up of a new business, a purchase
of an existing business, or an expansion of a current business. Consequently, costs and
benefits are estimated with greater accuracy at this stage.
Feasibility Considerations:
Three key considerations are involved in the feasibility study.
1. Economic feasibility
2. Technical feasibility
3. Operational feasibility
2.4.1 Economic Feasibility
Economic analysis could also be referred to as cost/benefit analysis. It is the most
frequently used method for evaluating the effectiveness of a new system. In economic
analysis the procedure is to determine the benefits and savings that are expected from a
candidate system and compare them with costs. If benefits outweigh costs, then the decision
is made to design and implement the system. An entrepreneur must accurately weigh the cost
versus benefits before taking an action.
Possible questions raised in economic analysis are:
Is the system cost effective?
Do benefits outweigh costs?
The cost of doing full system study
The cost of business employee time
Estimated cost of hardware
Estimated cost of software/software development
Is the project possible, given the resource constraints?
What are the savings that will result from the system?
Cost of employees' time for study
Cost of packaged software/software development
Selection among alternative financing arrangements (rent/lease/purchase)
The concerned business must be able to see the value of the investment it is pondering
before committing to an entire system study. If short-term costs are not overshadowed by
long-term gains or produce no immediate reduction in operating costs, then the system is not
economically feasible, and the project should not proceed any further. If the expected benefits
equal or exceed costs, the system can be judged to be economically feasible. Economic
analysis is used for evaluating the effectiveness of the proposed system.
The economic feasibility will review the expected costs to see if they are in-line with
the projected budget or if the project has an acceptable return on investment. At this point, the
projected costs will only be a rough estimate. The exact costs are not required to determine
economic feasibility. It is only required to determine if it is feasible that the project costs will
fall within the target budget or return on investment. A rough estimate of the project schedule
is required to determine if it would be feasible to complete the systems project within a
required timeframe. The required timeframe would need to be set by the organization.
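As a worked sketch of this comparison, the break-even test reduces to simple arithmetic. The cost and benefit figures below are purely hypothetical; real values would come from the feasibility study itself:

```java
// Hypothetical cost/benefit figures for illustration only; real estimates
// would come from the feasibility study itself.
public class EconomicFeasibility {

    // Return on investment as a percentage: (benefits - costs) / costs * 100.
    public static double roi(double totalBenefits, double totalCosts) {
        return (totalBenefits - totalCosts) / totalCosts * 100.0;
    }

    // The project is judged economically feasible when expected benefits
    // equal or exceed expected costs.
    public static boolean isFeasible(double totalBenefits, double totalCosts) {
        return totalBenefits >= totalCosts;
    }

    public static void main(String[] args) {
        double costs = 50_000;    // hardware, software, study and staff time
        double benefits = 65_000; // projected savings over the review period
        System.out.println("ROI: " + roi(benefits, costs) + "%");
        System.out.println("Feasible: " + isFeasible(benefits, costs));
    }
}
```

With these figures the rough estimate is a 30% return, so the project would fall within an acceptable budget.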
2.4.2 Technical Feasibility
A large part of determining resources has to do with assessing technical feasibility. It
considers the technical requirements of the proposed project. The technical requirements are
then compared to the technical capability of the organization. The systems project is
considered technically feasible if the internal technical capability is sufficient to support the
project requirements.
The analyst must find out whether current technical resources can be upgraded or
added to in a manner that fulfils the request under consideration. This is where the expertise
of system analysts is beneficial, since using their own experience and their contact with
vendors they will be able to answer the question of technical feasibility.
The essential questions that help in testing the technical feasibility of a system include the
following:
Is the project feasible within the limits of current technology?
Does the technology exist at all?
Is it available within the given resource constraints (manpower such as programmers,
testers and debuggers; software and hardware)?
Is it a practical proposition?
Are the current technical resources sufficient for the new system?
Can they be upgraded to provide the level of technology necessary for the new system?
Do we possess the necessary technical expertise, and is the schedule reasonable?
Can the technology be easily applied to current problems?
Does the technology have the capacity to handle the solution?
2.4.3 Operational Feasibility
Operational feasibility is dependent on human resources available for the project and involves
projecting whether the system will be used if it is developed and implemented. Operational
feasibility is a measure of how well a proposed system solves the problems, and takes
advantage of the opportunities identified during scope definition and how it satisfies the
requirements identified in the requirements analysis phase of system development.
Operational feasibility reviews the willingness of the organization to support the proposed
system. This is probably the most difficult of the feasibilities to gauge. In order to determine
this feasibility, it is important to understand the management commitment to the proposed
project. If the request was initiated by management, it is likely that there is management
support and the system will be accepted and used. However, it is also important that the
employee base will be accepting of the change.
The essential questions that help in testing the operational feasibility of a system include the
following:
Does current mode of operation provide adequate throughput and response time?
Does current mode provide end users and managers with timely, pertinent, accurate
and useful formatted information?
Does current mode of operation provide cost-effective information services to the
business?
Could there be a reduction in cost and or an increase in benefits?
Does current mode of operation offer effective controls to protect against fraud and to
guarantee accuracy and security of data and information?
Does current mode of operation make maximum use of available resources, including
people, time, and flow of forms?
Does current mode of operation provide reliable services?
Are the services flexible and expandable?
Are the current work practices and procedures adequate to support the new system?
If the system is developed, will it be used?
Manpower problems; Labour objections; Manager resistance
Organizational conflicts and policies
Social acceptability; Government regulations
Does management support the project?
Are the users dissatisfied with current business practices?
Will it reduce the time (operation) considerably?
Have the users been involved in the planning and development of the project?
Will the proposed system really benefit the organization?
Does the overall response time improve?
Will accessibility of information be lost?
Will the system affect the customers in a considerable way?
How do the end-users feel about their role in the new system?
Which end-users or managers may resist or not use the system?
How will the working environment of the end-user change?
Can or will end-users and management adapt to the change?
3 SYSTEM REQUIREMENTS
3.1 Hardware Requirements
Processor - Intel Pentium Dual Core
RAM - 1 GB
Hard disk - 80 GB
Monitor - 17 inch
Keyboard - Logitech
Mouse - Optical mouse (Logitech)
3.2 Software Requirements
Front end - Java
Back end - MS SQL Server
Operating system - Windows 7
Tools used - NetBeans
4 SOFTWARE DESCRIPTION
4.1 FRONT END
4.1.1 Java introduction
Java is an object-oriented programming language developed by Sun Microsystems
and it is also a powerful internet programming language. Java is a high-level programming
language which has the following features:
1. Object oriented
2. Portable
3. Architecture-neutral
4. High-performance
5. Multithreaded
6. Robust
7. Secure
Java is an efficient application programming language, with APIs that support
GUI-based application development. The following features of Java make it suitable for
implementing this project. The language was initially called "Oak", but it was renamed
"Java" in 1995. The primary motivation behind the language was the need for a platform-
independent language that could be used to create software to be embedded in various
consumer electronic devices. Java is a programmer's language: it is cohesive and consistent,
and, except for the constraints imposed by the Internet environment, it gives the
programmer full control. The excitement of the Internet attracted software vendors, so
Java development tools from many vendors quickly became available. That same excitement
has provided the impetus for a multitude of software developers to discover Java and its
many features.
With most programming languages, you either compile or interpret a program so that
you can run it on your computer. The Java programming language is unusual in that a
program is both compiled and interpreted. With the compiler, you first translate a program
into an intermediate language called Java byte codes: the platform-independent codes
interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java
byte code instruction on the computer. Compilation happens just once; interpretation occurs
each time the program is executed.
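As a minimal illustration of this compile-then-interpret cycle (the class name and message here are arbitrary):

```java
// Compiled once to byte codes with `javac Hello.java`, then interpreted
// on any Java VM with `java Hello`.
public class Hello {

    // A pure method, so the behaviour is easy to check independently of I/O.
    public static String greeting() {
        return "Hello from the Java platform";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

The resulting `Hello.class` file contains the byte codes; the same file runs unchanged on any platform with a Java VM.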
You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web
browser that can run applets, is an implementation of the Java VM. Java byte codes help
make “write once, run anywhere” possible. You can compile your program into byte codes on
any platform that has a Java compiler. The byte codes can then be run on any implementation
of the Java VM. That means that as long as a computer has a Java VM, the same program
written in the Java programming language can run on Windows 2000, a Solaris workstation,
or on an iMac.
The Java Platform
A platform is the hardware or software environment in which a program runs. We’ve already
mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and
MacOS. Most platforms can be described as a combination of the operating system and
hardware. The Java platform differs from most other platforms in that it’s a software-only
platform that runs on top of other hardware-based platforms.
The Java platform has two components:
The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is
ported onto various hardware-based platforms. The Java API is a large collection of
ready-made software components that provide many useful capabilities, such as graphical user
interface (GUI) widgets. The Java API is grouped into libraries of related classes and
interfaces; these libraries are known as packages. The next section, "What Can Java
Technology Do?", highlights the functionality that some of the packages in the Java API
provide. The following figure depicts a program that’s running on the Java platform. As the
figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after compilation, runs on a specific hardware
platform. As a platform-independent environment, the Java platform can be a bit slower than
native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code
compilers can bring performance close to that of native code without threatening portability.
What Can Java Technology Do?
The most common types of programs written in the Java programming language are
applets and applications. If you’ve surfed the Web, you’re probably already familiar with
applets. An applet is a program that adheres to certain conventions that allow it to run within
a Java-enabled browser. However, the Java programming language is not just for writing
cute, entertaining applets for the Web. The general-purpose, high-level Java programming
language is also a powerful software platform. Using the generous API, you can write many
types of programs.
An application is a standalone program that runs directly on the Java platform. A
special kind of application known as a server serves and supports clients on a network.
Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another
specialized program is a servlet. A servlet can almost be thought of as an applet that runs on
the server side. Java Servlets are a popular choice for building interactive web applications,
replacing the use of CGI scripts.
Servlets are similar to applets in that they are runtime extensions of applications.
Instead of working in browsers, though, servlets run within Java Web servers, configuring or
tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software
components that provide a wide range of functionality. Every full implementation of the
Java platform gives you the following features:
The essentials: Objects, strings, threads, numbers, input and output, data structures,
system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram
Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users
worldwide. Programs can automatically adapt to specific locales and be displayed in
the appropriate language.
Security: Both low level and high level, including electronic signatures, public and
private key management, access control, and certificates.
Software components: Known as JavaBeansTM, these can plug into existing component
architectures.
Object serialization: Allows lightweight persistence and communication via Remote
Method Invocation (RMI).
Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of
relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration,
telephony, speech, animation, and more. The following figure depicts what is included in the
Java 2 SDK.
How Will Java Technology Change My Life?
We can’t promise you fame, fortune, or even a job if you learn the Java programming
language. Still, it is likely to make your programs better and to require less effort than other
languages. We believe that Java technology will help you do the following:
Get started quickly: Although the Java programming language is a powerful object-
oriented language, it’s easy to learn, especially for programmers already familiar with
C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and
so on) suggest that a program written in the Java programming language can be four
times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding
practices, and its garbage collection helps you avoid memory leaks. Its object
orientation, its JavaBeans component architecture, and its wide-ranging, easily
extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as little as half of
what it would be writing the same program in C++. Why? You write fewer lines of code,
and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program
portable by avoiding the use of libraries written in other languages. The 100% Pure
JavaTM Product Certification Program has a repository of historical process manuals,
white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into
machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central
server. Applets take advantage of the feature of allowing new classes to be loaded “on
the fly,” without recompiling the entire program.
ODBC
Microsoft Open Database Connectivity (ODBC) is a standard programming interface
for application developers and database systems providers. Before ODBC became a de facto
standard for Windows programs to interface with database systems, programmers had to use
proprietary languages for each database they wanted to connect to. Now, ODBC has made the
choice of the database system almost irrelevant from a coding perspective, which is as it
should be. Application developers have much more important things to worry about than the
syntax that is needed to port their program from one database to another when business needs
suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular
database that is associated with a data source that an ODBC application program is written
to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to
a particular database. For example, the data source named Sales Figures might be a SQL
Server database, whereas the Accounts Payable data source could refer to an Access
database. The physical database referred to by a data source can reside anywhere on the
LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are
installed when you set up a separate database application, such as SQL Server Client or Visual
Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called
ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-
alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this
program and each maintains a separate list of ODBC data sources. From a programming
perspective, the beauty of ODBC is that the application can be written to use the same set of
function calls to interface with any data source, regardless of the database vendor. The source
code of the application doesn’t change whether it talks to Oracle or SQL Server. We only
mention these two as an example. There are ODBC drivers available for several dozen
popular database systems. Even Excel spreadsheets and plain text files can be turned into data
sources. The operating system uses the Registry information written by ODBC Administrator
to determine which low-level ODBC drivers are needed to talk to the data source (such as the
interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the
ODBC application program. In a client/server environment, the ODBC API even handles
many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there
must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking
directly to the native database interface. ODBC has had many detractors make the charge that
it is too slow. Microsoft has always claimed that the critical factor in performance is the
quality of the driver software that is used. In our humble opinion, this is true. The availability
of good ODBC drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never match the
speed of pure assembly language.
Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs,
which means you finish sooner. Meanwhile, computers get faster every year.
JDBC
In an effort to set an independent database standard API for Java, Sun Microsystems
developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access
mechanism that provides a consistent interface to a variety of RDBMSs. This consistent
interface is achieved through the use of “plug-in” database connectivity modules, or drivers.
If a database vendor wishes to have JDBC support, he or she must provide the driver for each
platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As
you discovered earlier in this chapter, ODBC has widespread support on a variety of
platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much
faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review
that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was
released soon after. The remainder of this section will cover enough information about JDBC
for you to know what it is about and how to use it effectively. This is by no means a complete
overview of JDBC. That would fill an entire book.
JDBC Goals
Few software packages are designed without goals in mind, and JDBC is no exception: its
many goals drove the development of the API. These goals, in conjunction with early
reviewer feedback, have finalized the JDBC class library into a solid framework for building
database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to
why certain classes and functionalities behave the way they do. The eight design goals for
JDBC are as follows:
1. SQL Level API
The designers felt that their main goal was to define a SQL interface for Java.
Although not the lowest database interface level possible, it is at a low enough level for
higher-level tools and APIs to be created. Conversely, it is at a high enough level for
application programmers to use it confidently. Attaining this goal allows for future tool
vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end
user.
2. SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to
support a wide variety of vendors, JDBC will allow any query statement to be passed through
it to the underlying database driver. This allows the connectivity module to handle non-
standard functionality in a manner that is suitable for its users.
3. JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows
JDBC to use existing ODBC level drivers by the use of a software interface. This interface
would translate JDBC calls to ODBC and vice versa.
4. Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers felt that they
should not stray from the current design of the core Java system.
5. Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun
felt that the design of JDBC should be very simple, allowing for only one method of
completing a task per mechanism. Allowing duplicate functionality only serves to confuse the
users of the API.
6. Use strong, static typing wherever possible
Strong typing allows more error checking to be done at compile time; consequently,
fewer errors appear at runtime.
7. Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple
SELECT, INSERT, DELETE and UPDATE statements, these queries should be simple to
perform with JDBC. However, more complex SQL statements should also be possible.
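A small sketch of such a common case is given below. The connection URL, table name and column names are hypothetical, purely for illustration; any JDBC driver on the classpath could serve them.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SimpleQuery {

    // Building the statement text is kept separate so it can be checked
    // without a live database. The table and column names are hypothetical.
    public static String selectByIdSql(String table) {
        return "SELECT name FROM " + table + " WHERE id = ?";
    }

    // A typical simple SELECT: connect, prepare, bind, execute, read.
    // The url parameter would name a real driver, e.g. a SQL Server JDBC URL.
    public static String findName(String url, int id) throws SQLException {
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(selectByIdSql("employees"))) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        }
    }
}
```

The same code runs against any vendor's database; only the connection URL and driver change, which is exactly the portability goal described above.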
Finally, we decided to proceed with the implementation using Java networking, and for
dynamically updating the cache table we use an MS Access database. Java comprises two
things: a programming language and a platform.
Networking
TCP/IP stack
The TCP/IP stack is shorter than the OSI one. TCP is a connection-oriented protocol;
UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams
The IP layer provides a connectionless and unreliable delivery system. It considers each
datagram independently of the others. Any association between datagrams must be supplied
by the higher layers. The IP layer supplies a checksum that includes its own header. The
header includes the source and destination addresses. The IP layer handles routing through an
internet. It is also responsible for breaking up large datagrams into smaller ones for
transmission and reassembling them at the other end.
UDP
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents
of the datagram and port numbers. These are used to provide a client/server model, as
described later.
TCP
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a
virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for
machines so that they can be located. The address is a 32 bit integer which gives the IP
address. This encodes a network ID and more addressing. The network ID falls into various
classes according to the size of the network address.
Network address
Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class
B uses 16-bit network addressing, Class C uses 24-bit network addressing, and Class D is
reserved for multicast addressing.
Subnet address
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one
sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address
8 bits are finally used for host addresses within our subnet. This places a limit of 256
machines that can be on the subnet.
Total address
The 32 bit address is usually written as 4 integers separated by dots.
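The dotted notation can be derived from the 32-bit integer by extracting one byte at a time; a small sketch:

```java
public class DottedQuad {

    // Convert a 32-bit IPv4 address into the familiar dotted notation by
    // extracting the four bytes from most significant to least significant.
    public static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." +
               ((address >>> 16) & 0xFF) + "." +
               ((address >>> 8) & 0xFF) + "." +
               (address & 0xFF);
    }
}
```

For example, the integer 0x7F000001 corresponds to the loopback address 127.0.0.1.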
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a
message to a server, you send it to the port for that service of the host that it is running on.
This is not location transparency! Certain of these ports are "well known".
Sockets
A socket is a data structure maintained by the system to handle network connections. A
socket is created using the socket call. It returns an integer that is like a file descriptor. In
fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here "family" will be AF_INET for IP communications, protocol will be zero, and type will
depend on whether TCP or UDP is used. Two processes wishing to communicate over a
network create a socket each. These are similar to two ends of a pipe - but the actual pipe
does not yet exist.
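In Java the same idea is wrapped by the java.net classes, so the family/type/protocol details are chosen for you. A minimal loopback sketch, with the port picked by the operating system:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class LoopbackEcho {

    // Create a listening socket, connect a client to it over the loopback
    // interface, send one line, and return the line the server received.
    public static String roundTrip(String message) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // port 0: OS picks a free port
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket accepted = server.accept()) {
                // Client end of the "pipe": write one line, auto-flushed.
                new PrintWriter(client.getOutputStream(), true).println(message);
                // Server end of the "pipe": read the line back.
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(accepted.getInputStream()));
                return in.readLine();
            }
        }
    }
}
```

The two sockets behave like the two ends of the pipe described above: bytes written on one end arrive at the other.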
JFree Chart
JFreeChart is a free 100% Java chart library that makes it easy for developers to display
professional quality charts in their applications. JFreeChart's extensive feature set includes:
a consistent and well-documented API, supporting a wide range of chart types;
a flexible design that is easy to extend, targeting both server-side and client-side
applications;
support for many output types, including Swing components, image files
(including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).
JFreeChart is "open source" or, more specifically, free software. It is distributed under the
terms of the GNU Lesser General Public License (LGPL), which permits use in proprietary
applications.
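As a sketch of how small a JFreeChart program can be, the example below builds a pie chart and saves it as a PNG file. It assumes the JFreeChart jar (1.0.x API) is on the classpath; the dataset values and file name are hypothetical.

```java
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical data purely for illustration.
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Java", 60);
        dataset.setValue("C++", 25);
        dataset.setValue("Other", 15);

        // Arguments: title, dataset, legend, tooltips, URLs.
        JFreeChart chart = ChartFactory.createPieChart(
                "Languages used", dataset, true, true, false);

        // One of the supported output types: a PNG image file.
        ChartUtilities.saveChartAsPNG(new File("languages.png"), chart, 500, 300);
    }
}
```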
1. Map Visualizations
Charts showing values that relate to geographical areas. Some examples include: (a)
population density in each state of the United States, (b) income per capita for each
country in Europe, (c) life expectancy in each country of the world.
The tasks in this project include:
Sourcing freely redistributable vector outlines for the countries of the world and for
states/provinces in particular countries (the USA in particular, but also other areas);
creating an appropriate dataset interface (plus default implementation) and a renderer,
and integrating these with the existing XYPlot class in JFreeChart; testing, documenting,
testing some more, documenting some more.
2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts: display a
separate control that shows a small version of all the time series data, with a sliding "view"
rectangle that allows you to select the subset of the time series data to display in the main
chart.
3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard
mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars,
and lines/time series) that can be delivered easily via both Java Web Start and an applet.
4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties
that can be set for charts. Extend (or re-implement) this mechanism to provide greater
end-user control over the appearance of the charts.
J2ME (Java 2 Micro Edition)
Sun Microsystems defines J2ME as "a highly optimized Java run-time environment targeting
a wide range of consumer products, including pagers, cellular phones, screen-phones, digital
set-top boxes and car navigation systems." Announced in June 1999 at the JavaOne
Developer Conference, J2ME brings the cross-platform functionality of the Java language to
smaller devices, allowing mobile wireless devices to share applications. With J2ME, Sun has
adapted the Java platform for consumer products that incorporate or are based on small
computing devices.
1. General J2ME architecture
J2ME uses configurations and profiles to customize the Java Runtime Environment (JRE). As
a complete JRE, J2ME is comprised of a configuration, which determines the JVM used, and
a profile, which defines the application by adding domain-specific classes. The configuration
defines the basic run-time environment as a set of core classes and a specific JVM that run on
specific types of devices; configurations are discussed in detail later. The profile defines the
application; specifically, it adds domain-specific classes to the J2ME configuration to define
certain uses for devices. The following graphic depicts the relationship between the different
virtual machines, configurations, and profiles.
It also draws a parallel with the J2SE API and its Java virtual machine. While the J2SE
virtual machine is generally referred to as a JVM, the J2ME virtual machines, KVM and
CVM, are subsets of JVM. Both KVM and CVM can be thought of as a kind of Java virtual
machine -- it's just that they are shrunken versions of the J2SE JVM and are specific to
J2ME.
2. Developing J2ME applications
In this section, we will go over some considerations you need to keep in mind
when developing applications for smaller devices. We'll take a look at the way the compiler
is invoked when using J2SE to compile J2ME applications. Finally, we'll explore packaging
and deployment and the role pre-verification plays in this process.
3. Design considerations for small devices
Developing applications for small devices requires you to keep certain strategies in mind
during the design phase. It is best to strategically design an application for a small device
before you begin coding. Correcting the code because you failed to consider all of the
"gotchas" before developing the application can be a painful process. Here are some design
strategies to consider:
* Keep it simple. Remove unnecessary features, possibly making those features a separate,
secondary application.
* Smaller is better. This consideration should be a "no brainer" for all developers. Smaller
applications use less memory on the device and require shorter installation times. Consider
packaging your Java applications as compressed Java Archive (jar) files.
* Minimize run-time memory use. To minimize the amount of memory used at run time, use
scalar types in place of object types. Also, do not depend on the garbage collector. You
should manage the memory efficiently yourself by setting object references to null when you
are finished with them. Another way to reduce run-time memory is to use lazy instantiation,
only allocating objects on an as-needed basis.
Other ways of reducing overall and peak memory use on small devices are to release
resources quickly, reuse objects, and avoid exceptions.
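The lazy-instantiation strategy mentioned above can be sketched as follows. The Parser class here is a hypothetical stand-in for any expensive-to-create helper:

```java
public class LazyHolder {

    // Hypothetical stand-in for an object that is expensive to construct
    // on a small device.
    static class Parser {
        Parser() { /* imagine heavy initialisation here */ }
    }

    private Parser parser; // not allocated until actually needed

    // Allocate on first use only (lazy instantiation).
    public Parser getParser() {
        if (parser == null) {
            parser = new Parser();
        }
        return parser;
    }

    // Release the reference when finished so the garbage collector can
    // reclaim the memory promptly, rather than depending on it.
    public void releaseParser() {
        parser = null;
    }

    public boolean isAllocated() {
        return parser != null;
    }
}
```

If the feature that needs the parser is never used, its memory is never allocated, which matters on devices with very limited heap space.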
4. Configurations overview
The configuration defines the basic run-time environment as a set of core classes and a
specific JVM that run on specific types of devices. Currently, two configurations exist for
J2ME, though others may be defined in the future:
* Connected Limited Device Configuration (CLDC) is used specifically with the KVM for
16-bit or 32-bit devices with limited amounts of memory. This is the configuration (and the
virtual machine) used for developing small J2ME applications. Its size limitations make
CLDC more interesting and challenging (from a development point of view) than CDC.
CLDC is also the configuration that we will use for developing our drawing tool application.
An example of a small wireless device running small applications is a Palm hand-held
computer.
* Connected Device Configuration (CDC) is used with the C virtual machine (CVM) and is
used for 32-bit architectures requiring more than 2 MB of memory. An example of such a
device is a Net TV box.
J2ME profiles
What is a J2ME profile?
As mentioned earlier, a profile defines the type of device supported. The
Mobile Information Device Profile (MIDP), for example, defines classes for cellular phones.
It adds domain-specific classes to the J2ME configuration to define uses for similar devices.
Two profiles have been defined for J2ME and are built upon CLDC: KJava and MIDP. Both
KJava and MIDP are associated with CLDC and smaller devices. Profiles are built on top of
configurations. Because profiles are specific to the size of the device (amount of memory) on
which an application runs, certain profiles are associated with certain configurations.
A skeleton profile upon which you can create your own profile, the Foundation Profile, is
available for CDC.
Profile 1: KJava
KJava is Sun's proprietary profile and contains the KJava API. The KJava profile is built on
top of the CLDC configuration. The KJava virtual machine, KVM, accepts the same byte
codes and class file format as the classic J2SE virtual machine. KJava contains a Sun-specific
API that runs on the Palm OS. The KJava API has a great deal in common with the J2SE
Abstract Windowing Toolkit (AWT). However, because it is not a standard J2ME package,
its main package is com.sun.kjava. We'll learn more about the KJava API later in this tutorial
when we develop some sample applications.
Profile 2: MIDP
MIDP is geared toward mobile devices such as cellular phones and pagers. The MIDP, like
KJava, is built upon CLDC and provides a standard run-time environment that allows new
applications and services to be deployed dynamically on end user devices. MIDP is a
common, industry-standard profile for mobile devices that is not dependent on a specific
vendor. It is a complete and supported foundation for mobile application development. MIDP
contains the following packages, the first four of which are core CLDC packages, followed by
three MIDP-specific packages:
* java.lang
* java.io
* java.util
* javax.microedition.io
* javax.microedition.lcdui
* javax.microedition.midlet
* javax.microedition.rms
5 PROJECT DESCRIPTION
5.1 System Architecture
Information retrieval has been a major area of application of NLP, and consequently a
number of research projects, dealing with the various applications on NLP in IR, have taken
place throughout the world resulting in a large volume of publications. Lewis and Sparck
Jones (2013) comment that the generic challenge for NLP in the field of IR is whether the
necessary NLP of texts and queries is doable, and the specific challenges are whether non-
statistical and statistical data can be combined and whether data about individual documents
and whole files can be combined. They further comment that there are major challenges in
making the NLP technology operate effectively and efficiently and also in conducting
appropriate evaluation tests to assess whether and how far the approach works in an
environment of interactive searching of large text files. Feldman (2013) suggests that in order
to achieve success in IR, NLP techniques should be applied in conjunction with other
technologies, such as visualization, intelligent agents and speech recognition.
Arguing that syntactic phrases are more meaningful than statistically obtained word pairs,
and thus are more powerful for discriminating among documents, Narita and Ogawa (2012)
use a shallow syntactic processing instead of statistical processing to automatically identify
candidate phrasal terms from query texts. Comparing the performance of Boolean and natural
language searches, Paris and Tibbo (2013) found that in their experiment, Boolean searches
had better results than freestyle (natural language) searches. However, they concluded that
neither could be considered as the best for every query. In other words, their conclusion was
that different queries demand different techniques.
Variations in presenting subject matter greatly affect IR and hence linguistic variation of
document texts is one of the greatest challenges to IR. In order to investigate how
consistently newspapers choose words and concepts to describe an event, Lehtokangas &
Jarvelin (2011) chose articles on the same news from three Finnish newspapers. Their
experiment revealed that for short newswire the consistency was 83% and for long articles
47%. It was also revealed that the newspapers were very consistent in using concepts to
represent events, with a level of consistency varying between 92-97%.
Natural Language Interfaces
A natural language interface is one that accepts query statements or commands in
natural language and sends data to some system, typically a retrieval system, which then
results in appropriate responses to the commands or query statements. A natural language
interface should be able to translate the natural language statements into appropriate actions
for the system. A large number of natural language interfaces that work reasonably well in
narrow domains have been reported in the literature. Much of the efforts in natural language
interface design to date have focused on handling rather simple natural language queries. A
number of question answering systems are now being developed that aim to provide answers
to natural language questions, as opposed to documents containing information related to the
question. Such systems often use a variety of IE and IR operations using NLP tools and
techniques to get the correct answer from the source texts. Breck et al. (2012) report a
question answering system that uses techniques from knowledge representation, information
retrieval, and NLP. The authors claim that this combination enables domain independence
and robustness in the face of text variability, both in the question and in the raw text
documents used as knowledge sources. Research reported in the Question Answering (QA)
track of TREC (Text Retrieval Conferences) shows some interesting results. The basic
technology used by the participants in the QA track included several steps. First, cue
words/phrases like ‘who’ (as in ‘Who is the prime minister of Japan’) and ‘when’ (as in ‘When
did the Jurassic period end’) were identified to guess what kind of answer was needed; then a
small portion of the document collection was retrieved using standard text retrieval technology.
This was followed by a shallow parsing of the returned documents for identifying the
entities required for an answer. If no appropriate answer type was found, then the best-matching
passage was retrieved. This approach works well as long as the query types recognized by the
system have broad coverage, and the system can classify questions reasonably accurately. In
TREC-8, the first QA track of TREC, the most accurate QA systems could answer more than
2/3 of the questions correctly. In the second QA track (TREC-9), the best performing QA
system, the Falcon system from Southern Methodist University, was able to answer 65% of
the questions (Voorhees, 2000). These results are quite impressive in a domain-independent
question answering environment. However, the questions were still simple in the first two
QA tracks. In the future, QA track researchers will handle more complex questions whose
answers must be obtained from more than one document.
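The cue-word step described above can be sketched in Python. This is an illustrative toy, not the implementation used by any TREC participant; the cue table and answer-type names are assumptions:

```python
# Hypothetical cue-word table mapping question openers to expected
# answer types, as in the first step of the QA pipeline above.
CUE_WORDS = {
    "who": "PERSON",
    "when": "DATE",
    "where": "LOCATION",
    "how many": "NUMBER",
}

def classify_question(question: str) -> str:
    """Guess the expected answer type from the question's cue word."""
    q = question.lower()
    # Try longer cues first so "how many" is not shadowed by a shorter cue.
    for cue in sorted(CUE_WORDS, key=len, reverse=True):
        if q.startswith(cue):
            return CUE_WORDS[cue]
    return "UNKNOWN"
```

For example, classify_question("Who is the prime minister of Japan") returns "PERSON"; a real system would fall back to best-matching-passage retrieval when this returns "UNKNOWN".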
The Natural Language Processing Laboratory, Centre for Intelligent Information
Retrieval at the University of Massachusetts, distributes source codes and executables to
support IE system development efforts at other sites. Each module is designed to be used in a
domain-specific and task-specific customizable IE system. Available software includes
(Natural Language …, n.d.)
* MARMOT Text Bracketing Module, a text file translator which segments arbitrary
text blocks into sentences, applies low-level specialists such as date recognizers,
associates words with part-of-speech tags, and brackets the text into annotated noun
phrases, prepositional phrases, and verb phrases.
* BADGER Extraction Module, which analyses bracketed text and produces case frame
instantiations according to application-specific domain guidelines.
* CRYSTAL Dictionary Induction Module, which learns text extraction rules, suitable
for use by BADGER, from annotated training texts.
* ID3-S Inductive Learning Module, a variant on ID3 which induces decision trees on
the basis of training examples.
5.2 MODULE DESCRIPTION
Section Splitter
Section Filter
Text Tokenizer
Part-of-Speech (POS) Tagger
Noun Phrase Finder
UMLS Concept Finder
Negation Finder
Regular Expression-based Concept Finder
Sentence Splitter
N-Gram Tool
Classifier (e.g. Smoking Status Classifier)
5.2.1 Section Splitter
For this project, you will write the lexical analysis phase (i.e., the "scanner") of a simple
compiler for a subset of the language "Tubular". We will start with only one type of variable
("int"), basic math, and the print command to output results; basically it will be little more
than a standard calculator. Over the next two projects we will turn this into a working
compiler, and in the four projects following that, we will expand the functionality and
efficiency of the language. The program you turn in must load in the source file (as a
command-line argument) and process it line-by-line, removing whitespace and comments and
categorizing each word or symbol as a token. A token is a type of unit that appears in a
source file. You will then output (to standard out) a tokenized version of the file, as described
in more detail below. A pattern is a rule by which a token is identified. It will be up to you
as part of this project to identify the patterns associated with each token. A lexeme is an
instance of a pattern from a source file. For example, "42" is a lexeme that would be
categorized as the token STATIC_INT. On the other hand "my_var" is a lexeme that would
get identified as a token of the type ID. When multiple patterns match the text being
processed, choose the one that produces the longest lexeme that starts at the current position.
If two different patterns produce lexemes of the same length, choose the one that
comes first in the list above. For example, the string "print" might be incorrectly read as the
ID "pr" followed by the TYPE "int", but "print" is longer than "pr", so it should be chosen.
Likewise, the lexeme "print" could match either the pattern for COMMAND or the pattern for
ID, but COMMAND should be chosen since it comes first in the list in the table.
Each student must write this project on their own, with no help from other students or
any other individuals, though you may use whatever pre-existing web resources you like. I
will use your score on this project and your programming style as factors in assembling groups
for future projects. As such, it's well worth putting extra effort into this project. Your lexer
should be able to identify each of the following tokens that it encounters:
Token        Description
TYPE         Data types: currently just "int", but more types will be introduced in future projects.
COMMAND      Any built-in commands: currently just "print".
ID           A sequence beginning with a letter or underscore ('_'), followed by zero or more characters that may contain letters, numbers and underscores. Currently just variable names.
STATIC_INT   Any static integer. We will implement static floating point numbers in a future project, as well as other static types.
OPERATOR     Math operators: + - * / % ( ) = += -= *= /= %=
SEPARATOR    List separation: ,
ENDLINE      Signifies the end of a statement -- semicolon: ;
WHITESPACE   Any number of consecutive spaces, tabs, or newlines.
COMMENT      Everything on a line following a pound-sign, '#'.
UNKNOWN      An unknown character or a sequence that does not match any of the tokens above.
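The longest-lexeme rule and the pattern-priority tie-break described above can be sketched as follows. This is a simplified Python sketch covering only a few of the tokens; a full scanner would handle the complete list, including OPERATOR and SEPARATOR:

```python
import re

# Patterns in priority order: on equal-length matches, the earlier one wins.
TOKEN_PATTERNS = [
    ("TYPE",       r"int"),
    ("COMMAND",    r"print"),
    ("ID",         r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STATIC_INT", r"[0-9]+"),
    ("WHITESPACE", r"[ \t\n]+"),
    ("COMMENT",    r"#[^\n]*"),
]

def next_token(text, pos=0):
    """Return (token, lexeme) for the longest match starting at pos."""
    best = None
    for token, pattern in TOKEN_PATTERNS:
        m = re.match(pattern, text[pos:])
        # Keep only strictly longer matches, so ties favour earlier patterns.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (token, m.group())
    return best if best else ("UNKNOWN", text[pos])
```

On "print", COMMAND and ID both match five characters, so COMMAND wins by priority; on "printer", ID's seven-character match beats COMMAND's five, exactly as the rules above require.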
5.2.2 Section Filter
The fundamental component of a tagger is a POS (part-of-speech) tagset, or list of all
the word categories that will be used in the tagging process. The tagsets used to annotate
large corpora in the past have usually been fairly extensive. The pioneering Brown Corpus
distinguishes 87 simple tags. Subsequent projects have tended to elaborate the Brown Corpus
tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the
Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English
197 tags. The rationale behind developing such large, richly articulated tagsets is to approach
"the ideal of providing distinct codings for all classes of words having distinct grammatical
behaviour". Choosing an appropriate set of tags for an annotation system has direct influence
on the accuracy and the usefulness of the tagging system. The larger the tag set, the lower the
accuracy of the tagging system since the system has to be capable of making finer
distinctions. Conversely, the smaller the tag set, the higher the accuracy. However, a very
small tag set tends to make the tagging system less useful since it provides less information.
So, there is a trade-off here. Another issue in tag-set design is the consistency of the tagging
system. Words of the same meaning and same functions should be tagged with the same tags.
5.2.3 Text Tokenizer
Machine Learning methods usually require supervised data to learn a concept.
Labelling data is time consuming, tedious, error prone and expensive. The research
community has looked at semi-supervised and unsupervised learning techniques in order to
obviate the need of labelled data to a certain extent. In addition to the above mentioned
problems with labelled data, all examples are not equally informative or equally easy to label.
For instance, the examples similar to what the learner has already seen are not as useful as
new examples. Moreover, different examples may require different amounts of labelling effort
from the user; for instance, a longer sentence is likely to have more ambiguities and hence would be
harder to parse manually. Active learning is the task of reducing the amount of labelled data
required to learn the target concept by querying the user for labels for the most informative
examples so that the concept is learnt with fewer examples. An active learning problem
setting typically consists of a small set of labelled examples and a large set of unlabelled
examples. An initial classifier is trained on the labelled examples and/or the unlabelled
examples. From the pool of unlabelled examples, selective sampling is used to create a small
subset of examples for the user to label. This iterative process of training, selective sampling
and annotation is repeated until convergence.
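The train/sample/annotate loop just described can be sketched schematically in Python. The learner, uncertainty measure, and labelling oracle below are illustrative stand-ins, not any particular system's components:

```python
def active_learning(labelled, unlabelled, oracle, train, uncertainty,
                    batch_size=2, iterations=3):
    """Iterate: train, selectively sample a batch of unlabelled examples,
    ask the user (oracle) to label it, and add it to the training pool."""
    for _ in range(iterations):
        model = train(labelled)                  # train on current labels
        # Selective sampling: rank the unlabelled pool by uncertainty.
        pool = sorted(unlabelled, key=lambda x: -uncertainty(model, x))
        batch, unlabelled = pool[:batch_size], pool[batch_size:]
        labelled = labelled + [(x, oracle(x)) for x in batch]
    return train(labelled)
```

In practice the loop stops at a desired performance level rather than a fixed iteration count, as noted below.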
Software-oriented mechanisms are less expensive and more flexible in filter lookups
when compared with their hardware-centric counterparts. Such mechanisms are
abundant, commonly involving efficient algorithms for quick packet classification with
an aid of caching or hashing. Their classification speeds rely on efficiency in search over
the rule set using the keys constituted by corresponding header fields. Several
representative software classification techniques are reviewed in sequence.
5.2.4 Part-of-Speech (POS) Tagger
An active learning experiment is usually described by five properties: number of
bootstrap examples, batch size, supervised learner, data set and a stopping criterion. The
supervised learner is trained on the bootstrap examples which are labelled by the user
initially. Batch size is the number of examples that are selectively sampled from the
unlabelled pool and added to the training pool in each iteration. The stopping criterion can be
either a desired performance level or the number of iterations. Performance is evaluated on
the test set in each iteration.
Active learners are usually evaluated by plotting a learning curve of performance vs.
number of labelled examples as shown in figure 1. Success of an active learner is
demonstrated by showing that it achieves better performance than a traditional learner given the
same number of labelled examples; i.e., for achieving the desired performance, the active
learner needs fewer examples than the traditional learner.
5.2.5 Noun Phrase Finder
Active learning aims at reducing the number of examples required to achieve the desired
accuracy by selectively sampling the examples for the user to label and training the classifier
with them. Several different strategies for selective sampling have been explored in the literature. In this
review, we present some of the selective sampling techniques used for active learning in
NLP. Uncertainty-based sampling selects examples that the model is least certain about and
presents them to the user for correction/verification. A lot of work on active learning has used
uncertainty-based sampling. In this section, we describe some of this work.
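A minimal sketch of uncertainty-based sampling, using the entropy of the model's predicted class distribution as the uncertainty score (the function names and data layout are illustrative assumptions):

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a predicted class distribution;
    higher entropy means the model is less certain about the example."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def most_uncertain(predictions, k=1):
    """Pick the k examples whose predictions have the highest entropy.
    `predictions` is a list of (example, class_distribution) pairs."""
    ranked = sorted(predictions, key=lambda pair: -entropy(pair[1]))
    return [example for example, _ in ranked[:k]]
```

A prediction of [0.5, 0.5] has entropy 1.0 bit and is sampled for correction/verification before a confident [0.9, 0.1].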
5.2.6 UMLS Concept Finder
Parsers that parameterize over wider scopes are generally more accurate than edge-
factored models. For graph-based non-projective parsers, wider factorizations have so far
implied large increases in the computational complexity of the parsing problem. This paper
introduces a “crossing-sensitive” generalization of a third-order factorization that trades off
complexity in the model structure (i.e., scoring with features over multiple edges) with
complexity in the output structure (i.e., producing crossing edges). Under this model, the
optimal 1-Endpoint-Crossing tree can be found in O(n^4) time, matching the asymptotic run-
time of both the third-order projective parser and the edge-factored 1-Endpoint-Crossing
parser. The crossing-sensitive third-order parser is significantly more accurate than the third-
order projective parser under many experimental settings and significantly less accurate on
none.
5.2.7 Negation Finder
Although many NLP systems are moving toward entity-based processing, most still identify
important phrases using classical keyword-based approaches. To bridge this gap, we
introduce the task of entity salience: assigning a relevance score to each entity in a document.
We demonstrate how a labeled corpus for the task can be automatically generated from a
corpus of documents and accompanying abstracts. We then show how a classifier with
features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally,
we outline initial experiments on further improving accuracy by leveraging background
knowledge about the relationships between entities.
5.2.8 Regular Expression-based Concept Finder
Frame semantics is a linguistic theory that has been instantiated for English in the FrameNet
lexicon. We solve the problem of frame-semantic parsing using a two-stage statistical model
that takes lexical targets (i.e., content words and phrases) in their sentential contexts and
predicts frame-semantic structures. Given a target in context, the first stage disambiguates it
to a semantic frame. This model employs latent variables and semi-supervised learning to
improve frame disambiguation for targets unseen at training time. The second stage finds the
target's locally expressed semantic arguments. At inference time, a fast exact dual
decomposition algorithm collectively predicts all the arguments of a frame at once in order to
respect declaratively stated linguistic constraints, resulting in qualitatively better structures
than naïve local predictors. Both components are feature-based and discriminatively trained
on a small set of annotated frame-semantic parses. On the SemEval 2007 benchmark dataset,
the approach, along with a heuristic identifier of frame-evoking targets, outperforms the prior
state of the art by significant margins. Additionally, we present experiments on the much
larger FrameNet 1.5 dataset. We have released our frame-semantic parser as open-source
software.
5.2.9 Sentence Splitter
Mobile is poised to become the predominant platform over which people are
accessing the World Wide Web. Recent developments in speech recognition and
understanding, backed by high bandwidth coverage and high quality speech signal acquisition
on smartphones and tablets are presenting the users with the choice of speaking their web
search queries instead of typing them. A critical component of a speech recognition system
targeting web search is the language model. The chapter presents an empirical exploration of
the google.com query stream with the end goal of high quality statistical language modeling
for mobile voice search. Our experiments show that after text normalization the query stream
is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1%
using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high
orders such as n=5/4, respectively. A more careful analysis shows that a significantly larger
vocabulary (approx. 10 million words) may be required to guarantee at most 1% out-of-
vocabulary rate for a large percentage (95%) of users. Using large scale, distributed language
models can improve performance significantly---up to 10% relative reductions in word-error-
rate over conventional models used in speech recognition. We also find that the query stream
is non-stationary, which means that adding more past training data beyond a certain point
provides diminishing returns, and may even degrade performance slightly. Perhaps less
surprisingly, we have shown that locale matters significantly for English query data across
USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs,
we successfully build large-scale discriminative N-gram language models and derive small but
significant gains in recognition performance.
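The out-of-vocabulary rate and n-gram hit ratio discussed above can be computed as follows; this is a minimal Python sketch in which the vocabulary and the model's n-gram inventory are assumed inputs:

```python
def oov_rate(query_tokens, vocabulary):
    """Fraction of query tokens not covered by the vocabulary."""
    if not query_tokens:
        return 0.0
    unknown = sum(1 for tok in query_tokens if tok not in vocabulary)
    return unknown / len(query_tokens)

def ngram_hit_ratio(tokens, n, known_ngrams):
    """Fraction of the stream's n-grams found in the model's inventory."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in known_ngrams) / len(grams)
```

Text normalization would be applied to the query stream before these measurements, which is what brings the rates down to the levels reported above.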
5.2.10 N-Gram Tool
Empty categories (EC) are artificial elements in Penn Treebanks motivated by the
government-binding (GB) theory to explain certain language phenomena such as pro-drop.
ECs are ubiquitous in languages like Chinese, but they are tacitly ignored in most machine
translation (MT) work because of their elusive nature. In this paper we present a
comprehensive treatment of ECs by first recovering them with a structured MaxEnt model
with a rich set of syntactic and lexical features, and then incorporating the predicted ECs into
a Chinese-to-English machine translation task through multiple approaches, including the
extraction of EC-specific sparse features. We show that the recovered empty categories not
only improve the word alignment quality, but also lead to significant improvements in a
large-scale state-of-the-art syntactic MT system.
5.2.11 Classifier (e.g. Smoking Status Classifier)
Many highly engineered NLP systems address the benchmark tasks using linear statistical
models applied to task-specific features. In other words, the researchers themselves discover
intermediate representations by engineering ad-hoc features. These features are often derived
from the output of pre-existing systems, leading to complex runtime dependencies. This
approach is effective because researchers leverage a large body of linguistic knowledge. On
the other hand, there is a great temptation to over-engineer the system to optimize its
performance on a particular benchmark at the expense of the broader NLP goals.
In this contribution, we describe a unified NLP system that achieves excellent performance
on multiple benchmark tasks by discovering its own internal representations. We have
avoided engineering features as much as possible and we have therefore ignored a large body
of linguistic knowledge. Instead we reach state-of-the-art performance levels by transferring
intermediate representations discovered on massive unlabelled datasets. We call this
approach “almost from scratch” to emphasize this reduced (but still important) reliance on a
priori NLP knowledge.
8.1 INTRODUCTION
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. In fact, testing is the one step in the software
engineering process that could be viewed as destructive rather than constructive.
A strategy for software testing integrates software test case design methods into a
well-planned series of steps that result in the successful construction of software. Testing is
the set of activities that can be planned in advance and conducted systematically. The
underlying motivation of program testing is to affirm software quality with methods that can
be applied economically and effectively to both large and small-scale systems.
8.2. STRATEGIC APPROACH TO SOFTWARE TESTING
The software engineering process can be viewed as a spiral. Initially system
engineering defines the role of software and leads to software requirement analysis where the
information domain, functions, behavior, performance, constraints and validation criteria for
software are established. Moving inward along the spiral, we come to design and finally to
coding. To develop computer software we spiral in along streamlines that decrease the level
of abstraction on each turn.
A strategy for software testing may also be viewed in the context of the spiral. Unit
testing begins at the vertex of the spiral and concentrates on each unit of the software as
implemented in source code. Testing will progress by moving outward along the spiral to
integration testing, where the focus is on the design and the construction of the software
architecture. Taking another turn outward on the spiral, we encounter validation testing,
where requirements established as part of software requirements analysis are validated
against the software that has been constructed. Finally we arrive at system testing, where the
software and other system elements are tested as a whole.
8.3. UNIT TESTING
Unit testing focuses verification effort on the smallest unit of software design, the module. The unit
testing performed here is white-box oriented, and for some modules the steps are conducted in parallel.
1. WHITE BOX TESTING
This type of testing ensures that
All independent paths have been exercised at least once
All logical decisions have been exercised on their true and false sides
All loops are executed at their boundaries and within their operational bounds
All internal data structures have been exercised to assure their validity.
To follow the concept of white box testing, we have tested each form we have created
independently to verify that the data flow is correct, all conditions are exercised to check their
validity, and all loops are executed on their boundaries.
2. BASIC PATH TESTING
The established technique of a flow graph with cyclomatic complexity was used to derive test cases for all the functions. The main steps in deriving test cases were:
Use the design of the code and draw the corresponding flow graph.
Determine the cyclomatic complexity of the resultant flow graph, using the formula:
V(G) = E - N + 2, or
V(G) = P + 1, or
V(G) = number of regions,
where V(G) is the cyclomatic complexity,
E is the number of edges,
N is the number of flow graph nodes,
P is the number of predicate nodes.
Determine the basis set of linearly independent paths.
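A small worked example of the formulas above (the flow graph figures here are hypothetical, not taken from this project's code):

```python
def cyclomatic_complexity(edges, nodes):
    """V(G) = E - N + 2 for a connected flow graph."""
    return edges - nodes + 2

def cyclomatic_from_predicates(predicates):
    """Equivalent form V(G) = P + 1."""
    return predicates + 1
```

A connected flow graph with E = 9 edges and N = 7 nodes gives V(G) = 4; the same graph with P = 3 predicate nodes gives P + 1 = 4, i.e., four linearly independent paths must be covered by test cases.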
3. CONDITIONAL TESTING
In this part of the testing, each of the conditions was tested for both its true and false outcomes, and all the resulting paths were tested, so that each path that may be generated by a particular condition is traced to uncover any possible errors.
4. DATA FLOW TESTING
This type of testing selects the path of the program according to the location of definition and
use of variables. This kind of testing was used only when some local variables were declared.
The definition-use chain method was used in this type of testing. These were particularly
useful in nested statements.
5. LOOP TESTING
In this type of testing, all the loops are tested at all the possible limits. The following exercise was adopted for all loops:
All the loops were tested at their limits, just above them and just below them.
All the loops were skipped at least once.
For nested loops, the innermost loop was tested first, then working outwards.
For concatenated loops, the values of dependent loops were set with the help of the connected loop.
Unstructured loops were resolved into nested loops or concatenated loops and tested as above.
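The loop-limit cases listed above can be generated mechanically. A minimal sketch (the function name and the candidate set are illustrative, not part of this project's test harness):

```python
def loop_test_values(lower, upper):
    """Iteration counts to exercise a loop bounded by [lower, upper]:
    zero passes (skip), the limits themselves, and just inside/outside."""
    candidates = {0, lower - 1, lower, lower + 1, upper - 1, upper, upper + 1}
    # Negative iteration counts cannot occur, so drop them.
    return sorted(c for c in candidates if c >= 0)
```

For a loop bounded between 1 and 3 iterations, this yields the counts 0 through 4: the skip case, both limits, and one step beyond each limit.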
Each unit has been separately tested by the development team itself and all the inputs have been validated.
System Security
The protection of computer-based resources, including hardware, software, data, procedures and people, against unauthorized use or natural
disaster is known as System Security.
System Security can be divided into four related issues:
* Security
* Integrity
* Privacy
* Confidentiality
SYSTEM SECURITY refers to the technical innovations and procedures applied to the hardware and operation systems to protect against deliberate or accidental damage from a defined threat.
DATA SECURITY is the protection of data from loss, disclosure, modification and destruction.
SYSTEM INTEGRITY refers to the proper functioning of hardware and programs, appropriate physical security and safety against external threats such as eavesdropping and wiretapping.
PRIVACY defines the rights of the user or organizations to determine what information they are willing to share with or accept from others and how the organization can be protected against unwelcome, unfair or excessive dissemination of information about it.
CONFIDENTIALITY is a special status given to sensitive information in a database to minimize the possible invasion of privacy. It is an attribute of information that characterizes its need for protection.
9.3 SECURITY SOFTWARE
System security refers to various validations on data in the form of checks and controls to prevent the system from failing. It is always important to ensure that only valid data is entered and only valid operations are performed on the system. The system employs two types of checks and controls:
CLIENT SIDE VALIDATION
Various client side validations are used to ensure on the client side that only valid data is entered. Client side validation saves server time and load to handle invalid data. Some checks imposed are:
VBScript is used to ensure that required fields are filled with suitable data only. Maximum lengths of the fields of the forms are appropriately defined.
Forms cannot be submitted without filling up the mandatory data so that manual mistakes of submitting empty fields that are mandatory can be sorted out at the client side to save the server time and load.
Tab-indexes are set according to the need and taking into account the ease of user while working with the system.
SERVER SIDE VALIDATION
Some checks cannot be applied at the client side. Server side checks are necessary to save the system from failing and to inform the user that some invalid operation has been performed or that the performed operation is restricted. Some of the server side checks imposed are:
Server side constraints have been imposed to check the validity of primary and foreign keys. A primary key value cannot be duplicated; any attempt to duplicate a primary key value results in a message informing the user about those values. Forms using foreign keys can be updated only with existing foreign key values.
The user is informed through appropriate messages about successful operations or exceptions occurring at the server side.
Various Access Control Mechanisms have been built so that one user may not interfere with another. Access permissions for various types of users are controlled according to the organizational structure. Only permitted users can log on to the system and have access according to their category. User names, passwords and permissions are controlled on the server side.
Using server side validation, constraints on several restricted operations are imposed.
We use two orthogonal methods to utilize automatically detected human
attributes to significantly improve content-based face image retrieval. Attribute-enhanced
sparse coding exploits the global structure and uses several human attributes to construct
semantic-aware code words in the offline stage. Attribute-embedded inverted indexing
further considers the local attribute signature of the query image and still ensures efficient
retrieval in the online stage. The experimental results show that using the code words
generated by the proposed coding scheme, we can reduce the quantization error and
achieve salient gains in face retrieval on two public datasets; the proposed indexing scheme
can be easily integrated into inverted index, thus maintaining a scalable framework. During
the experiments, we also discover certain informative attributes for face retrieval across
different datasets and these attributes are also promising for other applications. Current
methods treat all attributes as equal. We will investigate methods to dynamically decide the
importance of the attributes and further exploit the contextual relationships between them.