Speech Converter Project Report
1. INTRODUCTION
Machine Translation is a great example of how cutting edge research and world class
infrastructure come together at Google. We focus our research efforts towards developing
statistical translation techniques that improve with more data and generalize well to new
languages. Our large scale computing infrastructure allows us to rapidly experiment with new
models trained on web-scale data to significantly improve translation quality. This cutting-
edge research backs the translations served at translate.google.com, allowing our users to
translate text, web pages and even speech. Deployed within a wide range of Google services
like GMail, Books, Android and web search, Google Translate is a high impact, research
driven product that bridges the language barrier and makes it possible to explore the
multilingual web in 63 languages. Exciting research challenges abound as we pursue human-
quality translation and develop machine translation systems for new languages.
In this work we present two extensions to the well-known dynamic programming beam
search in phrase-based statistical machine translation (SMT), aiming at increased efficiency
of decoding by minimizing the number of language model computations and hypothesis
expansions. Our results show that language model based pre-sorting yields a small
improvement in translation quality and a speedup by a factor of 2. Two look-ahead methods
are shown to further increase translation speed by a factor of 2 without changing the search
space and a factor of 4 with the side-effect of some additional search errors. We compare our
approach with Moses and observe the same performance, but a substantially better trade-off
between translation quality and speed. At a speed of roughly 70 words per second, Moses
reaches 17.2% BLEU, whereas our approach yields 20.0% with identical models.
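The pre-sorting idea can be sketched in a few lines. The phrase table, the log-probabilities and the unigram "language model" below are invented stand-ins for the real models, so this is an illustration of the principle rather than the authors' implementation:

```python
import heapq

# Toy phrase table (source phrase -> [(target, translation log-prob)]) and a
# unigram "language model" -- both invented stand-ins for the real models.
PHRASE_TABLE = {
    "das haus": [("the house", -0.1), ("the home", -0.7), ("the building", -1.2)],
}
LM_LOGPROB = {"the": -1.0, "house": -2.0, "home": -2.5, "building": -3.0}

def lm_score(phrase):
    # Unknown words get a large penalty.
    return sum(LM_LOGPROB.get(w, -10.0) for w in phrase.split())

def presorted_candidates(source_phrase, beam_size=2):
    """Score every translation option with translation + language-model
    log-probability, then keep only the top beam_size.  Options that fall
    outside the beam are never expanded, which is where the speedup comes
    from: the language model is consulted once per option, up front."""
    scored = [(t_logp + lm_score(tgt), tgt)
              for tgt, t_logp in PHRASE_TABLE[source_phrase]]
    return heapq.nlargest(beam_size, scored)
```

With a beam of 2, "the building" is pruned before any hypothesis expansion takes place.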
When trained on very large parallel corpora, the phrase table component of a machine
translation system grows to consume vast computational resources. In this paper, we
introduce a novel pruning criterion that places phrase table pruning on a sound theoretical
foundation. Systematic experiments on four language pairs under various data conditions
show that our principled approach is superior to existing ad hoc pruning methods. We
propose an unsupervised method for clustering the translations of a word, such that the
translations in each cluster share a common semantic sense. Words are assigned to clusters
based on their usage distribution in large monolingual and parallel corpora using the soft K-
Means algorithm. In addition to describing our approach, we formalize the task of translation
sense clustering and describe a procedure that leverages WordNet for evaluation. By
comparing our induced clusters to reference clusters generated from WordNet, we
demonstrate that our method effectively identifies sense-based translation clusters and
benefits from both monolingual and parallel corpora. Finally, we describe a method for
annotating clusters with usage examples.
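The clustering step might be sketched as follows. The two-dimensional "usage vectors", the starting centers and the separation into two senses are all invented for illustration; real usage distributions would be high-dimensional counts drawn from the monolingual and parallel corpora:

```python
import math

def soft_kmeans(points, centers, beta=2.0, iters=20):
    """Minimal soft K-means: each point gets a soft cluster membership
    proportional to exp(-beta * squared distance).  In the setting above,
    the points would be usage-distribution vectors, one per translation."""
    resp = []
    for _ in range(iters):
        # E-step: soft responsibility of each cluster for each point
        resp = []
        for p in points:
            w = [math.exp(-beta * sum((a - b) ** 2 for a, b in zip(p, c)))
                 for c in centers]
            z = sum(w)
            resp.append([x / z for x in w])
        # M-step: move each center to the responsibility-weighted mean
        for j in range(len(centers)):
            tot = sum(r[j] for r in resp)
            centers[j] = tuple(sum(r[j] * p[d] for r, p in zip(resp, points)) / tot
                               for d in range(len(points[0])))
    return centers, resp

# Two well-separated "senses": translations whose usage vectors sit near
# (0, 0) versus near (5, 5).  K-means is sensitive to initialization, so the
# starting centers are passed in explicitly here.
pts = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
centers, resp = soft_kmeans(pts, centers=[pts[0], pts[3]])
```

After a few iterations, the first two translations share one cluster and the last two share the other, with near-hard memberships.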
Our Contributions
We present a simple and effective infrastructure for domain adaptation for statistical
machine translation (MT). To build MT systems for different domains, it trains, tunes and
deploys a single translation system that is capable of producing adapted domain translations
and preserving the original generic accuracy at the same time. The approach unites automatic
domain detection and domain model parameterization into one system. Experimental results on
20 language pairs demonstrate its viability. Translating compounds is an important problem
in machine translation. Since many compounds have not been observed during training, they
pose a challenge for translation systems. Previous decompounding methods have often been
restricted to a small set of languages as they cannot deal with more complex compound
forming processes. We present a novel and unsupervised method to learn the compound parts
and morphological operations needed to split compounds into their compound parts. The
method uses a bilingual corpus to learn the morphological operations required to split a
compound into its parts. Furthermore, monolingual corpora are used to learn and filter the set
of compound part candidates. We evaluate our method within a machine translation task and
show significant improvements for various languages to show the versatility of the approach.
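A greatly simplified sketch of the splitting step, using a fixed monolingual word list in place of the learned compound parts and morphological operations (the German words below are illustrative):

```python
# Hypothetical set of known compound parts, standing in for the part
# candidates learned and filtered from monolingual corpora.
PARTS = {"haus", "tür", "schlüssel"}

def split_compound(word, parts=PARTS):
    """Greedy recursive splitter: return the parts of `word`, or None if it
    cannot be built from known parts.  The real method would also apply
    learned morphological operations (e.g. dropping linking elements)."""
    if word in parts:
        return [word]
    for i in range(1, len(word)):
        left, rest = word[:i], word[i:]
        if left in parts:
            tail = split_compound(rest, parts)
            if tail is not None:
                return [left] + tail
    return None

print(split_compound("haustür"))  # ['haus', 'tür']
```

A word outside the part vocabulary simply fails to split, which is the case the bilingual learning step is designed to reduce.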
1.1 Overview
Machine Translation is an important technology for localization, and is particularly
relevant in a linguistically diverse country like India. In this document, we provide a brief
survey of Machine Translation in India. Human translation in India is a rich and ancient
tradition. Works of philosophy, arts, mythology, religion, science and folklore have been
translated among the ancient and modern Indian languages. Numerous classic works of art,
ancient, medieval and modern, have also been translated between European and Indian
languages since the 18th century.
Problem Statement
In the current era, human translation finds application mainly in the administration,
media and education, and to a lesser extent, in business, arts and science and technology.
India is linguistically rich: it has 18 constitutional languages, which are written in
10 different scripts. Hindi is the official language of the Union. English is very widely used in
the media, commerce, science and technology and education. Many of the states have their
own regional language, which is either Hindi or one of the other constitutional languages.
Only about 5% of the population speaks English. In such a situation, there is a big market for
translation between English and the various Indian languages. Currently, this translation is
essentially manual. Use of automation is largely restricted to word processing. Two specific
examples of high volume manual translation are the translation of news from English into local
languages, and the translation of annual reports of government departments and public sector
units among English, Hindi and the local language.
As is clear from above, the market is largest for translation from English into Indian
languages, primarily Hindi. Hence, it is no surprise that a majority of the Indian Machine
Translation (MT) systems are for English-Hindi translation. As is well known, natural
language processing presents many challenges, of which the biggest is the inherent ambiguity
of natural language. MT systems have to deal with ambiguity, and various other NL
phenomena. In addition, the linguistic diversity between the source and target language
makes MT a bigger challenge. This is particularly true of widely divergent languages such as
English and Indian languages.
The major structural difference between English and Indian languages can be
summarized as follows. English is a highly positional language with rudimentary
morphology, and default sentence structure as SVO. Indian languages are highly inflectional,
with a rich morphology, relatively free word order, and default sentence structure as SOV. In
addition, there are many stylistic differences. For example, it is common to see very long
sentences in English, using abstract concepts as the subjects of sentences, and stringing
several clauses together (as in this sentence!). Such constructions are not natural in Indian
languages, and present major difficulties in producing good translations. As is recognized the
world over, with the current state of the art in MT, it is not possible to have Fully Automatic,
High Quality, and General-Purpose Machine Translation. Practical systems need to handle
ambiguity and the other complexities of natural language processing, by relaxing one or more
of the above dimensions.
1.2 LITERATURE SURVEY
Natural Language Processing (NLP) is an area of research and application that explores how
computers can be used to understand and manipulate natural language text or speech to do
useful things. NLP researchers aim to gather knowledge on how human beings understand
and use language so that appropriate tools and techniques can be developed to make
computer systems understand and manipulate natural languages to perform the desired tasks.
The foundations of NLP lie in a number of disciplines, viz. computer and information
sciences, linguistics, mathematics, electrical and electronic engineering, artificial intelligence
and robotics, psychology, etc. Applications of NLP include a number of fields of studies,
such as machine translation, natural language text processing and summarization, user
interfaces, multilingual and cross language information retrieval (CLIR), speech recognition,
artificial intelligence and expert systems, and so on.
One important area of application of NLP that is relatively new and has not been covered in
the previous ARIST chapters on NLP has become quite prominent due to the proliferation of
the world wide web and digital libraries. Several researchers have pointed out the need for
appropriate research in facilitating multi- or cross-lingual information retrieval, including
multilingual text processing and multilingual user interface systems, in order to exploit the
full benefit of the www and digital libraries.
Scope
Several ARIST chapters have reviewed the field of NLP. The most recent ones
include that by Warner in 1987, and Haas in 1996. Reviews of literature on large-scale NLP
systems, as well as the various theoretical issues have also appeared in a number of
publications (see for example, Jurafsky & Martin, 2000; Manning & Schutze, 1999; Mani &
Maybury, 1999; Sparck Jones, 1999; Wilks, 1996). Smeaton (1999) provides a good
overview of the past research on the applications of NLP in various information retrieval
tasks. Several ARIST chapters have appeared on areas related to NLP, such as on machine-
readable dictionaries (Amsler, 1984; Evans, 1989), speech synthesis and recognition (Lange,
1993), and cross-language information retrieval (Oard & Diekema, 1998). Research on NLP
is regularly published in a number of conferences such as the annual proceedings of ACL
(Association of Computational Linguistics) and its European counterpart EACL, biennial
proceedings of the International Conference on Computational Linguistics (COLING), annual
proceedings of the Message Understanding Conferences (MUCs), Text Retrieval Conferences
(TRECs) and ACM-SIGIR (Association of Computing Machinery – Special Interest Group
on Information Retrieval) conferences. The most prominent journals reporting NLP research
are Computational Linguistics and Natural Language Engineering. Articles reporting NLP
research also appear in a number of information science journals such as Information
Processing and Management, Journal of the American Society for Information Science and
Technology, and Journal of Documentation. Several researchers have also conducted domain-
specific NLP studies and have reported them in journals specifically dealing with the domain
concerned, such as the International Journal of Medical Informatics and Journal of Chemical
Information and Computer Science.
Beginning with the basic issues of NLP, this chapter aims to chart the major research
activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including:
i. natural language text processing systems – text summarization, information
extraction, information retrieval, etc., including domain-specific applications;
ii. natural language interfaces;
iii. NLP in the context of www and digital libraries; and
iv. evaluation of NLP systems.
Linguistic research in information retrieval has not been covered in this review, since
this is a huge area and has been dealt with separately in this volume by David Blair.
Similarly, NLP issues related to the information retrieval tools (search engines, etc.) for web
search are not covered in this chapter since a separate chapter on indexing and retrieval for
the Web has been written in this volume by Edie Rasmussen. Tools and techniques developed
for building NLP systems have been discussed in this chapter along with the specific areas of
applications for which they have been built. Although machine translation (MT) is an
important part, and in fact the origin, of NLP research, this paper does not cover this topic
with sufficient detail since this is a huge area and demands a separate chapter on its own.
Similarly, cross-language information retrieval (CLIR), although is a very important and big
area in NLP research, is not covered in great detail in this chapter. A separate chapter on
CLIR research appeared in ARIST (Oard & Diekema, 1998). However, MT and CLIR have
become two important areas of research in the context of the www and digital libraries. This
chapter reviews some works on MT and CLIR in the context of NLP and IR in digital
libraries and www. Artificial Intelligence techniques, including neural networks etc., used in
NLP have not been included in this chapter.
Some Theoretical Developments
Previous ARIST chapters (Haas, 1996; Warner, 1987) described a number of
theoretical developments that have influenced research in NLP. The most recent theoretical
developments can be grouped into four classes: (i) statistical and corpus-based methods in
NLP, (ii) recent efforts to use WordNet for NLP research, (iii) the resurgence of interest in
finite-state and other computationally lean approaches to NLP, and (iv) the initiation of
collaborative projects to create large grammar and NLP tools. Statistical methods are used in
NLP for a number of purposes, e.g., for word sense disambiguation, for generating grammars
and parsing, for determining stylistic evidences of authors and speakers, and so on. Charniak
(1995) points out that 90% accuracy can be obtained in assigning part-of-speech tag to a
word by applying simple statistical measures. Jelinek (1999) is a widely cited source on the
use of statistical methods in NLP, especially in speech processing. Rosenfeld (2000) reviews
statistical language models for speech processing and argues for a Bayesian approach to the
integration of linguistic theories and data. Mihalcea & Moldovan (1999) mention that although
thus far statistical approaches have been considered the best for word sense disambiguation,
they are useful only in a small set of texts.
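Charniak's observation can be illustrated with the simplest such statistical measure, a most-frequent-tag baseline. The tiny tagged corpus below is invented for the example; a real experiment would train on a corpus such as the Brown corpus:

```python
from collections import Counter, defaultdict

# A tiny hypothetical tagged corpus of (word, part-of-speech) pairs.
TAGGED = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB"),
          ("the", "DET"), ("run", "NOUN"), ("dogs", "NOUN"),
          ("run", "VERB"), ("run", "VERB")]

counts = defaultdict(Counter)
for word, tag in TAGGED:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NOUN"):
    """Emit each word's most frequent training tag; on real corpora this
    simple statistic already reaches roughly 90% token accuracy."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default
```

Ambiguous words such as "run" (seen twice as a verb, once as a noun) are resolved purely by the counts.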
Figure 1 Language Translation Using NLP
They propose the use of WordNet to improve the results of statistical analyses of
natural language texts. WordNet is an online lexical reference system developed at Princeton
University. This is an excellent NLP tool containing English nouns, verbs, adjectives and
adverbs organized into synonym sets, each representing one underlying lexical concept.
Details of WordNet are available in Fellbaum (1998) and on the web
(http://www.cogsci.princeton.edu/~wn/). WordNet is now used in a number of NLP research
and applications. One of the major applications of WordNet in NLP has been in Europe with
the formation of EuroWordNet in 1996. EuroWordNet is a multilingual database with WordNets
for several European languages including Dutch, Italian, Spanish, German, French, Czech
and Estonian, structured in the same way as the WordNet for English
(http://www.hum.uva.nl/~ewn/). The finite-state automaton is the mathematical device used
to implement regular expressions – the standard notation for characterizing text sequences.
Variations of automata such as finite-state transducers, Hidden Markov Models, and n-gram
grammars are important components of speech recognition and speech synthesis,
spell-checking, and information extraction, which are important applications of NLP. Different
applications of the Finite State methods in NLP have been discussed by Jurafsky & Martin
(2000), Kornai (1999) and Roche & Schabes (1997). The work of NLP researchers has been
greatly facilitated by the availability of large-scale grammar for parsing and generation.
Researchers can get access to large-scale grammars and tools through several websites, for
example Lingo (http://lingo.stanford.edu), Computational Linguistics & Phonetics
(http://www.coli.uni-sb.de/software.phtml), and Parallel grammar project
(http://www.parc.xerox.com/istl/groups/nltt/pargram/).
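As a minimal illustration of the finite-state machinery mentioned above, the sketch below hand-codes a deterministic finite automaton for the regular language (ab)+ and checks it against the equivalent regular expression; the language itself is chosen arbitrarily for the example:

```python
import re

# Transition table of a DFA recognizing (ab)+ -- the kind of finite-state
# device that regex engines compile their patterns into.
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
ACCEPTING = {2}

def accepts(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

# The hand-built DFA and the regular expression agree on every input:
for s in ["ab", "abab", "aba", "", "ba"]:
    assert accepts(s) == bool(re.fullmatch(r"(ab)+", s))
```

The same table-driven pattern, extended with output symbols or probabilities, gives finite-state transducers and Hidden Markov Models respectively.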
Another significant development in recent years is the formation of various national
and international consortia and research groups that facilitate research in NLP and help
share expertise. LDC (Linguistic Data Consortium) (http://www.ldc.upenn.edu/) at the
University of Pennsylvania is a typical example that creates, collects and distributes speech
and text databases, lexicons, and other resources for research and development among
universities, companies and government research laboratories. The Parallel Grammar project
is another example of international cooperation. This is a collaborative effort involving
researchers from Xerox PARC in California, the University of Stuttgart and the University of
Konstanz in Germany, the University of Bergen in Norway, and Fuji Xerox in Japan. The aim of
this project is to produce wide coverage grammars for English, French, German, Norwegian,
Japanese, and Urdu which are written collaboratively with a commonly-agreed-upon set of
grammatical features (http://www.parc.xerox.com/istl/groups/nltt/pargram/). The recently
formed Global WordNet Association is yet another example of cooperation. It is a non-
commercial organization that provides a platform for discussing, sharing and connecting
WordNets for all languages in the world. The first international WordNet conference to be
held in India in early 2002 is expected to address various problems of NLP by researchers
from different parts of the world.
2. SYSTEM ANALYSIS
2.1 Existing System
At the core of any NLP task there is the important issue of natural language
understanding. The process of building computer programs that understand natural language
involves three major problems: the first one relates to the thought process, the second one to
the representation and meaning of the linguistic input, and the third one to the world
knowledge. Thus, an NLP system may begin at the word level – to determine the
morphological structure, nature (such as part-of-speech, meaning) etc. of the word – and then
may move on to the sentence level – to determine the word order, grammar, meaning of the
entire sentence, etc.— and then to the context and the overall environment or domain. A
given word or a sentence may have a specific meaning or connotation in a given context or
domain, and may be related to many other words and/or sentences in the given context.
Liddy (2011) and Feldman (2012) noted that, in order to understand natural
languages, it is important to be able to distinguish among the following seven interdependent
levels that people use to extract meaning from text or spoken language:
the phonetic or phonological level, which deals with pronunciation;
the morphological level, which deals with the smallest meaning-bearing parts of words,
including suffixes and prefixes;
the lexical level, which deals with the lexical meaning of words and part-of-speech analysis;
the syntactic level, which deals with the grammar and structure of sentences;
the semantic level, which deals with the meaning of words and sentences;
the discourse level, which deals with the structure of different kinds of text, using document
structures; and
the pragmatic level, which deals with knowledge that comes from the outside world, i.e.,
from outside the contents of the document.
A natural language processing system may involve all or some of these levels of analysis.
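As a toy illustration, the lower levels of analysis can be sketched in a few lines of Python; the lexicon, the tag names and the single suffix rule below are invented for the example and stand in for real morphological and lexical resources:

```python
# A hypothetical miniature lexicon mapping stems to parts of speech.
LEXICON = {"dog": "NOUN", "cat": "NOUN", "chase": "VERB", "the": "DET"}

def morph(word):
    # Morphological level: peel off a final "s" if the stem is known.
    if word.endswith("s") and word[:-1] in LEXICON:
        return word[:-1]
    return word

def lexical(tokens):
    # Lexical level: assign each stem its part of speech; unknown words
    # fall through to "UNK" and would need the higher levels to resolve.
    return [(t, LEXICON.get(morph(t), "UNK")) for t in tokens]

print(lexical("the dogs chase the cats".split()))
```

A full system would continue upward: a syntactic level parsing the tag sequence, and semantic, discourse and pragmatic levels interpreting the result in context.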
2.1.1 Disadvantages
When translation is required from one language to another, for example from French
to English, there are three basic methods that can be employed:
translating each phrase on a word-for-word basis,
hiring someone who speaks both languages, or
using translation software.
Using simple dictionaries for a word by word translation is very time consuming and can
often result in errors. Many words have different meanings in various contexts. And if the
reader of the translated material finds the wording funny, that can be a bad reflection on your
business. Allowing the gist of your material to be lost in translation can therefore mean the
loss of clients.
Hiring someone who speaks a couple of languages generally leads to much better
results. Therefore this option can be fine for small projects with an occasional need for
translations. When you need your information translated to several different languages
however, things get more complicated. In that situation you will probably need to find more
than one translator. Moreover, the sentences produced by existing translation tools often
suffer from:
sentences that are not meaningful
garbled handling of special characters
unreliable document conversion
long loading times over low-bandwidth connections
minimal support in older web browsers
a dependence on third-party scripting languages for processing
2.2 Proposed System
An apparatus for translating a series of source words in a first language to a series of
target words in a second language works as follows. For an input series of source words, at
least two target hypotheses, each including a series of target words, are generated. Each
target word has a context comprising at least one other word in the target hypothesis. For
each target hypothesis, a language model match score is computed, comprising an estimate of
the probability of occurrence of the series of words in the target hypothesis. At least one
alignment connecting each source word with at least one target word in the target hypothesis
is identified. For each source word and each target hypothesis, a word match score is
computed, comprising an estimate of the conditional probability of occurrence of the source
word, given the target word in the target hypothesis which is connected to the source word
and given the context in the target hypothesis of the target word which is connected to the
source word. For each target hypothesis, a translation match score is computed as a
combination of the word match scores for the target hypothesis and the source words in the
input series of source words. A target hypothesis match score is then computed as a
combination of the language model match score for the target hypothesis and the translation
match score for the target hypothesis. The target hypothesis having the best target hypothesis
match score is output.
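In log space, the combination of scores described above reduces to a sum. The sketch below uses invented numbers for two competing hypotheses; real scores would come from the language model and the word-alignment model:

```python
def hypothesis_match_score(lm_logprob, word_match_logprobs):
    """Combine the scores described above in log space: the language-model
    score for the target word sequence plus the per-source-word match
    scores, so the overall combination is a simple sum."""
    return lm_logprob + sum(word_match_logprobs)

# Two toy hypotheses for the same source sentence (all numbers invented):
h1 = hypothesis_match_score(-4.0, [-0.5, -0.8])   # fluent, weaker alignment
h2 = hypothesis_match_score(-3.0, [-1.5, -2.0])   # fluent, poor word matches
best = max([("hypothesis 1", h1), ("hypothesis 2", h2)],
           key=lambda kv: kv[1])[0]
```

The hypothesis with the best combined score is the one output, exactly as in the final step above.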
The technique of creating interpolated language models for different contexts has
been used with success in a number of conversational interfaces [1, 2, 3]. In this case, the
pertinent context is the system’s “dialogue state”, and it’s typical to group transcribed
utterances by dialogue state and build one language model per state. Typically, states with
little data are merged, and the state-specific language models are interpolated, or otherwise
merged. Language models corresponding to multiple states may also be interpolated, to share
information across similar states. The technique we develop here differs in two key respects.
First, we derive interpolation weights for thousands of recognition contexts, rather than a
handful of dialogue states. This makes it impractical to create each interpolated language
model offline and swap in the desired one at runtime. Our language models are large, and we
only learn the recognition context for a particular utterance when the audio starts to arrive.
Second, rather than relying on transcribed utterances from each recognition context to train
state-specific language models, we instead interpolate a small number of language models
trained from large corpora.
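A minimal sketch of such interpolation, with two invented unigram "models" standing in for the large n-gram models described above; the context-specific weights are likewise made up for the example:

```python
def interpolate(models, weights):
    """Return a language model that mixes the component models with
    context-specific weights.  Here each 'model' is just a dict mapping a
    word to its probability -- a stand-in for a full n-gram model."""
    def p(word):
        return sum(w * m.get(word, 0.0) for m, w in zip(models, weights))
    return p

# Hypothetical component models trained on large general and media corpora:
general = {"play": 0.02, "music": 0.01, "weather": 0.005}
media   = {"play": 0.10, "music": 0.08, "weather": 0.001}

# A "media" recognition context leans on the media corpus model:
lm = interpolate([general, media], weights=[0.3, 0.7])
```

Because only the weights differ per context, thousands of recognition contexts can share the same few component models at runtime, rather than swapping in a fully built model per context.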
2.2.1 Advantages
1. Understandable. For instance, if we translate an English text into our mother tongue,
such as Malay, it becomes much easier for us to understand.
2. Gain knowledge. Translating a literary text takes effort, but literature is one of the
branches of learning a language, and through translation we come to know more of
literary texts such as Shakespeare's poems.
3. Widen vocabulary. Literary texts use rich classic English words, and translating them
into other languages may draw on equally rich words, thereby increasing our
vocabulary indirectly.
4. Discipline the mind. Those in the literary field can discipline their minds by studying,
researching and discovering new words and even cultures in the texts they translate.
As a result, we can develop our own experts in translating literary texts, so that we do
not have to import them.
5. Knowing history. We can learn the history contained in the literary text itself. For
example, foreigners can learn more about the history of Malaysia by reading the
Hikayat Tun Sri Lanang, and, conversely, we can learn about other countries'
cultures by reading their literary texts. This also leads to knowledge of cultures,
politics and customs.
2.3 Future Work
Register, or context of situation, is a set of vocabularies and their meanings, a
configuration of semantic patterns, along with words and structures such as the double
negative (as used by some Black American speakers) employed in the realization of these
meanings. It relates variation of language use to variations of social context. Every context
has its distinctive vocabulary. You can see a great difference between the vocabulary used by
mechanics in a garage and that of doctors. The selection of meanings constitutes the variety to
which a text belongs.
Halliday discusses the term Register in detail. This term refers to the relationship between
language (and other semiotic forms) and the features of the context. All along, we have been
characterizing this relationship (which we can now call register) by using the descriptive
categories of Field, Tenor, and Mode. Registers vary. There are clues or indices in the
language that help us predict what the register of a given text (spoken or written) is. Halliday
uses the example of the phrase "Once upon a time" as an indexical feature that signals to us
that we're about to read or hear a folk tale. Halliday also distinguishes between register and
another kind of language variety, dialect. For Halliday, dialect variety is a variety according
to the user. Dialects can be regional or social. Register is a variety according to use, or the
social activity in which you are engaged. Halliday says, "dialects are saying the same thing in
different ways, whereas registers are saying different things."
Register variables delineate the relationship between language function and language
form. To get a clear understanding of language form and function, consider the words cats
and dogs. The final s in both has the same written form. In cats it is pronounced /s/, but in
dogs it is pronounced /z/, so they have different spoken forms. It functions the same in both
because it turns them into plural forms. Language functions are also of great importance.
Some language functions are vocative, aesthetic, phatic, metalingual, informative, descriptive,
expressive and social. Among them, the last four are the most important here, so let's take a
brief look at them.
Descriptive function gives actual information. You can test this information, then accept or
reject it. (It's -10° outside. If it is winter, this can be accepted; but in summer it would be
rejected in a normal situation.)
Expressive function supplies information about the speaker and his or her feelings. (I won't
invite her again. It is implied that the speaker did not like her at the first meeting.) Newmark
believes the core of the expressive function is the mind of the speaker, the writer or the
originator of the utterance, who uses the utterance to express his feelings irrespective of any
response.
Social function shows a particular relationship between speaker and listener. (Will that be all,
sir? The sentence implies the context of a restaurant.)
Informative function: Newmark believes the core of the informative function of language is
the external situation, the facts of a topic, reality outside language, including reported ideas or
theories. The format of an informative text is often standard: a textbook, a technical report, an
article in a newspaper or a periodical, a scientific paper, a thesis, or the minutes or agenda of a
meeting.
2.4 Feasibility Study
A feasibility study, also known as feasibility analysis, is an analysis of the viability of
an idea. It describes a preliminary study undertaken to determine and document a project’s
viability. The results of this analysis are used in making the decision whether to proceed with
the project or not. This analytical tool, used during the project planning phase, shows how a
business would operate under a set of assumptions, such as the technology used, the facilities
and equipment, the capital needs, and other financial aspects. The study is the first point in a
project development process at which it becomes clear whether the project is a technically and
economically feasible concept. As the study requires a strong financial and technical
background, outside consultants conduct most studies.
A feasible project is one where the project can generate an adequate amount of cash
flow and profit, withstand the risks it will encounter, remain viable in the long term and
meet the goals of the business. The venture can be a start-up of a new business, a purchase
of an existing business, or an expansion of a current business. Consequently, costs and
benefits are estimated with greater accuracy at this stage.
Feasibility Considerations:
Three key considerations are involved in the feasibility study.
1. Economic feasibility
2. Technical feasibility
3. Operational feasibility
2.4.1 Economic Feasibility
Economic analysis could also be referred to as cost/benefit analysis. It is the most
frequently used method for evaluating the effectiveness of a new system. In economic
analysis the procedure is to determine the benefits and savings that are expected from a
candidate system and compare them with costs. If benefits outweigh costs, then the decision
is made to design and implement the system. An entrepreneur must accurately weigh the cost
versus benefits before taking an action.
Possible questions raised in economic analysis are:
Is the system cost effective?
Do benefits outweigh costs?
The cost of doing full system study
The cost of business employee time
Estimated cost of hardware
Estimated cost of software/software development
Is the project possible, given the resource constraints?
What are the savings that will result from the system?
Cost of employees' time for study
Cost of packaged software/software development
Selection among alternative financing arrangements (rent/lease/purchase)
The concerned business must be able to see the value of the investment it is pondering
before committing to an entire system study. If short-term costs are not overshadowed by
long-term gains or produce no immediate reduction in operating costs, then the system is not
economically feasible, and the project should not proceed any further. If the expected benefits
equal or exceed costs, the system can be judged to be economically feasible. Economic
analysis is used for evaluating the effectiveness of the proposed system.
The economic feasibility will review the expected costs to see if they are in-line with
the projected budget or if the project has an acceptable return on investment. At this point, the
projected costs will only be a rough estimate. The exact costs are not required to determine
economic feasibility. It is only required to determine if it is feasible that the project costs will
fall within the target budget or return on investment. A rough estimate of the project schedule
is required to determine if it would be feasible to complete the systems project within a
required timeframe. The required timeframe would need to be set by the organization.
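As a worked sketch of this comparison, the break-even test reduces to simple arithmetic. The cost and benefit figures below are purely hypothetical; real values would come from the feasibility study itself:

```java
// Hypothetical cost/benefit figures for illustration only; real estimates
// would come from the feasibility study itself.
public class EconomicFeasibility {

    // Return on investment as a percentage: (benefits - costs) / costs * 100.
    public static double roi(double totalBenefits, double totalCosts) {
        return (totalBenefits - totalCosts) / totalCosts * 100.0;
    }

    // The project is judged economically feasible when expected benefits
    // equal or exceed expected costs.
    public static boolean isFeasible(double totalBenefits, double totalCosts) {
        return totalBenefits >= totalCosts;
    }

    public static void main(String[] args) {
        double costs = 50_000;    // hardware, software, study and staff time
        double benefits = 65_000; // projected savings over the review period
        System.out.println("ROI: " + roi(benefits, costs) + "%");
        System.out.println("Feasible: " + isFeasible(benefits, costs));
    }
}
```

With these figures the rough estimate is a 30% return, so the project would fall within an acceptable budget.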
2.4.2 Technical Feasibility
A large part of determining resources has to do with assessing technical feasibility. It
considers the technical requirements of the proposed project. The technical requirements are
then compared to the technical capability of the organization. The systems project is
considered technically feasible if the internal technical capability is sufficient to support the
project requirements.
The analyst must find out whether current technical resources can be upgraded or
added to in a manner that fulfils the request under consideration. This is where the expertise
of system analysts is beneficial, since using their own experience and their contact with
vendors they will be able to answer the question of technical feasibility.
The essential questions that help in testing the technical feasibility of a system include the
following:
Is the project feasible within the limits of current technology?
Does the technology exist at all?
Is it available within the given resource constraints (manpower such as programmers,
testers and debuggers; software and hardware)?
Is it a practical proposition?
Are the current technical resources sufficient for the new system?
Can they be upgraded to provide the level of technology necessary for the new system?
Do we possess the necessary technical expertise, and is the schedule reasonable?
Can the technology be easily applied to current problems?
Does the technology have the capacity to handle the solution?
2.4.3 Operational Feasibility
Operational feasibility is dependent on human resources available for the project and involves
projecting whether the system will be used if it is developed and implemented. Operational
feasibility is a measure of how well a proposed system solves the problems, and takes
advantage of the opportunities identified during scope definition and how it satisfies the
requirements identified in the requirements analysis phase of system development.
Operational feasibility reviews the willingness of the organization to support the proposed
system. This is probably the most difficult of the feasibilities to gauge. In order to determine
this feasibility, it is important to understand the management commitment to the proposed
project. If the request was initiated by management, it is likely that there is management
support and the system will be accepted and used. However, it is also important that the
employee base will be accepting of the change.
The essential questions that help in testing the operational feasibility of a system include the
following:
Does current mode of operation provide adequate throughput and response time?
Does current mode provide end users and managers with timely, pertinent, accurate
and useful formatted information?
Does current mode of operation provide cost-effective information services to the
business?
Could there be a reduction in cost and or an increase in benefits?
Does current mode of operation offer effective controls to protect against fraud and to
guarantee accuracy and security of data and information?
Does current mode of operation make maximum use of available resources, including
people, time, and flow of forms?
Does current mode of operation provide reliable services?
Are the services flexible and expandable?
Are the current work practices and procedures adequate to support the new system?
If the system is developed, will it be used?
Manpower problems; Labour objections; Manager resistance
Organizational conflicts and policies
Social acceptability; Government regulations
Does management support the project?
Are the users dissatisfied with current business practices?
Will it reduce the time (operation) considerably?
Have the users been involved in the planning and development of the project?
Will the proposed system really benefit the organization?
Does the overall response time improve?
Will accessibility of information be lost?
Will the system affect the customers in a considerable way?
How do the end-users feel about their role in the new system?
Which end-users or managers may resist or not use the system?
How will the working environment of the end-user change?
Can or will end-users and management adapt to the change?
3 SYSTEM REQUIREMENTS
3.1 Hardware Requirements
Processor - Intel Pentium Dual Core
RAM - 1 GB
Hard disk - 80 GB
Monitor - 17 inch
Keyboard - Logitech
Mouse - Optical mouse (Logitech)
3.2 Software Requirements
Front end - Java
Back end - MS SQL Server
Operating system - Windows 7
Tools used - NetBeans
4 SOFTWARE DESCRIPTION
4.1 FRONT END
4.1.1 Java introduction
Java is an object-oriented programming language developed by Sun Microsystems
and it is also a powerful internet programming language. Java is a high-level programming
language which has the following features:
1. Object oriented
2. Portable
3. Architecture-neutral
4. High-performance
5. Multithreaded
6. Robust
7. Secure
Java is an efficient application programming language, with APIs that support
GUI-based application development. The following features of Java make it suitable for
implementing this project. The language was initially called "Oak", but it was renamed
"Java" in 1995. The primary motivation behind the language was the need for a platform-
independent language that could be used to create software to be embedded in various
consumer electronic devices. Java is a programmer's language: it is cohesive and consistent,
and, except for the constraints imposed by the Internet environment, it gives the
programmer full control. The excitement of the Internet attracted software vendors, so
Java development tools from many vendors quickly became available. That same excitement
has provided the impetus for a multitude of software developers to discover Java and its
many features.
With most programming languages, you either compile or interpret a program so that
you can run it on your computer. The Java programming language is unusual in that a
program is both compiled and interpreted. With the compiler, you first translate a program
into an intermediate language called Java byte codes: the platform-independent codes
interpreted by the interpreter on the Java platform. The interpreter parses and runs each Java
byte code instruction on the computer. Compilation happens just once; interpretation occurs
each time the program is executed.
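As a minimal illustration of this compile-then-interpret cycle (the class name and message here are arbitrary):

```java
// Compiled once to byte codes with `javac Hello.java`, then interpreted
// on any Java VM with `java Hello`.
public class Hello {

    // A pure method, so the behaviour is easy to check independently of I/O.
    public static String greeting() {
        return "Hello from the Java platform";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

The resulting `Hello.class` file contains the byte codes; the same file runs unchanged on any platform with a Java VM.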
You can think of Java byte codes as the machine code instructions for the Java
Virtual Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web
browser that can run applets, is an implementation of the Java VM. Java byte codes help
make “write once, run anywhere” possible. You can compile your program into byte codes on
any platform that has a Java compiler. The byte codes can then be run on any implementation
of the Java VM. That means that as long as a computer has a Java VM, the same program
written in the Java programming language can run on Windows 2000, a Solaris workstation,
or on an iMac.
The Java Platform
A platform is the hardware or software environment in which a program runs. We’ve already
mentioned some of the most popular platforms like Windows 2000, Linux, Solaris, and
MacOS. Most platforms can be described as a combination of the operating system and
hardware. The Java platform differs from most other platforms in that it’s a software-only
platform that runs on top of other hardware-based platforms.
The Java platform has two components:
The Java Virtual Machine (Java VM)
The Java Application Programming Interface (Java API)
You’ve already been introduced to the Java VM. It’s the base for the Java platform and is
ported onto various hardware-based platforms. The Java API is a large collection of
ready-made software components that provide many useful capabilities, such as graphical user
interface (GUI) widgets. The Java API is grouped into libraries of related classes and
interfaces; these libraries are known as packages. The next section, "What Can Java
Technology Do?", highlights the functionality that some of the packages in the Java API
provide. The following figure depicts a program that’s running on the Java platform. As the
figure shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after compilation, runs on a specific hardware
platform. As a platform-independent environment, the Java platform can be a bit slower than
native code. However, smart compilers, well-tuned interpreters, and just-in-time byte code
compilers can bring performance close to that of native code without threatening portability.
What Can Java Technology Do?
The most common types of programs written in the Java programming language are
applets and applications. If you’ve surfed the Web, you’re probably already familiar with
applets. An applet is a program that adheres to certain conventions that allow it to run within
a Java-enabled browser. However, the Java programming language is not just for writing
cute, entertaining applets for the Web. The general-purpose, high-level Java programming
language is also a powerful software platform. Using the generous API, you can write many
types of programs.
An application is a standalone program that runs directly on the Java platform. A
special kind of application known as a server serves and supports clients on a network.
Examples of servers are Web servers, proxy servers, mail servers, and print servers. Another
specialized program is a servlet. A servlet can almost be thought of as an applet that runs on
the server side. Java Servlets are a popular choice for building interactive web applications,
replacing the use of CGI scripts.
Servlets are similar to applets in that they are runtime extensions of applications.
Instead of working in browsers, though, servlets run within Java Web servers, configuring or
tailoring the server.
How does the API support all these kinds of programs? It does so with packages of software
components that provide a wide range of functionality. Every full implementation of the
Java platform gives you the following features:
The essentials: Objects, strings, threads, numbers, input and output, data structures,
system properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol), UDP (User Datagram
Protocol) sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users
worldwide. Programs can automatically adapt to specific locales and be displayed in
the appropriate language.
Security: Both low level and high level, including electronic signatures, public and
private key management, access control, and certificates.
Software components: Known as JavaBeansTM, these can plug into existing component
architectures.
Object serialization: Allows lightweight persistence and communication via Remote
Method Invocation (RMI).
Java Database Connectivity (JDBCTM): Provides uniform access to a wide range of
relational databases.
The Java platform also has APIs for 2D and 3D graphics, accessibility, servers, collaboration,
telephony, speech, animation, and more. The following figure depicts what is included in the
Java 2 SDK.
How Will Java Technology Change My Life?
We can’t promise you fame, fortune, or even a job if you learn the Java programming
language. Still, it is likely to make your programs better and to require less effort than other
languages. We believe that Java technology will help you do the following:
Get started quickly: Although the Java programming language is a powerful object-
oriented language, it’s easy to learn, especially for programmers already familiar with
C or C++.
Write less code: Comparisons of program metrics (class counts, method counts, and
so on) suggest that a program written in the Java programming language can be four
times smaller than the same program in C++.
Write better code: The Java programming language encourages good coding
practices, and its garbage collection helps you avoid memory leaks. Its object
orientation, its JavaBeans component architecture, and its wide-ranging, easily
extendible API let you reuse other people’s tested code and introduce fewer bugs.
Develop programs more quickly: Your development time may be as little as half of
what it would be writing the same program in C++. Why? You write fewer lines of code,
and Java is a simpler programming language than C++.
Avoid platform dependencies with 100% Pure Java: You can keep your program
portable by avoiding the use of libraries written in other languages. The 100% Pure
JavaTM Product Certification Program has a repository of historical process manuals,
white papers, brochures, and similar materials online.
Write once, run anywhere: Because 100% Pure Java programs are compiled into
machine-independent byte codes, they run consistently on any Java platform.
Distribute software more easily: You can upgrade applets easily from a central
server. Applets take advantage of the feature of allowing new classes to be loaded “on
the fly,” without recompiling the entire program.
ODBC
Microsoft Open Database Connectivity (ODBC) is a standard programming interface
for application developers and database systems providers. Before ODBC became a de facto
standard for Windows programs to interface with database systems, programmers had to use
proprietary languages for each database they wanted to connect to. Now, ODBC has made the
choice of the database system almost irrelevant from a coding perspective, which is as it
should be. Application developers have much more important things to worry about than the
syntax that is needed to port their program from one database to another when business needs
suddenly change.
Through the ODBC Administrator in Control Panel, you can specify the particular
database that is associated with a data source that an ODBC application program is written
to use. Think of an ODBC data source as a door with a name on it. Each door will lead you to
a particular database. For example, the data source named Sales Figures might be a SQL
Server database, whereas the Accounts Payable data source could refer to an Access
database. The physical database referred to by a data source can reside anywhere on the
LAN.
The ODBC system files are not installed on your system by Windows 95. Rather, they are
installed when you set up a separate database application, such as SQL Server Client or Visual
Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called
ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-
alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this
program and each maintains a separate list of ODBC data sources. From a programming
perspective, the beauty of ODBC is that the application can be written to use the same set of
function calls to interface with any data source, regardless of the database vendor. The source
code of the application doesn’t change whether it talks to Oracle or SQL Server. We only
mention these two as an example. There are ODBC drivers available for several dozen
popular database systems. Even Excel spreadsheets and plain text files can be turned into data
sources. The operating system uses the Registry information written by ODBC Administrator
to determine which low-level ODBC drivers are needed to talk to the data source (such as the
interface to Oracle or SQL Server). The loading of the ODBC drivers is transparent to the
ODBC application program. In a client/server environment, the ODBC API even handles
many of the network issues for the application programmer.
The advantages of this scheme are so numerous that you are probably thinking there
must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as talking
directly to the native database interface. ODBC has had many detractors make the charge that
it is too slow. Microsoft has always claimed that the critical factor in performance is the
quality of the driver software that is used. In our humble opinion, this is true. The availability
of good ODBC drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never match the
speed of pure assembly language.
Maybe not, but the compiler (or ODBC) gives you the opportunity to write cleaner programs,
which means you finish sooner. Meanwhile, computers get faster every year.
JDBC
In an effort to set an independent database standard API for Java, Sun Microsystems
developed Java Database Connectivity, or JDBC. JDBC offers a generic SQL database access
mechanism that provides a consistent interface to a variety of RDBMSs. This consistent
interface is achieved through the use of “plug-in” database connectivity modules, or drivers.
If a database vendor wishes to have JDBC support, he or she must provide the driver for each
platform that the database and Java run on.
To gain a wider acceptance of JDBC, Sun based JDBC’s framework on ODBC. As
you discovered earlier in this chapter, ODBC has widespread support on a variety of
platforms. Basing JDBC on ODBC will allow vendors to bring JDBC drivers to market much
faster than developing a completely new connectivity solution.
JDBC was announced in March of 1996. It was released for a 90 day public review
that ended June 8, 1996. Because of user input, the final JDBC v1.0 specification was
released soon after. The remainder of this section will cover enough information about JDBC
for you to know what it is about and how to use it effectively. This is by no means a complete
overview of JDBC. That would fill an entire book.
JDBC Goals
Few software packages are designed without goals in mind, and JDBC is no exception: its
many goals drove the development of the API. These goals, in conjunction with early
reviewer feedback, have finalized the JDBC class library into a solid framework for building
database applications in Java.
The goals that were set for JDBC are important. They will give you some insight as to
why certain classes and functionalities behave the way they do. The eight design goals for
JDBC are as follows:
1. SQL Level API
The designers felt that their main goal was to define a SQL interface for Java.
Although not the lowest database interface level possible, it is at a low enough level for
higher-level tools and APIs to be created. Conversely, it is at a high enough level for
application programmers to use it confidently. Attaining this goal allows for future tool
vendors to “generate” JDBC code and to hide many of JDBC’s complexities from the end
user.
2. SQL Conformance
SQL syntax varies as you move from database vendor to database vendor. In an effort to
support a wide variety of vendors, JDBC will allow any query statement to be passed through
it to the underlying database driver. This allows the connectivity module to handle non-
standard functionality in a manner that is suitable for its users.
3. JDBC must be implementable on top of common database interfaces
The JDBC SQL API must “sit” on top of other common SQL level APIs. This goal allows
JDBC to use existing ODBC level drivers by the use of a software interface. This interface
would translate JDBC calls to ODBC and vice versa.
4. Provide a Java interface that is consistent with the rest of the Java system
Because of Java’s acceptance in the user community thus far, the designers felt that they
should not stray from the current design of the core Java system.
5. Keep it simple
This goal probably appears in all software design goal listings. JDBC is no exception. Sun
felt that the design of JDBC should be very simple, allowing for only one method of
completing a task per mechanism. Allowing duplicate functionality only serves to confuse the
users of the API.
6. Use strong, static typing wherever possible
Strong typing allows more error checking to be done at compile time; consequently,
fewer errors appear at runtime.
7. Keep the common cases simple
Because, more often than not, the usual SQL calls used by the programmer are simple
SELECT, INSERT, DELETE and UPDATE statements, these queries should be simple to
perform with JDBC. However, more complex SQL statements should also be possible.
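A small sketch of such a common case is given below. The connection URL, table name and column names are hypothetical, purely for illustration; any JDBC driver on the classpath could serve them.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SimpleQuery {

    // Building the statement text is kept separate so it can be checked
    // without a live database. The table and column names are hypothetical.
    public static String selectByIdSql(String table) {
        return "SELECT name FROM " + table + " WHERE id = ?";
    }

    // A typical simple SELECT: connect, prepare, bind, execute, read.
    // The url parameter would name a real driver, e.g. a SQL Server JDBC URL.
    public static String findName(String url, int id) throws SQLException {
        try (Connection con = DriverManager.getConnection(url);
             PreparedStatement ps = con.prepareStatement(selectByIdSql("employees"))) {
            ps.setInt(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null;
            }
        }
    }
}
```

The same code runs against any vendor's database; only the connection URL and driver change, which is exactly the portability goal described above.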
Finally, we decided to proceed with the implementation using Java networking, and for
dynamically updating the cache table we use an MS Access database. Java comprises two
things: a programming language and a platform.
Networking
TCP/IP stack
The TCP/IP stack is shorter than the OSI one. TCP is a connection-oriented protocol;
UDP (User Datagram Protocol) is a connectionless protocol.
IP datagrams
The IP layer provides a connectionless and unreliable delivery system. It considers each
datagram independently of the others. Any association between datagrams must be supplied
by the higher layers. The IP layer supplies a checksum that includes its own header. The
header includes the source and destination addresses. The IP layer handles routing through an
internet. It is also responsible for breaking up large datagrams into smaller ones for
transmission and reassembling them at the other end.
UDP
UDP is also connectionless and unreliable. What it adds to IP is a checksum for the contents
of the datagram and port numbers. These are used to provide a client/server model, as
described later.
TCP
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a
virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme for
machines so that they can be located. The address is a 32 bit integer which gives the IP
address. This encodes a network ID and more addressing. The network ID falls into various
classes according to the size of the network address.
Network address
Class A uses 8 bits for the network address, with 24 bits left over for other addressing. Class
B uses 16-bit network addressing, Class C uses 24-bit network addressing, and Class D is
reserved for multicast addressing.
Subnet address
Internally, the UNIX network is divided into sub networks. Building 11 is currently on one
sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address
8 bits are finally used for host addresses within our subnet. This places a limit of 256
machines that can be on the subnet.
Total address
The 32 bit address is usually written as 4 integers separated by dots.
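The dotted notation can be derived from the 32-bit integer by extracting one byte at a time; a small sketch:

```java
public class DottedQuad {

    // Convert a 32-bit IPv4 address into the familiar dotted notation by
    // extracting the four bytes from most significant to least significant.
    public static String toDotted(int address) {
        return ((address >>> 24) & 0xFF) + "." +
               ((address >>> 16) & 0xFF) + "." +
               ((address >>> 8) & 0xFF) + "." +
               (address & 0xFF);
    }
}
```

For example, the integer 0x7F000001 corresponds to the loopback address 127.0.0.1.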
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a
message to a server, you send it to the port for that service of the host that it is running on.
This is not location transparency! Certain of these ports are "well known".
Sockets
A socket is a data structure maintained by the system to handle network connections. A
socket is created using the socket call. It returns an integer that is like a file descriptor. In
fact, under Windows, this handle can be used with the ReadFile and WriteFile functions.
#include <sys/types.h>
#include <sys/socket.h>
int socket(int family, int type, int protocol);
Here "family" will be AF_INET for IP communications, protocol will be zero, and type will
depend on whether TCP or UDP is used. Two processes wishing to communicate over a
network create a socket each. These are similar to two ends of a pipe - but the actual pipe
does not yet exist.
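In Java the same idea is wrapped by the java.net classes, so the family/type/protocol details are chosen for you. A minimal loopback sketch, with the port picked by the operating system:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class LoopbackEcho {

    // Create a listening socket, connect a client to it over the loopback
    // interface, send one line, and return the line the server received.
    public static String roundTrip(String message) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // port 0: OS picks a free port
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket accepted = server.accept()) {
                // Client end of the "pipe": write one line, auto-flushed.
                new PrintWriter(client.getOutputStream(), true).println(message);
                // Server end of the "pipe": read the line back.
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(accepted.getInputStream()));
                return in.readLine();
            }
        }
    }
}
```

The two sockets behave like the two ends of the pipe described above: bytes written on one end arrive at the other.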
JFree Chart
JFreeChart is a free 100% Java chart library that makes it easy for developers to display
professional quality charts in their applications. JFreeChart's extensive feature set includes:
a consistent and well-documented API, supporting a wide range of chart types;
a flexible design that is easy to extend, targeting both server-side and client-side
applications;
support for many output types, including Swing components, image files
(including PNG and JPEG), and vector graphics file formats (including PDF, EPS and SVG).
JFreeChart is "open source" or, more specifically, free software. It is distributed under the
terms of the GNU Lesser General Public License (LGPL), which permits use in proprietary
applications.
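As a sketch of how small a JFreeChart program can be, the example below builds a pie chart and saves it as a PNG file. It assumes the JFreeChart jar (1.0.x API) is on the classpath; the dataset values and file name are hypothetical.

```java
import java.io.File;
import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.data.general.DefaultPieDataset;

public class PieChartDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical data purely for illustration.
        DefaultPieDataset dataset = new DefaultPieDataset();
        dataset.setValue("Java", 60);
        dataset.setValue("C++", 25);
        dataset.setValue("Other", 15);

        // Arguments: title, dataset, legend, tooltips, URLs.
        JFreeChart chart = ChartFactory.createPieChart(
                "Languages used", dataset, true, true, false);

        // One of the supported output types: a PNG image file.
        ChartUtilities.saveChartAsPNG(new File("languages.png"), chart, 500, 300);
    }
}
```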
1. Map Visualizations
Charts showing values that relate to geographical areas. Some examples include: (a)
population density in each state of the United States, (b) income per capita for each
country in Europe, (c) life expectancy in each country of the world.
The tasks in this project include:
Sourcing freely redistributable vector outlines for the countries of the world and for
states/provinces in particular countries (the USA in particular, but also other areas);
creating an appropriate dataset interface (plus default implementation) and a renderer,
and integrating these with the existing XYPlot class in JFreeChart; testing, documenting,
testing some more, documenting some more.
2. Time Series Chart Interactivity
Implement a new (to JFreeChart) feature for interactive time series charts: display a
separate control that shows a small version of all the time series data, with a sliding "view"
rectangle that allows you to select the subset of the time series data to display in the main
chart.
3. Dashboards
There is currently a lot of interest in dashboard displays. Create a flexible dashboard
mechanism that supports a subset of JFreeChart chart types (dials, pies, thermometers, bars,
and lines/time series) that can be delivered easily via both Java Web Start and an applet.
4. Property Editors
The property editor mechanism in JFreeChart only handles a small subset of the properties
that can be set for charts. Extend (or re-implement) this mechanism to provide greater
end-user control over the appearance of the charts.
J2ME (Java 2 Micro Edition)
Sun Microsystems defines J2ME as "a highly optimized Java run-time environment targeting
a wide range of consumer products, including pagers, cellular phones, screen-phones, digital
set-top boxes and car navigation systems." Announced in June 1999 at the JavaOne
Developer Conference, J2ME brings the cross-platform functionality of the Java language to
smaller devices, allowing mobile wireless devices to share applications. With J2ME, Sun has
adapted the Java platform for consumer products that incorporate or are based on small
computing devices.
1. General J2ME architecture
J2ME uses configurations and profiles to customize the Java Runtime Environment (JRE). As
a complete JRE, J2ME is comprised of a configuration, which determines the JVM used, and
a profile, which defines the application by adding domain-specific classes. The configuration
defines the basic run-time environment as a set of core classes and a specific JVM that run on
specific types of devices; configurations are discussed in detail later. The profile defines the
application; specifically, it adds domain-specific classes to the J2ME configuration to define
certain uses for devices. The following graphic depicts the relationship between the different
virtual machines, configurations, and profiles.
It also draws a parallel with the J2SE API and its Java virtual machine. While the J2SE
virtual machine is generally referred to as a JVM, the J2ME virtual machines, KVM and
CVM, are subsets of JVM. Both KVM and CVM can be thought of as a kind of Java virtual
machine -- it's just that they are shrunken versions of the J2SE JVM and are specific to
J2ME.
2. Developing J2ME applications
In this section, we will go over some considerations you need to keep in mind
when developing applications for smaller devices. We'll take a look at the way the compiler
is invoked when using J2SE to compile J2ME applications. Finally, we'll explore packaging
and deployment and the role pre-verification plays in this process.
3. Design considerations for small devices
Developing applications for small devices requires you to keep certain strategies in mind
during the design phase. It is best to strategically design an application for a small device
before you begin coding. Correcting the code because you failed to consider all of the
"gotchas" before developing the application can be a painful process. Here are some design
strategies to consider:
* Keep it simple. Remove unnecessary features, possibly making those features a separate,
secondary application.
* Smaller is better. This consideration should be a "no brainer" for all developers. Smaller
applications use less memory on the device and require shorter installation times. Consider
packaging your Java applications as compressed Java Archive (jar) files.
* Minimize run-time memory use. To minimize the amount of memory used at run time, use
scalar types in place of object types. Also, do not depend on the garbage collector. You
should manage the memory efficiently yourself by setting object references to null when you
are finished with them. Another way to reduce run-time memory is to use lazy instantiation,
only allocating objects on an as-needed basis.
Other ways of reducing overall and peak memory use on small devices are to release
resources quickly, reuse objects, and avoid exceptions.
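The lazy-instantiation strategy mentioned above can be sketched as follows. The Parser class here is a hypothetical stand-in for any expensive-to-create helper:

```java
public class LazyHolder {

    // Hypothetical stand-in for an object that is expensive to construct
    // on a small device.
    static class Parser {
        Parser() { /* imagine heavy initialisation here */ }
    }

    private Parser parser; // not allocated until actually needed

    // Allocate on first use only (lazy instantiation).
    public Parser getParser() {
        if (parser == null) {
            parser = new Parser();
        }
        return parser;
    }

    // Release the reference when finished so the garbage collector can
    // reclaim the memory promptly, rather than depending on it.
    public void releaseParser() {
        parser = null;
    }

    public boolean isAllocated() {
        return parser != null;
    }
}
```

If the feature that needs the parser is never used, its memory is never allocated, which matters on devices with very limited heap space.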
4. Configurations overview
The configuration defines the basic run-time environment as a set of core classes and a
specific JVM that run on specific types of devices. Currently, two configurations exist for
J2ME, though others may be defined in the future:
* Connected Limited Device Configuration (CLDC) is used specifically with the KVM for
16-bit or 32-bit devices with limited amounts of memory. This is the configuration (and the
virtual machine) used for developing small J2ME applications. Its size limitations make
CLDC more interesting and challenging (from a development point of view) than CDC.
CLDC is also the configuration that we will use for developing our drawing tool application.
An example of a small wireless device running small applications is a Palm hand-held
computer.
* Connected Device Configuration (CDC) is used with the C virtual machine (CVM) and is
used for 32-bit architectures requiring more than 2 MB of memory. An example of such a
device is a Net TV box.
J2ME profiles
What is a J2ME profile?
As mentioned earlier, a profile defines the type of device supported. The
Mobile Information Device Profile (MIDP), for example, defines classes for cellular phones.
It adds domain-specific classes to the J2ME configuration to define uses for similar devices.
Two profiles have been defined for J2ME and are built upon CLDC: KJava and MIDP. Both
KJava and MIDP are associated with CLDC and smaller devices. Profiles are built on top of
configurations. Because profiles are specific to the size of the device (amount of memory) on
which an application runs, certain profiles are associated with certain configurations.
A skeleton profile upon which you can create your own profile, the Foundation Profile, is
available for CDC.
Profile 1: KJava
KJava is Sun's proprietary profile and contains the KJava API. The KJava profile is built on
top of the CLDC configuration. The KJava virtual machine, KVM, accepts the same byte
codes and class file format as the classic J2SE virtual machine. KJava contains a Sun-specific
API that runs on the Palm OS. The KJava API has a great deal in common with the J2SE
Abstract Windowing Toolkit (AWT). However, because it is not a standard J2ME package,
its main package is com.sun.kjava. We'll learn more about the KJava API later in this tutorial
when we develop some sample applications.
Profile 2: MIDP
MIDP is geared toward mobile devices such as cellular phones and pagers. The MIDP, like
KJava, is built upon CLDC and provides a standard run-time environment that allows new
applications and services to be deployed dynamically on end user devices. MIDP is a
common, industry-standard profile for mobile devices that is not dependent on a specific
vendor. It is a complete and supported foundation for mobile application development. MIDP
contains the following packages, the first four of which are core CLDC packages, followed by
three MIDP-specific packages:
* java.lang
* java.io
* java.util
* javax.microedition.io
* javax.microedition.lcdui
* javax.microedition.midlet
* javax.microedition.rms
5 PROJECT DESCRIPTION
5.1 System Architecture
Information retrieval has been a major area of application of NLP, and consequently a
number of research projects, dealing with the various applications on NLP in IR, have taken
place throughout the world resulting in a large volume of publications. Lewis and Sparck
Jones (2013) comment that the generic challenge for NLP in the field of IR is whether the
necessary NLP of texts and queries is doable, and the specific challenges are whether non-
statistical and statistical data can be combined and whether data about individual documents
and whole files can be combined. They further comment that there are major challenges in
making the NLP technology operate effectively and efficiently and also in conducting
appropriate evaluation tests to assess whether and how far the approach works in an
environment of interactive searching of large text files. Feldman (2013) suggests that in order
to achieve success in IR, NLP techniques should be applied in conjunction with other
technologies, such as visualization, intelligent agents and speech recognition.
Arguing that syntactic phrases are more meaningful than statistically obtained word pairs,
and thus are more powerful for discriminating among documents, Narita and Ogawa (2012)
use a shallow syntactic processing instead of statistical processing to automatically identify
candidate phrasal terms from query texts. Comparing the performance of Boolean and natural
language searches, Paris and Tibbo (2013) found that in their experiment, Boolean searches
had better results than freestyle (natural language) searches. However, they concluded that
neither could be considered as the best for every query. In other words, their conclusion was
that different queries demand different techniques.
Variations in presenting subject matter greatly affect IR and hence linguistic variation of
document texts is one of the greatest challenges to IR. In order to investigate how
consistently newspapers choose words and concepts to describe an event, Lehtokangas &
Jarvelin (2011) chose articles on the same news from three Finnish newspapers. Their
experiment revealed that for short newswire the consistency was 83% and for long articles
47%. It was also revealed that the newspapers were very consistent in using concepts to
represent events, with a level of consistency varying between 92-97%.
Natural Language Interfaces
A natural language interface is one that accepts query statements or commands in
natural language and sends data to some system, typically a retrieval system, which then
results in appropriate responses to the commands or query statements. A natural language
interface should be able to translate the natural language statements into appropriate actions
for the system. A large number of natural language interfaces that work reasonably well in
narrow domains have been reported in the literature. Much of the efforts in natural language
interface design to date have focused on handling rather simple natural language queries. A
number of question answering systems are now being developed that aim to provide answers
to natural language questions, as opposed to documents containing information related to the
question. Such systems often use a variety of IE and IR operations using NLP tools and
techniques to get the correct answer from the source texts. Breck et al. (2012) report a
question answering system that uses techniques from knowledge representation, information
retrieval, and NLP. The authors claim that this combination enables domain independence
and robustness in the face of text variability, both in the question and in the raw text
documents used as knowledge sources. Research reported in the Question Answering (QA)
track of TREC (Text Retrieval Conferences) shows some interesting results. The basic
technology used by the participants in the QA track included several steps. First, cue
words/phrases like ‘who’ (as in ‘Who is the prime minister of Japan’) and ‘when’ (as in ‘When
did the Jurassic period end’) were identified to guess what kind of answer was needed; then a
small portion of the document collection was retrieved using standard text retrieval technology.
This was followed by a shallow parsing of the returned documents for identifying the
entities required for an answer. If no appropriate answer type was found, then the best-matching
passage was retrieved. This approach works well as long as the query types recognized by the
system have broad coverage, and the system can classify questions reasonably accurately. In
TREC-8, the first QA track of TREC, the most accurate QA systems could answer more than
2/3 of the questions correctly. In the second QA track (TREC-9), the best performing QA
system, the Falcon system from Southern Methodist University, was able to answer 65% of
the questions (Voorhees, 2000). These results are quite impressive in a domain-independent
question answering environment. However, the questions were still simple in the first two
QA tracks. In the future, QA track researchers will handle more complex questions whose
answers must be obtained from more than one document.
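The cue-word step described above can be sketched in Python. This is an illustrative toy, not the implementation used by any TREC participant; the cue table and answer-type names are assumptions:

```python
# Hypothetical cue-word table mapping question openers to expected
# answer types, as in the first step of the QA pipeline above.
CUE_WORDS = {
    "who": "PERSON",
    "when": "DATE",
    "where": "LOCATION",
    "how many": "NUMBER",
}

def classify_question(question: str) -> str:
    """Guess the expected answer type from the question's cue word."""
    q = question.lower()
    # Try longer cues first so "how many" is not shadowed by a shorter cue.
    for cue in sorted(CUE_WORDS, key=len, reverse=True):
        if q.startswith(cue):
            return CUE_WORDS[cue]
    return "UNKNOWN"
```

For example, classify_question("Who is the prime minister of Japan") returns "PERSON"; a real system would fall back to best-matching-passage retrieval when this returns "UNKNOWN".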
The Natural Language Processing Laboratory, Centre for Intelligent Information
Retrieval at the University of Massachusetts, distributes source codes and executables to
support IE system development efforts at other sites. Each module is designed to be used in a
domain-specific and task-specific customizable IE system. Available software includes
(Natural Language …, n.d.)
* MARMOT Text Bracketing Module, a text file translator which segments arbitrary
text blocks into sentences, applies low-level specialists such as date recognizers,
associates words with part-of-speech tags, and brackets the text into annotated noun
phrases, prepositional phrases, and verb phrases.
* BADGER Extraction Module, which analyses bracketed text and produces case frame
instantiations according to application-specific domain guidelines.
* CRYSTAL Dictionary Induction Module, which learns text extraction rules, suitable
for use by BADGER, from annotated training texts.
* ID3-S Inductive Learning Module, a variant on ID3 which induces decision trees on
the basis of training examples.
5.2 MODULE DESCRIPTION
Section Splitter
Section Filter
Text Tokenizer
Part-of-Speech (POS) Tagger
Noun Phrase Finder
UMLS Concept Finder
Negation Finder
Regular Expression-based Concept Finder
Sentence Splitter
N-Gram Tool
Classifier (e.g. Smoking Status Classifier)
5.2.1 Section Splitter
For this project, you will write the lexical analysis phase (i.e., the "scanner") of a simple
compiler for a subset of the language "Tubular". We will start with only one type of variable
("int"), basic math, and the print command to output results; basically it will be little more
than a standard calculator. Over the next two projects we will turn this into a working
compiler, and in the four projects following that, we will expand the functionality and
efficiency of the language. The program you turn in must load in the source file (as a
command-line argument) and process it line-by-line, removing whitespace and comments and
categorizing each word or symbol as a token. A token is a type of unit that appears in a
source file. You will then output (to standard out) a tokenized version of the file, as described
in more detail below. A pattern is a rule by which a token is identified. It will be up to you
as part of this project to identify the patterns associated with each token. A lexeme is an
instance of a pattern from a source file. For example, "42" is a lexeme that would be
categorized as the token STATIC_INT. On the other hand "my_var" is a lexeme that would
get identified as a token of the type ID. When multiple patterns match the text being
processed, choose the one that produces the longest lexeme that starts at the current position.
If two different patterns produce lexemes of the same length, choose the one that
comes first in the list above. For example, the string "print" might be incorrectly read as the
ID "pr" followed by the TYPE "int", but "print" is longer than "pr", so it should be chosen.
Likewise, the lexeme "print" could match either the pattern for COMMAND or the pattern for
ID, but COMMAND should be chosen since it comes first in the list in the table.
Each student must write this project on their own, with no help from other students or
any other individuals, though you may use whatever pre-existing web resources you like. I
will use your score on this project and your programming style as factors in assembling groups
for future projects. As such, it's well worth putting extra effort into this project. Your lexer
should be able to identify each of the following tokens that it encounters:
Token        Description
TYPE         Data types: currently just "int", but more types will be introduced in future projects.
COMMAND      Any built-in commands: currently just "print".
ID           A sequence beginning with a letter or underscore ('_'), followed by zero or more characters that may contain letters, numbers and underscores. Currently just variable names.
STATIC_INT   Any static integer. We will implement static floating point numbers in a future project, as well as other static types.
OPERATOR     Math operators: + - * / % ( ) = += -= *= /= %=
SEPARATOR    List separation: ,
ENDLINE      Signifies the end of a statement -- semicolon: ;
WHITESPACE   Any number of consecutive spaces, tabs, or newlines.
COMMENT      Everything on a line following a pound-sign, '#'.
UNKNOWN      An unknown character or a sequence that does not match any of the tokens above.
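The longest-lexeme rule and the pattern-priority tie-break described above can be sketched as follows. This is a simplified Python sketch covering only a few of the tokens; a full scanner would handle the complete list, including OPERATOR and SEPARATOR:

```python
import re

# Patterns in priority order: on equal-length matches, the earlier one wins.
TOKEN_PATTERNS = [
    ("TYPE",       r"int"),
    ("COMMAND",    r"print"),
    ("ID",         r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STATIC_INT", r"[0-9]+"),
    ("WHITESPACE", r"[ \t\n]+"),
    ("COMMENT",    r"#[^\n]*"),
]

def next_token(text, pos=0):
    """Return (token, lexeme) for the longest match starting at pos."""
    best = None
    for token, pattern in TOKEN_PATTERNS:
        m = re.match(pattern, text[pos:])
        # Keep only strictly longer matches, so ties favour earlier patterns.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (token, m.group())
    return best if best else ("UNKNOWN", text[pos])
```

On "print", COMMAND and ID both match five characters, so COMMAND wins by priority; on "printer", ID's seven-character match beats COMMAND's five, exactly as the rules above require.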
5.2.2 Section Filter
The fundamental component of a tagger is a POS (part-of-speech) tagset, or list of all
the word categories that will be used in the tagging process. The tagsets used to annotate
large corpora in the past have usually been fairly extensive. The pioneering Brown Corpus
distinguishes 87 simple tags. Subsequent projects have tended to elaborate the Brown Corpus
tagset. For instance, the Lancaster-Oslo/Bergen (LOB) Corpus uses about 135 tags, the
Lancaster UCREL group about 165 tags, and the London-Lund Corpus of Spoken English
197 tags. The rationale behind developing such large, richly articulated tagsets is to approach
"the ideal of providing distinct codings for all classes of words having distinct grammatical
behaviour". Choosing an appropriate set of tags for an annotation system has direct influence
on the accuracy and the usefulness of the tagging system. The larger the tag set, the lower the
accuracy of the tagging system since the system has to be capable of making finer
distinctions. Conversely, the smaller the tag set, the higher the accuracy. However, a very
small tag set tends to make the tagging system less useful since it provides less information.
So, there is a trade-off here. Another issue in tag-set design is the consistency of the tagging
system. Words of the same meaning and same functions should be tagged with the same tags.
5.2.3 Text Tokenizer
Machine Learning methods usually require supervised data to learn a concept.
Labelling data is time consuming, tedious, error prone and expensive. The research
community has looked at semi-supervised and unsupervised learning techniques in order to
obviate the need of labelled data to a certain extent. In addition to the above mentioned
problems with labelled data, all examples are not equally informative or equally easy to label.
For instance, the examples similar to what the learner has already seen are not as useful as
new examples. Moreover, different examples may require different amounts of labelling effort
from the user; for instance, a longer sentence is likely to have more ambiguities and hence would be
harder to parse manually. Active learning is the task of reducing the amount of labelled data
required to learn the target concept by querying the user for labels for the most informative
examples so that the concept is learnt with fewer examples. An active learning problem
setting typically consists of a small set of labelled examples and a large set of unlabelled
examples. An initial classifier is trained on the labelled examples and/or the unlabelled
examples. From the pool of unlabelled examples, selective sampling is used to create a small
subset of examples for the user to label. This iterative process of training, selective sampling
and annotation is repeated until convergence.
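The train/sample/annotate loop just described can be sketched schematically in Python. The learner, uncertainty measure, and labelling oracle below are illustrative stand-ins, not any particular system's components:

```python
def active_learning(labelled, unlabelled, oracle, train, uncertainty,
                    batch_size=2, iterations=3):
    """Iterate: train, selectively sample a batch of unlabelled examples,
    ask the user (oracle) to label it, and add it to the training pool."""
    for _ in range(iterations):
        model = train(labelled)                  # train on current labels
        # Selective sampling: rank the unlabelled pool by uncertainty.
        pool = sorted(unlabelled, key=lambda x: -uncertainty(model, x))
        batch, unlabelled = pool[:batch_size], pool[batch_size:]
        labelled = labelled + [(x, oracle(x)) for x in batch]
    return train(labelled)
```

In practice the loop stops at a desired performance level rather than a fixed iteration count, as noted below.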
Software-oriented mechanisms are less expensive and more flexible in filter lookups
when compared with their hardware-centric counterparts. Such mechanisms are
abundant, commonly involving efficient algorithms for quick packet classification with
an aid of caching or hashing. Their classification speeds rely on efficiency in search over
the rule set using the keys constituted by corresponding header fields. Several
representative software classification techniques are reviewed in sequence.
5.2.4 Part-of-Speech (POS) Tagger
An active learning experiment is usually described by five properties: number of
bootstrap examples, batch size, supervised learner, data set and a stopping criterion. The
supervised learner is trained on the bootstrap examples which are labelled by the user
initially. Batch size is the number of examples that are selectively sampled from the
unlabelled pool and added to the training pool in each iteration. The stopping criterion can be
either a desired performance level or the number of iterations. Performance is evaluated on
the test set in each iteration.
Active learners are usually evaluated by plotting a learning curve of performance vs.
number of labelled examples as shown in figure 1. Success of an active learner is
demonstrated by showing that it achieves better performance than a traditional learner given the
same number of labelled examples; i.e., for achieving the desired performance, the active
learner needs fewer examples than the traditional learner.
5.2.5 Noun Phrase Finder
Active learning aims at reducing the number of examples required to achieve the desired
accuracy by selectively sampling the examples for the user to label and training the classifier
with them. Several different strategies for selective sampling have been explored in the literature. In this
review, we present some of the selective sampling techniques used for active learning in
NLP. Uncertainty-based sampling selects examples that the model is least certain about and
presents them to the user for correction/verification. A lot of work on active learning has used
uncertainty-based sampling. In this section, we describe some of this work.
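A minimal sketch of uncertainty-based sampling, using the entropy of the model's predicted class distribution as the uncertainty score (the function names and data layout are illustrative assumptions):

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a predicted class distribution;
    higher entropy means the model is less certain about the example."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def most_uncertain(predictions, k=1):
    """Pick the k examples whose predictions have the highest entropy.
    `predictions` is a list of (example, class_distribution) pairs."""
    ranked = sorted(predictions, key=lambda pair: -entropy(pair[1]))
    return [example for example, _ in ranked[:k]]
```

A prediction of [0.5, 0.5] has entropy 1.0 bit and is sampled for correction/verification before a confident [0.9, 0.1].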
5.2.6 UMLS Concept Finder
Parsers that parameterize over wider scopes are generally more accurate than edge-
factored models. For graph-based non-projective parsers, wider factorizations have so far
implied large increases in the computational complexity of the parsing problem. This paper
introduces a “crossing-sensitive” generalization of a third-order factorization that trades off
complexity in the model structure (i.e., scoring with features over multiple edges) with
complexity in the output structure (i.e., producing crossing edges). Under this model, the
optimal 1-Endpoint-Crossing tree can be found in O(n^4) time, matching the asymptotic run-
time of both the third-order projective parser and the edge-factored 1-Endpoint-Crossing
parser. The crossing-sensitive third-order parser is significantly more accurate than the third-
order projective parser under many experimental settings and significantly less accurate on
none.
5.2.7 Negation Finder
Although many NLP systems are moving toward entity-based processing, most still identify
important phrases using classical keyword-based approaches. To bridge this gap, we
introduce the task of entity salience: assigning a relevance score to each entity in a document.
We demonstrate how a labeled corpus for the task can be automatically generated from a
corpus of documents and accompanying abstracts. We then show how a classifier with
features derived from a standard NLP pipeline outperforms a strong baseline by 34%. Finally,
we outline initial experiments on further improving accuracy by leveraging background
knowledge about the relationships between entities.
5.2.8 Regular Expression-based Concept Finder
Frame semantics is a linguistic theory that has been instantiated for English in the FrameNet
lexicon. We solve the problem of frame-semantic parsing using a two-stage statistical model
that takes lexical targets (i.e., content words and phrases) in their sentential contexts and
predicts frame-semantic structures. Given a target in context, the first stage disambiguates it
to a semantic frame. This model employs latent variables and semi-supervised learning to
improve frame disambiguation for targets unseen at training time. The second stage finds the
target's locally expressed semantic arguments. At inference time, a fast exact dual
decomposition algorithm collectively predicts all the arguments of a frame at once in order to
respect declaratively stated linguistic constraints, resulting in qualitatively better structures
than naïve local predictors. Both components are feature-based and discriminatively trained
on a small set of annotated frame-semantic parses. On the SemEval 2007 benchmark dataset,
the approach, along with a heuristic identifier of frame-evoking targets, outperforms the prior
state of the art by significant margins. Additionally, we present experiments on the much
larger FrameNet 1.5 dataset. We have released our frame-semantic parser as open-source
software.
5.2.9 Sentence Splitter
Mobile is poised to become the predominant platform over which people are
accessing the World Wide Web. Recent developments in speech recognition and
understanding, backed by high bandwidth coverage and high quality speech signal acquisition
on smartphones and tablets are presenting the users with the choice of speaking their web
search queries instead of typing them. A critical component of a speech recognition system
targeting web search is the language model. The chapter presents an empirical exploration of
the google.com query stream with the end goal of high quality statistical language modeling
for mobile voice search. Our experiments show that after text normalization the query stream
is not as "wild" as it seems at first sight. One can achieve out-of-vocabulary rates below 1%
using a one million word vocabulary, and excellent n-gram hit ratios of 77/88% even at high
orders such as n=5/4, respectively. A more careful analysis shows that a significantly larger
vocabulary (approx. 10 million words) may be required to guarantee at most 1% out-of-
vocabulary rate for a large percentage (95%) of users. Using large scale, distributed language
models can improve performance significantly---up to 10% relative reductions in word-error-
rate over conventional models used in speech recognition. We also find that the query stream
is non-stationary, which means that adding more past training data beyond a certain point
provides diminishing returns, and may even degrade performance slightly. Perhaps less
surprisingly, we have shown that locale matters significantly for English query data across
USA, Great Britain and Australia. In an attempt to leverage the speech data in voice search logs,
we successfully build large-scale discriminative N-gram language models and derive small but
significant gains in recognition performance.
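The out-of-vocabulary rate and n-gram hit ratio discussed above can be computed as follows; this is a minimal Python sketch in which the vocabulary and the model's n-gram inventory are assumed inputs:

```python
def oov_rate(query_tokens, vocabulary):
    """Fraction of query tokens not covered by the vocabulary."""
    if not query_tokens:
        return 0.0
    unknown = sum(1 for tok in query_tokens if tok not in vocabulary)
    return unknown / len(query_tokens)

def ngram_hit_ratio(tokens, n, known_ngrams):
    """Fraction of the stream's n-grams found in the model's inventory."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in known_ngrams) / len(grams)
```

Text normalization would be applied to the query stream before these measurements, which is what brings the rates down to the levels reported above.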
5.2.10 N-Gram Tool
Empty categories (EC) are artificial elements in Penn Treebanks motivated by the
government-binding (GB) theory to explain certain language phenomena such as pro-drop.
ECs are ubiquitous in languages like Chinese, but they are tacitly ignored in most machine
translation (MT) work because of their elusive nature. In this paper we present a
comprehensive treatment of ECs by first recovering them with a structured MaxEnt model
with a rich set of syntactic and lexical features, and then incorporating the predicted ECs into
a Chinese-to-English machine translation task through multiple approaches, including the
extraction of EC-specific sparse features. We show that the recovered empty categories not
only improve the word alignment quality, but also lead to significant improvements in a
large-scale state-of-the-art syntactic MT system.
5.2.11 Classifier (e.g. Smoking Status Classifier)
Many highly engineered NLP systems address the benchmark tasks using linear statistical
models applied to task-specific features. In other words, the researchers themselves discover
intermediate representations by engineering ad-hoc features. These features are often derived
from the output of pre-existing systems, leading to complex runtime dependencies. This
approach is effective because researchers leverage a large body of linguistic knowledge. On
the other hand, there is a great temptation to over-engineer the system to optimize its
performance on a particular benchmark at the expense of the broader NLP goals.
In this contribution, we describe a unified NLP system that achieves excellent performance
on multiple benchmark tasks by discovering its own internal representations. We have
avoided engineering features as much as possible and we have therefore ignored a large body
of linguistic knowledge. Instead we reach state-of-the-art performance levels by transferring
intermediate representations discovered on massive unlabelled datasets. We call this
approach “almost from scratch” to emphasize this reduced (but still important) reliance on a
priori NLP knowledge.
8.1 INTRODUCTION
Software testing is a critical element of software quality assurance and represents the
ultimate review of specification, design and coding. In fact, testing is the one step in the software
engineering process that could be viewed as destructive rather than constructive.
A strategy for software testing integrates software test case design methods into a
well-planned series of steps that result in the successful construction of software. Testing is
the set of activities that can be planned in advance and conducted systematically. The
underlying motivation of program testing is to affirm software quality with methods that can
be applied economically and effectively to both large and small-scale systems.
8.2. STRATEGIC APPROACH TO SOFTWARE TESTING
The software engineering process can be viewed as a spiral. Initially system
engineering defines the role of software and leads to software requirement analysis where the
information domain, functions, behavior, performance, constraints and validation criteria for
software are established. Moving inward along the spiral, we come to design and finally to
coding. To develop computer software we spiral in along streamlines that decrease the level
of abstraction on each turn.
A strategy for software testing may also be viewed in the context of the spiral. Unit
testing begins at the vertex of the spiral and concentrates on each unit of the software as
implemented in source code. Testing will progress by moving outward along the spiral to
integration testing, where the focus is on the design and the construction of the software
architecture. Taking another turn outward on the spiral, we encounter validation testing,
where requirements established as part of software requirements analysis are validated
against the software that has been constructed. Finally we arrive at system testing, where the
software and other system elements are tested as a whole.
8.3. UNIT TESTING
Unit testing focuses verification effort on the smallest unit of software design, the module. The unit
testing performed here is white-box oriented, and for some modules the steps are conducted in parallel.
1. WHITE BOX TESTING
This type of testing ensures that
All independent paths have been exercised at least once
All logical decisions have been exercised on their true and false sides
All loops are executed at their boundaries and within their operational bounds
All internal data structures have been exercised to assure their validity.
To follow the concept of white box testing, we have tested each form we have created
independently to verify that the data flow is correct, all conditions are exercised to check their
validity, and all loops are executed on their boundaries.
2. BASIC PATH TESTING
The established technique of a flow graph with cyclomatic complexity was used to derive test cases for all the functions. The main steps in deriving test cases were:
Use the design of the code and draw the corresponding flow graph.
Determine the cyclomatic complexity of the resultant flow graph, using the formula:
V(G) = E - N + 2, or
V(G) = P + 1, or
V(G) = number of regions,
where V(G) is the cyclomatic complexity,
E is the number of edges,
N is the number of flow graph nodes,
P is the number of predicate nodes.
Determine the basis set of linearly independent paths.
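A small worked example of the formulas above (the flow graph figures here are hypothetical, not taken from this project's code):

```python
def cyclomatic_complexity(edges, nodes):
    """V(G) = E - N + 2 for a connected flow graph."""
    return edges - nodes + 2

def cyclomatic_from_predicates(predicates):
    """Equivalent form V(G) = P + 1."""
    return predicates + 1
```

A connected flow graph with E = 9 edges and N = 7 nodes gives V(G) = 4; the same graph with P = 3 predicate nodes gives P + 1 = 4, i.e., four linearly independent paths must be covered by test cases.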
3. CONDITIONAL TESTING
In this part of the testing, each of the conditions was tested for both its true and false outcomes, and all the resulting paths were tested, so that each path that may be generated by a particular condition is traced to uncover any possible errors.
4. DATA FLOW TESTING
This type of testing selects the path of the program according to the location of definition and
use of variables. This kind of testing was used only when some local variables were declared.
The definition-use chain method was used in this type of testing. These were particularly
useful in nested statements.
5. LOOP TESTING
In this type of testing, all the loops are tested at all the possible limits. The following exercise was adopted for all loops:
All the loops were tested at their limits, just above them and just below them.
All the loops were skipped at least once.
For nested loops, the innermost loop was tested first, then working outwards.
For concatenated loops, the values of dependent loops were set with the help of the connected loop.
Unstructured loops were resolved into nested loops or concatenated loops and tested as above.
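The loop-limit cases listed above can be generated mechanically. A minimal sketch (the function name and the candidate set are illustrative, not part of this project's test harness):

```python
def loop_test_values(lower, upper):
    """Iteration counts to exercise a loop bounded by [lower, upper]:
    zero passes (skip), the limits themselves, and just inside/outside."""
    candidates = {0, lower - 1, lower, lower + 1, upper - 1, upper, upper + 1}
    # Negative iteration counts cannot occur, so drop them.
    return sorted(c for c in candidates if c >= 0)
```

For a loop bounded between 1 and 3 iterations, this yields the counts 0 through 4: the skip case, both limits, and one step beyond each limit.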
Each unit has been separately tested by the development team itself and all the inputs have been validated.
System Security
The protection of computer-based resources, including hardware, software, data, procedures and people, against unauthorized use or natural
disaster is known as System Security.
System Security can be divided into four related issues:
* Security
* Integrity
* Privacy
* Confidentiality
SYSTEM SECURITY refers to the technical innovations and procedures applied to the hardware and operation systems to protect against deliberate or accidental damage from a defined threat.
DATA SECURITY is the protection of data from loss, disclosure, modification and destruction.
SYSTEM INTEGRITY refers to the proper functioning of hardware and programs, appropriate physical security and safety against external threats such as eavesdropping and wiretapping.
PRIVACY defines the rights of the user or organizations to determine what information they are willing to share with or accept from others and how the organization can be protected against unwelcome, unfair or excessive dissemination of information about it.
CONFIDENTIALITY is a special status given to sensitive information in a database to minimize the possible invasion of privacy. It is an attribute of information that characterizes its need for protection.
9.3 SECURITY SOFTWARE
System security refers to various validations on data in the form of checks and controls to prevent the system from failing. It is always important to ensure that only valid data is entered and only valid operations are performed on the system. The system employs two types of checks and controls:
CLIENT SIDE VALIDATION
Various client side validations are used to ensure on the client side that only valid data is entered. Client side validation saves server time and load to handle invalid data. Some checks imposed are:
VBScript is used to ensure that required fields are filled with suitable data only. Maximum lengths of the fields of the forms are appropriately defined.
Forms cannot be submitted without filling up the mandatory data so that manual mistakes of submitting empty fields that are mandatory can be sorted out at the client side to save the server time and load.
Tab-indexes are set according to the need and taking into account the ease of user while working with the system.
SERVER SIDE VALIDATION
Some checks cannot be applied at the client side. Server side checks are necessary to save the system from failing and to inform the user that some invalid operation has been performed or that the performed operation is restricted. Some of the server side checks imposed are:
Server side constraints have been imposed to check the validity of primary and foreign keys. A primary key value cannot be duplicated; any attempt to duplicate a primary key value results in a message informing the user about those values. Forms using foreign keys can be updated only with existing foreign key values.
The user is informed through appropriate messages about successful operations or exceptions occurring at the server side.
Various Access Control Mechanisms have been built so that one user may not interfere with another. Access permissions for various types of users are controlled according to the organizational structure. Only permitted users can log on to the system and have access according to their category. User names, passwords and permissions are controlled on the server side.
Using server side validation, constraints on several restricted operations are imposed.
We use two orthogonal methods to utilize automatically detected human
attributes to significantly improve content-based face image retrieval. Attribute-enhanced
sparse coding exploits the global structure and uses several human attributes to construct
semantic-aware code words in the offline stage. Attribute-embedded inverted indexing
further considers the local attribute signature of the query image and still ensures efficient
retrieval in the online stage. The experimental results show that using the code words
generated by the proposed coding scheme, we can reduce the quantization error and
achieve salient gains in face retrieval on two public datasets; the proposed indexing scheme
can be easily integrated into inverted index, thus maintaining a scalable framework. During
the experiments, we also discover certain informative attributes for face retrieval across
different datasets and these attributes are also promising for other applications. Current
methods treat all attributes as equal. We will investigate methods to dynamically decide the
importance of the attributes and further exploit the contextual relationships between them.