Faculty of Humanities and Philosophy
Specialist Degree Course in Business and Public Communication
Degree Thesis in
Informatics for Electronic Commerce
Applying semantically enhanced web mining techniques for building a domain ontology
Supervisors: Prof. Ernesto D'Avanzo, Prof. Annibale Elia, Prof. Tsvi Kuflik
Candidate: Federica Marano (matr. 0320/400100)
Academic year 2007-2008
Acknowledgments
First of all, I thank Ernesto D'Avanzo and Tsvi Kuflik, who followed my thesis work
taking care of even the smallest detail.
I thank, especially, Annibale Elia and Emilio D'Agostino: thanks to them, I spent
my best moments of study during these years.
Thanks to Brenda Shaffer, an energy domain expert, who helped us to develop this
project.
And last but not least, I thank all those who have always supported me, close or
distant, present or absent.
Abstract
This work focused on applying semantically enhanced web mining techniques for
building a domain ontology. We mainly analyzed the ontology population problem,
because an ontology, to be useful, needs to be continuously updated with new
concepts, relations and lexical resources. There are many methods and several
techniques to do it, but we chose linguistically motivated text mining because
semantic and linguistic information is very important to build an ontology.
According to this point of view, we consider an ontology as a lexical taxonomy
created starting from a corpus. In Chapter 2 we selected the most relevant works in
the literature about linguistically motivated text mining and knowledge lexical
acquisition methodologies for ontology population, distinguishing lexical
acquisition techniques for automatic or semi-automatic ontology population from
lexical acquisition for ontology population using Keyphrase Extraction (KE)
methodologies. In Chapter 3 we discussed our case study on the Energy domain, first
focusing on LAKE (Linguistic Analysis based Knowledge Extractor), a keyphrase
extraction system based on a supervised learning approach that makes use of
linguistic processing of the documents; we used this KE algorithm for our
experiments. We then focused on lexical acquisition fundamentals, reporting, at last,
experiments and results on the Energy Domain Ontology. In Chapter 4 we concluded
the work with some remarks, proposing future work such as Local Grammar
Induction, very useful to describe the way we use a language, especially in a
specific domain, based on a combination of lexical resources and Part-of-Speech
patterns.
Ontology population is a serious problem that we addressed using two main
approaches, manual and automatic. First we retrieved 200 documents on the web
using search engines and crawlers, manually building the most important part of the
energy domain ontology. We compared three different crawlers: Infospiders, Best
First and SS Spider, focused on the Energy Security class, using queries made with
"class concept + subclass concepts" (for instance "Energy Security + Reliability of
Supply"); the best one was Infospiders, whose precision and recall results were the
highest. In a successive step we read most of the documents to manually build the
energy domain ontology. Then we considered many automatic or semi-automatic
techniques for populating the ontology, but in our experiment we used Lexical
Acquisition, training LAKE, a keyphrase extraction algorithm, with 25 linguistically
annotated documents. At last we compared manual versus automatic ontology
building and the results were quite encouraging: the averages of the two measures,
precision and recall, were 56% and 40%, respectively.
Obviously a manual building of an ontology is more accurate, because the human is
an expert lexicographer, but it is more time-consuming and expensive, so an
automatic or semi-automatic approach can be very useful in some steps. In a future
work we will enlarge the corpus for the experiment, considering 200 documents and
many more. Subsequently we want to generate a Local Grammar for the energy
domain ontology. It is a bottom-up approach and could help us describe the way we
use language in a specific domain. A statement in a Local Grammar is composed of
lexical resources and part-of-speech patterns; with it we can build not only a
controlled vocabulary, but we can also better describe local syntactic constraints.
Keywords: energy domain ontology, ontology population, linguistically motivated
text mining, lexical acquisition, Keyphrase Extraction, local grammar.
Contents
INTRODUCTION ................................................................ 6
1.1 THE PROBLEM DEFINITION AND THE RESEARCH QUESTION ................ 11
CHAPTER 2 ................................................................ 13
LITERATURE SURVEY ON LINGUISTICALLY MOTIVATED TEXT MINING AND KNOWLEDGE LEXICAL ACQUISITION METHODOLOGIES FOR ONTOLOGY POPULATION ................ 13
2.1 LEXICAL ACQUISITION FOR AUTOMATIC ONTOLOGY POPULATION ................ 13
2.1.1 Automatic Ontology Population from web documents ................ 13
2.1.2 Automatic Ontology Population by Googling ................ 16
2.2 LEXICAL ACQUISITION FOR SEMI-AUTOMATIC ONTOLOGY POPULATION ................ 18
2.3 LEXICAL ACQUISITION FOR ONTOLOGY POPULATION USING KEYPHRASE EXTRACTION METHODOLOGY ................ 22
2.3.1 Keyphrase Extraction Algorithms ................ 23
2.3.2 Relevant Keyphrase Extraction Algorithms ................ 30
2.4 LINGUISTIC AND STATISTIC PHRASES: SOME REMARKS ................ 48
CHAPTER 3 ................................................................ 52
THE ENERGY DOMAIN CASE STUDY ................................ 52
3.1 LEXICAL ACQUISITION FOR ONTOLOGY POPULATION ................ 61
3.1.1 Ontological organization ................ 61
3.1.2 Parsing with Case Grammar ................ 64
3.1.3 Lexical Acquisition Fundamentals ................ 65
3.1.4 Lexical Acquisition Evaluation ................ 72
3.1.5 The role of lexical acquisition in statistical NLP ................ 73
3.2 EXPERIMENTS WITH ENERGY DOMAIN ONTOLOGY ................ 73
3.2.1 Methods and tools ................ 73
3.2.2 Results ................ 79
3.3 DISCUSSION ................ 81
CHAPTER 4 ................................................................ 82
CONCLUSIONS ................ 82
4.1 FUTURE WORKS ................ 82
APPENDIX ................................................................ 84
KEYPHRASES MANUALLY EXTRACTED ................ 84
BIBLIOGRAPHY ................................................................ 96
Chapter 1
Introduction
In this pilot study we worked on building a domain ontology for the energy domain.
But what is an ontology? It is a philosophical term that concerns being or existence
and its basic categories and relationships, aiming to determine what entities and
what types of entities exist.
In this work, we will treat ontologies as they are used in Computer Science.
Referring to Artificial Intelligence, the Semantic Web and Knowledge
Representation, we can find different definitions, for example:
• An ontology is a document or file that formally defines the relations among
terms. The most typical kind of ontology for the Web has a taxonomy and a
set of inference rules (Berners-Lee, 2001).
• Gruber defines an ontology as an explicit specification of a
conceptualization. A conceptualization is an abstract, simplified view of the
world that we wish to represent for some purpose (Gruber, 1993).
The Web Ontology Language (OWL) has the following elements:
• Individuals/Instances
Members of a class: the "ground level" components of an ontology. They include
concrete objects such as people, animals, things, etc., or abstract objects
such as numbers and words.
• Classes
Categories created while building the ontology. They are abstract groups or
collections of concepts and objects, and may contain individuals, other classes
or both. For instance, in our case "energy source" is a class
containing as individuals "oil", "gas", etc.
• Relationships/Properties
Binary relations between individuals or classes. They describe a sort of link
between objects in the ontology. The main relation is subsumption,
expressing an IS-A relation (for instance, "oil" IS AN "energy source"), but
there are several types of relation. They are of three types:
1. Object: links an individual with a class or with another instance.
2. Datatype: creates an external link with an XML file or an RDF
Schema.
3. Annotation: adds generic information.
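As a rough, purely illustrative sketch (ours, not drawn from any OWL tool), the class/individual/IS-A structure above can be modeled with plain Python dictionaries, reusing the "energy source" example:

```python
# Minimal sketch of an ontology fragment: one class, its individuals,
# and an IS-A (subsumption) check, using plain Python dicts.
ontology = {
    "classes": {"energy source": {"subclass_of": None}},
    "individuals": {
        "oil": {"instance_of": "energy source"},
        "gas": {"instance_of": "energy source"},
    },
}

def is_a(name, cls, onto):
    """Check whether an individual IS A member of the given class."""
    entry = onto["individuals"].get(name)
    return entry is not None and entry["instance_of"] == cls

print(is_a("oil", "energy source", ontology))  # True
```

A real OWL ontology would of course also carry datatype and annotation properties and be handled by a reasoner; this sketch only covers the subsumption example from the text.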
This formalized knowledge can be processed automatically by a machine through a
reasoner that implements inferential and deductive processes.
For many applications, an ontology plays a role in a larger system that can be
designed much like any other complex information system, beginning with a
context of use and requirements analysis. However, for many applications, the
ontology plays the role of mediator for reusable knowledge assets. In such a
situation, the value added by the ontology is only as good as its ability to organize
materials that have not yet been encountered (Allemang, 2006).
Many technologies that now go by the name of "Ontologies" promise a new
paradigm of semantic information sharing. The grandest vision of this is the
Semantic Web, in which the enormous body of data available on the web will be
organized in a way that allows it to be indexed by its meaning, not just by its form
(Allemang, 2006).
An ontology is a valid support to the decision-making process if we consider, for
example, the information needs of a web user. Nowadays the WWW is a very
important tool used for working, studying, communicating and for many other
interests. It contains documents, images, and other multimedia resources about
every topic; everything is immediately available online for everyone with an
Internet connection.
Considering that there are about one billion web pages, it is statistically
impossible to retrieve the information we are interested in at the first attempt.
Why can't we retrieve the correct information on the Web?
It is simple: because, nowadays, information on the web is retrieved using search
engines that consider all documents equal, since information is created only for
human consumption and is not machine readable.
Tim Berners-Lee proposed the Semantic Web as a solution (Berners-Lee, 2001): a
sort of machine-readable Web that uses intelligent agents in order to guide the user
to the specific desired information and to help him carry out some operations
automatically.
The Semantic Web uses schemas to describe domain information, so every domain
needs to be described by a specific schema in which metadata map this
information onto classes or concepts belonging to the domain.
In this way we obtain data structures able to link pieces of information to each
other. The Semantic Web is composed of three levels:
1. Data
The information that describes everything.
2. Meta data
Map data representing a concept into a schema.
3. Ontology
Describes relations between concepts using data classes.
When a user searches for information using search engines, he submits queries
based on keywords in order to get relevant information. These keywords may in
fact be concepts and relations in a domain ontology, and the Semantic Web uses
them in order to create a controlled vocabulary that is unambiguous and machine
readable.
Ontologies give an explicit conceptualization in order to describe semantic data
using languages syntactically and semantically richer than a common database,
obtaining not only a domain knowledge representation, but also a specific point of
view on the domain.
Ontologies are created according to some main principles:
• Exportability
The system must be independent of the application and must be exportable
to other domains.
• Interoperability
The need to share a common representation of information within a group,
using a modular structure.
• Semantic Interoperability
In this way, systems that use different knowledge representations can
communicate using the ontology.
• Modeling
To make explicit inferences of knowledge about a specific domain.
The use of ontologies in a Semantic Web view accords with Web Usability
theories, which study the best approach to web pages when users surf the Web.
Surfing the Web implies rapid and free eye movement, pointing to its
importance among designers and users alike. It has also been long established
(Brambring, 1984; Chieko et al., 1998) that this potentially complex and difficult
movement is further complicated, and becomes neither rapid nor free, if the user is
visually impaired.
Harper and Patel (2005) confirmed that people use “Gist” summaries as support to
decision making within their browsing behavior.
Harper and Patel (2005) have investigated four simple summarization algorithms
against each other and a manually created summary, producing empirical evidence
as a formative evaluation. This evaluation concludes that users prefer simple
automatically generated "gist" summaries, thereby reducing cognitive overload and
increasing awareness of the focus of the Web page under investigation.
Users do not read Web pages, they scan them, and so summaries can be important
elements of Web pages to facilitate scanning and browsing. Hence Harper and
Patel (2005) focused on the development of a Firefox-based tool which creates
summaries of Web pages.
Our work, instead, is based on the Keyphrase (also called keyword) Extraction
approach, very similar to summarization techniques and useful to succinctly
summarize and characterize documents, providing semantic metadata as well.
1.1 The problem definition and the research question
The main problem concerning ontology building is ontology population, because
an ontology, to be useful, needs to be continuously updated with new concepts,
relations and lexical resources. To add new information to the ontology, it is
necessary to retrieve that information from new documents taken, for example,
from the web. But it takes a lot of time for a human to read every document in
order to extract new information. This work focused on applying semantically
enhanced web mining techniques for semi-automatically building a domain
ontology. There are many methods and several techniques to do it, but we chose
linguistically motivated text mining because semantic and linguistic information is
very important to build an ontology. According to this point of view, we consider
an ontology as a lexical taxonomy created starting from a corpus. Using linguistic
and semantic information in ontology building allows us to create a more complete
structure, useful in many applications. For instance, we could create an "intelligent"
search engine dedicated to the energy domain, in which users could retrieve
information using phrases in natural language and not only some of the more
relevant words as keywords. In this way we could train this search engine to answer
in natural language and to return results using linguistic information.
In Chapter 2 we selected the most relevant works in the literature about
linguistically motivated text mining and knowledge lexical acquisition
methodologies for ontology population, distinguishing lexical acquisition
techniques for automatic or semi-automatic ontology population from lexical
acquisition for ontology population using Keyphrase Extraction (KE)
methodologies.
In Chapter 3 we discussed our case study on the Energy domain, first focusing on
LAKE (Linguistic Analysis based Knowledge Extractor), a keyphrase extraction
system based on a supervised learning approach that makes use of linguistic
processing of the documents; we used this KE algorithm for our experiments. We
then focused on lexical acquisition fundamentals, reporting, at last, experiments
and results on the Energy Domain Ontology.
In Chapter 4 we concluded the work with some remarks, proposing future work
such as Local Grammar Induction, very useful to describe the way we use a
language, especially in a specific domain, based on a combination of lexical
resources and Part-of-Speech patterns.
Chapter 2
Literature Survey on linguistically motivated text mining
and knowledge lexical acquisition methodologies for
ontology population
To be a useful tool, an ontology needs to be continuously updated with new
concepts, relations and lexical resources. In the following pages we will show
several methods and techniques used to populate a domain ontology. We will
mainly focus on approaches relevant to our experiments on lexical acquisition and
linguistically motivated text mining. For the former, this includes developing
algorithms and statistical techniques for filling the holes in existing machine-
readable dictionaries by looking at the occurrence patterns of words in large text
corpora. For the latter, this includes surveying linguistic approaches to text mining
that use more linguistic and semantic information, beneficial to populating an
ontology.
2.1 Lexical Acquisition for Automatic Ontology Population
2.1.1 Automatic Ontology Population from web documents
Alani et al. (2003) developed Artequakt, a system that automatically extracts
knowledge about artists from the Web, populates a knowledge base, and uses it to
generate personalized biographies. Artequakt connects a knowledge-extraction tool
with an ontology to get knowledge support for the information extracted. The
extraction tool searches for documents on the web and extracts knowledge, then
compares those results with the given classification structure. This knowledge is
converted into a machine-readable format that is automatically stored in a
knowledge base (KB). A lexicon-based term expansion mechanism is used to
increase knowledge extraction and provides extended ontology terminology.
Artequakt’s architecture (figure 1) includes three areas. First, knowledge extraction
tools assemble information items along with sentences and paragraphs from Web
documents selected manually or obtained automatically by a search engine. The
tools deliver the information fragments to the ontology server along with metadata
derived from the vocabulary. To identify knowledge fragments Artequakt uses
WordNet (www.cogsci.princeton.edu/~wn), a general-purpose lexical database,
and GATE (General Architecture for Text Engineering, http://gate.ac.uk), an entity
recognizer. These tools allow Artequakt to obtain knowledge consisting of not just
entities but also the relations between them.
Figure 1. The Artequakt architecture.
(Alani et al., 2003)
The Knowledge Extraction (KE) module automatically populates the ontology with
information extracted from online documents on the basis of the given ontology’s
representations and WordNet lexicons. Information Extraction tools can recognize
entities in documents like “Rembrandt”, a person, or “15 July 1606”, a date.
However, such information isn't very useful without the relation between these
entities, that is, "Rembrandt was born on 15 July 1606". Extracting such relations
automatically allowed them to acquire more complete knowledge to populate the
ontology. Artequakt attempts to identify entity relationships using ontology relation
declarations and lexical information.
Second, the ontology server stores and consolidates the information so that the
biography generation tool can query the KB using an inference engine. Storing
information in a structured KB supports different knowledge services as that
reconstructing the original source material to produce a dynamic presentation
tailored to user needs. After these steps the system has to populate ontologies with
many high-quality instantiations in order to get valuable and consistent ontology
based knowledge services, adding information into the KB following the ontology
domain representation.
Third, the Artequakt server takes user requests to generate narratives through a
simple web interface. The user might request a particular style of biography, such
as a chronology or summary, or a specific focus such as the artist’s style or body of
work. The server uses story templates to render a narrative from the KB.
The ontology server used for this project uses Java sockets and connects to the
Artequakt KB through the Protégé application programming interface. An
inference engine built on this server allows querying the KB to retrieve specific
information. Artequakt’s ontology server sends some extracted knowledge to a
relational database for quick access to frequently used information through SQL
queries when generating biographies.
2.1.2 Automatic Ontology Population by Googling1
Automatic ontology population using the Google.com search engine is a method,
inspired by Hearst (1992), to populate ontologies with the use of googled text
fragments (Gijs Geleijnse and Jan Korst, 2005), consisting of patterns that express
arbitrary relations between instances of classes.
Hearst patterns (Hearst, 1992) are lexico-syntactic patterns that indicate the
existence of a class/subclass relation in an unstructured data source, e.g. web pages.
According to Hearst, given two terms t1 and t2, we are able to record how many
times some of these patterns indicate an "is-a" relation between them, identifying
hypernym-hyponym relations.
Although these patterns occur quite rarely in unstructured data, they provide
reliable and valuable information.
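As an illustration of ours (not the authors' code), one classic Hearst pattern, "NP such as NP, NP", can be approximated with a regular expression over raw text:

```python
import re

# Simplified Hearst pattern "<hypernym> such as <hyponym>(, <hyponym>)*".
# Real systems match over part-of-speech-tagged noun phrases, not raw words.
PATTERN = re.compile(r"(\w+(?: \w+)?) such as ((?:\w+, )*\w+)")

def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs suggested by the pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        for hyponym in match.group(2).split(", "):
            pairs.append((hyponym, hypernym))  # hyponym is-a hypernym
    return pairs

text = "The country imports energy sources such as oil, gas and coal."
print(hearst_pairs(text))  # [('oil', 'energy sources'), ('gas', 'energy sources')]
```

Note that the final conjunct ("and coal") is missed by this deliberately simple expression; published implementations handle coordination ("and", "or") explicitly.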
The method is based on hand-crafted patterns which are tailor-made for the classes
and relations considered. These patterns are queried to Google.com, where the
results are scanned for new instances. Instances found can be used within these
patterns as well, so the algorithm can populate an ontology based on a few
instances in a given partial ontology. The algorithm they describe uses instances of
some classes returned by Google to find instances of other classes. For each
pattern, which represents a binary relation between two objects A and B (A-relation-
B), they can formulate two Google queries: A-relation and relation-B.
1 Googling is a neologism that means to search for information on the web using Google.com as a search engine.
Because Google.com retrieves web pages, not the information itself, the authors
developed a method to automatically extract information from the web using a
wrapper algorithm that crawls a large website and makes use of the homogeneous
presentation of the information on its pages. When instances are denoted in exactly
the same place on each similar page within a website, it is easy to extract them and
to update the ontology.
Gijs Geleijnse and Jan Korst (2005) identify instances with Google's support. After
extracting a term, they check whether the extracted term is really an instance of the
class. They search the Google search engine for phrases that express the term-class
relation. Again, these phrases can be constructed semiautomatically; Hearst
patterns are candidates for this purpose as well. A term is accepted as an instance
when the number of hits of the queried phrase is above a given threshold.
To test their system, Gijs Geleijnse and Jan Korst (2005) selected a small partial
ontology of the movie domain. They identified three classes, of which only the
class Director has instances. In doing so they found movies directed by these
directors. The movies found were used to find starring actors, which in turn
represent the basis of the search for other movies in which they played. The
process continues until no new instances can be found.
With a starting set of only two directors, they found a well-populated ontology
with not only directors, but also actors and movie titles.
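The bootstrapping loop just described (directors lead to movies, movies to actors, actors to further movies, until nothing new is found) can be sketched as follows. This is our own simplification: `find_related` stands in for the pattern-based Google queries, and the lookup table's contents are merely illustrative:

```python
# Sketch of the directors -> movies -> actors -> movies bootstrap.
# RELATED simulates the pattern-based web queries with a tiny lookup table.
RELATED = {
    ("Director", "Scorsese"): [("Movie", "Taxi Driver")],
    ("Movie", "Taxi Driver"): [("Actor", "De Niro")],
    ("Actor", "De Niro"): [("Movie", "Taxi Driver"), ("Movie", "Casino")],
    ("Movie", "Casino"): [("Actor", "De Niro")],
}

def find_related(instance):
    return RELATED.get(instance, [])

def populate(seeds):
    known = set(seeds)
    frontier = list(seeds)
    while frontier:  # the process stops when no new instances are found
        new = [rel for inst in frontier for rel in find_related(inst)
               if rel not in known]
        known.update(new)
        frontier = new
    return known

print(sorted(populate([("Director", "Scorsese")])))
```

Starting from a single director, the loop reaches two movies and one actor and then halts, mirroring the fixed-point behavior described in the text.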
2.2 Lexical Acquisition for Semi-Automatic Ontology
Population
Maria Vargas-Vera and David Celjuska (2004) describe Ontosophie, a system for
semi-automatic population of ontologies that extracts instances from unstructured
text. It learns extraction rules from annotated text and then applies them to new
articles to populate the ontology.
The work aims to accomplish three purposes:
1. identifying key entities in text articles that could participate in ontology
population with instances. The system identifies important entities in a
document, puts them as values in slots v1, v2, . . . , vNi, and thus constructs
an instance composed of these features, Ii = (v1, . . . , vNi), in the given
ontology O. For example, in the class "Conferring an Award" some possible
slots are "has a duration", "has a location", "has an awarding body
(an organization)", etc.;
2. identifying the most probable classes for the population, based on newly
introduced confidence values;
3. semi-automatically populating an ontology with those instances.
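As a small sketch of ours, an instance Ii = (v1, . . . , vNi) can be represented as a record of slot-value pairs; the slot names come from the "Conferring an Award" example above, while the values are invented purely for illustration:

```python
# An instance of the class "Conferring an Award", built from slot values.
# The values here are hypothetical placeholders.
instance = {
    "class": "Conferring an Award",
    "slots": {
        "has a duration": "one evening",
        "has a location": "Stockholm",
        "has an awarding body": "Nobel Foundation",
    },
}

def feature_tuple(inst):
    """Return the feature tuple (v1, ..., vN) of an instance."""
    return tuple(inst["slots"].values())

print(feature_tuple(instance))
```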
The system consists of the following 3 steps:
1. Text Annotation
The system, using supervised learning, is trained to learn extraction rules from a
set of examples, consisting of a set of documents annotated with XML tags and
assigned to one of the predefined classes within the ontology O. Each slot within
the ontology is assigned a unique XML tag; the mark-up step is ontology driven.
Once the user identifies a desired class for a displayed document from the
ontology, he is only offered the tags relevant for the annotation.
2. System Learning, divided in turn into 3 steps
• Natural Language Processing (NLP)
Ontosophie uses shallow parsing to recognize syntactic constructs without
generating a complete parse tree for each sentence. Shallow parsing has the
advantages of higher speed and robustness. High speed is necessary to apply
Information Extraction (IE) to a large volume of documents. The robustness
achieved by using shallow parsing is essential to deal with unstructured texts. In
particular, Ontosophie uses Marmot2, an NLP system that accepts ASCII files
and produces an intermediate level of text analysis that is useful for IE
applications. Sentences are separated and segmented into noun phrases, verb
phrases and other high-level constituents.
After each document has been annotated and pre-processed with the NLP tool, the
generation of extraction rules takes place.
• Generating extraction rules
This phase makes use of Crystal3, a conceptual dictionary induction system
(Soderland et al., 1995).
Crystal derives a dictionary of concept nodes and extraction rules from a training
corpus, based on a specific-to-general algorithm. For instance, an extraction rule
might be understood as follows: conferring-an-award: <VB-P "been awarded">
<OBJ1 ANY> <PP "by" has-awarding-body>, Coverage: 5, Error: 1.
2 Marmot was developed at the University of Massachusetts, MA, USA.
3 Crystal was developed at the University of Massachusetts, MA, USA.
The selected rule's purpose is to extract conferring-an-award, which refers to the
name of a class in the ontology. This extraction rule is aimed at extracting
has-awarding-body, the name of a donor.
The rule fires only if all the constraints are satisfied. This means that the entity
conferring-an-award is extracted from a sentence, or part of one, only in the case
where it contains "has been awarded" as a passive verb (VB-P), an object (OBJ1)
that might be anything, and a prepositional phrase (PP) which starts with the
preposition "by". When this rule fires, the prepositional phrase (PP) is extracted as
has-awarding-body. In addition, Crystal provides two values, Coverage and
Error. In this particular example they state that the rule covers five instances (one
incorrectly) in the corpus the rule was generated from.
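A hedged sketch (ours, not Crystal's implementation) of how such a rule might fire over a shallow-parsed sentence, represented as (constituent tag, text) pairs; the sentence itself is invented:

```python
# Fire the example rule on a shallow-parsed sentence. The constraints
# mirror the rule above: a passive verb phrase containing "been awarded",
# any object, and a PP starting with "by", extracted as has-awarding-body.
RULE = {"VB-P": "been awarded", "PP_prefix": "by"}

def fire(rule, constituents):
    tags = dict(constituents)
    if rule["VB-P"] not in tags.get("VB-P", ""):
        return None                       # passive verb constraint failed
    if "OBJ1" not in tags:
        return None                       # an object must be present
    pp = tags.get("PP", "")
    if not pp.startswith(rule["PP_prefix"]):
        return None                       # PP must start with "by"
    return pp                             # extracted as has-awarding-body

sentence = [("SUBJ", "The prize"), ("VB-P", "has been awarded"),
            ("OBJ1", "to the laureate"), ("PP", "by the foundation")]
print(fire(RULE, sentence))  # by the foundation
```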
• Assigning rule confidence values to extracted rules
Another step is assigning rule confidence values to the extracted rules.
Experimentation showed that some extraction rules learnt by Crystal are very weak
and therefore fire too often, while other rules might be overly specific.
In addition, previous experiments (Riloff, 1996) showed that precision improves if
those rules are manually removed. However, the goal here was to control this
automatically and to eliminate rules with low confidence. For this purpose,
Ontosophie attaches a confidence value to each rule. The rule confidence
expresses how sure the system is about the rule itself.
Ontosophie has two ways of computing the rule confidence value. The first method
uses the Coverage and Error values provided by Crystal. The confidence for rule ri
is computed as Cri = cri/nri = (Coverageri − Errorri)/Coverageri, where cri is the
number of times rule ri has fired correctly and nri is the number of times the rule
has fired in total. However, this does not distinguish between, for example, C2 =
(2 − 0)/2 and C10 = (10 − 0)/10, because C2 = C10 = 1.0. But C10 is more accurate
and has higher support, because in that case the rule fired correctly ten times out of
ten, while the other fired correctly only two times out of two. This is why
Ontosophie uses the Laplace Expected Error Estimate (Clark and Boswell, 1991),
defined as 1 − LaplaceAccuracy, where LaplaceAccuracy = (nc + 1)/(ntot + k), nc is
the number of examples in the predicted class covered by the rule, ntot is the total
number of examples covered by the rule and k is the number of classes in the
domain. Applying the Laplace accuracy, the confidence valuation is then
Cri = (cri + 1)/(nri + 2), with k = 2 because each rule deals with two classes: one,
the rule fires, and two, the rule does not fire.
One might note that if Cri = 0.5 the rule fires correctly as often as it does
incorrectly, and so it should be eliminated.
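The two confidence measures can be computed directly (a minimal sketch of ours, not Ontosophie code):

```python
def confidence_naive(coverage, error):
    """C = (Coverage - Error) / Coverage, as provided by Crystal."""
    return (coverage - error) / coverage

def confidence_laplace(correct, total):
    """Laplace-corrected confidence C = (c + 1) / (n + 2), with k = 2."""
    return (correct + 1) / (total + 2)

# The naive measure cannot tell a 2/2 rule from a 10/10 rule ...
assert confidence_naive(2, 0) == confidence_naive(10, 0) == 1.0
# ... while the Laplace correction favors the better-supported rule.
print(confidence_laplace(2, 2))    # 0.75
print(confidence_laplace(10, 10))  # 0.9166...
```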
The second method computes the confidence of each rule by the k-Fold Cross
Validation methodology (Mitchell, 1997) on the training set. At each run a new set
of extraction rules is generated by Crystal. This algorithm (Celjuska, 2004)
computes for each rule ri how many times it fired correctly (cri) and how many
times it fired in total (nri), performs merging of identical rules and assigns to each
rule a value xri that tells how many times the rule was merged. At each run, after
all the rules have been generated by Crystal, Ontosophie enters an evaluation state,
based on the extraction, in order to recognize whether an extracted entity is correct
or not.
3. Extraction and ontology population
The system extracts appropriate entities from an article and feed a newly
constructed instances in order to populate the ontology. The extraction is run class
by class. Firstly, a set of extraction rules for only one specific class from the
ontology is taken and only those rules are used for the extraction. The step is then
repeated for all the classes within the ontology and thus for each class the system
gets a set of entities.
It might happen that the extraction component extracts more than one value for a
given slot name. This collision has to be resolved.
If more than one rule extracts the same entity, the value confidence Cvalue is
computed as the maximum over the confidences of the rules that fired it. The same
applies to the slot confidence Cslot.
If more than one value was extracted for a slot, only the value with the highest
confidence is considered and pre-selected.
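A minimal sketch of this collision-resolution policy (identical extractions take the maximum rule confidence, and for each slot only the best value is pre-selected; the slot and value names below are invented for illustration):

```python
from collections import defaultdict

def resolve_collisions(extractions):
    """`extractions` holds (slot, value, confidence) triples from fired rules.
    Identical values keep the maximum confidence (Cvalue); for each slot only
    the value with the highest confidence is pre-selected."""
    best_value = defaultdict(float)
    for slot, value, conf in extractions:
        best_value[(slot, value)] = max(best_value[(slot, value)], conf)
    winners = {}
    for (slot, value), conf in best_value.items():
        if slot not in winners or conf > winners[slot][1]:
            winners[slot] = (value, conf)
    return winners

fired = [("location", "Haifa", 0.6),
         ("location", "Haifa", 0.8),    # same entity from two rules -> max
         ("location", "Salerno", 0.7)]
resolve_collisions(fired)  # {'location': ('Haifa', 0.8)}
```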
2.3 Lexical Acquisition for Ontology Population using
Keyphrase Extraction methodology
An important approach to lexical acquisition and ontology population is
Keyphrase Extraction (KE) (D’Avanzo, 2005). A keyphrase is a “textual unit
usually larger than a word but smaller than a full sentence” (Caropreso, 2002).
From an operative perspective, keyphrases are a useful way to succinctly
summarize and characterize documents, providing semantic metadata as well. The
term “syntactic phrase”, instead, denotes any phrase that is well formed according
to the grammar of the language under consideration, while a “statistical phrase is
any sequence of words that occurs contiguously in a text” (Caropreso, 2002). In
the following, different kinds of keyphrases are analyzed; depending on the
approach proposed, they range from statistically based keyphrases, such as those
used by Turney (2002) and Witten et al. (1999), to more linguistically based
keyphrases such as those proposed by Hulth (2003). Keyphrases do not only work
as brief summaries of a document’s contents, as Turney pointed out (Turney,
1999); they can also be used in information retrieval systems “as descriptions of
the document returned by a query, as the basis for search indexes, as a way of
browsing a collection, and as a document clustering technique” (Turney, 1997).
Even though keyphrases are widespread in the context of journal articles, many
other types of documents could benefit from their use, including Web pages, email
messages, news reports, magazine articles, and business papers. Although the
potential benefit of keyphrases is large, the vast majority of documents are
currently not furnished with keyphrases, because assigning them manually is
impractical.
2.3.1 Keyphrase Extraction Algorithms
A common approach shared by the systems described below concerns the pre-processing
stage (Turney, 1997, Jones et al. 2002). This step is directly related to
the choice of potential keyphrases and consists of input cleaning, phrase
identification, and stemming (Salton, 1988).
Pre-Processing Stage
It consists of the following steps:
• Input cleaning
• Phrase identification
• Case-folding and stemming
Input cleaning
The input stream (usually an ASCII file) is split into tokens (sequences of letters,
digits and internal periods), and then several modifications are made:
• punctuation marks, brackets, and numbers are replaced by phrase
boundaries;
• apostrophes are removed;
• hyphenated words are split in two;
• remaining non-token characters are deleted, as are any tokens that do not
contain letters.
The result is a set of lines, each a sequence of tokens containing at least one letter.
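The cleaning steps above might be approximated as follows (a rough sketch; the original tokenizer's exact handling of internal periods and non-token characters is not reproduced):

```python
import re

def clean_input(text: str):
    """Split a raw stream into lines of tokens: apostrophes are removed,
    hyphenated words are split, punctuation/brackets/numbers become phrase
    boundaries, and tokens without any letter are dropped."""
    text = text.replace("'", "")                 # drop apostrophes
    text = text.replace("-", " ")                # split hyphenated words
    # punctuation, brackets and standalone numbers -> phrase boundaries
    text = re.sub(r"[.,;:!?()\[\]{}]|\b\d+\b", "\n", text)
    lines = []
    for chunk in text.split("\n"):
        tokens = [t for t in chunk.split() if re.search("[a-zA-Z]", t)]
        if tokens:                               # keep lines with a letter token
            lines.append(tokens)
    return lines

clean_input("The energy-policy debate, in 2008, wasn't over.")
# [['The', 'energy', 'policy', 'debate'], ['in'], ['wasnt', 'over']]
```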
Part I – Phrase identification
In this stage a set of heuristically motivated rules identifies the phrases. The rules
used by the algorithms below are:
• Candidate phrases are limited to a certain maximum length (usually three
words).
• Candidate phrases cannot be proper names (i.e. single words that only ever
appear with an initial capital).
• Candidate phrases cannot begin or end with a stopword.
The stopword list contains 425 words in nine syntactic classes (conjunctions,
articles, particles, prepositions, pronouns, anomalous verbs, adjectives, and
adverbs). For most of these classes, all the words listed in an on-line dictionary
were added to the list. All contiguous sequences of words in each input line are
tested using the three rules above, yielding a set of candidate phrases.
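Candidate generation under these rules can be sketched as follows (the stoplist is a tiny illustrative subset, not the 425-word list; the proper-name rule, which needs capitalization statistics over the whole text, is omitted):

```python
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "for", "to", "is"}

def candidate_phrases(tokens, max_len=3):
    """All contiguous sequences up to max_len words that neither begin nor
    end with a stopword."""
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            phrase = tokens[i:j]
            if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
                continue
            candidates.append(" ".join(phrase))
    return candidates

candidate_phrases(["demand", "for", "energy"])
# ['demand', 'demand for energy', 'energy']
```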
Case-folding and stemming
An important step in determining candidate phrases is to case-fold all words and
stem them using the iterated Lovins method. This involves using the classic Lovins
stemmer (Lovins, 1968) to discard any suffix, and repeating the process on the
stem that remains until there is no further change. So, for example, the phrase cut
elimination becomes cut elim.
Stemming and case-folding allow us to treat different variations on a phrase as the
same thing. For example, proof net and proof nets are essentially the same, but
without stemming they would have to be treated as different phrases.
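The iterated stripping itself can be sketched with a toy suffix list (illustration only; the real Lovins stemmer defines a far larger, rule-conditioned suffix inventory):

```python
# Toy suffix list for illustration only; Lovins (1968) uses ~294 endings.
SUFFIXES = ["ination", "ations", "ation", "ings", "ing", "ions", "ion", "s"]

def strip_once(word: str) -> str:
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def iterated_stem(word: str) -> str:
    """Repeat suffix stripping until the stem no longer changes."""
    prev = None
    while word != prev:
        prev, word = word, strip_once(word.lower())
    return word

iterated_stem("elimination")  # 'elim', as in the cut elimination example
```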
Part II – Candidate Selection
Machine Learning (ML) is an area of research, spawning a number of different
problems and algorithms for their solutions. Algorithms vary in their goals, in the
available training data sets, and in the learning strategies and representation of data.
All of these algorithms, however, learn by searching through an n-dimensional
space of a given data set to find an acceptable generalization (Mitchell, 1997,
Quinlan, 1986).
The most fundamental machine-learning task is inductive machine learning, where
a generalization is obtained from a set of samples, and it is formalized using
different techniques and models. Inductive learning can be defined as “the process
of estimating an unknown input-output dependency or structure of a system, using
limited number of observations or measurements of inputs and outputs of the
system” (Kantardzic, 2003).
An inductive-learning machine tries to form generalization from particular, true
facts, known as the training data set. These generalizations are formalized as a set
of functions that approximate the system’s behavior, requiring a priori knowledge
in addition to data. All inductive-learning methods use a priori knowledge in the
form of the selected class of approximating functions of a learning machine
(Kantardzic, 2003, Mitchell, 1997).
In general, the learning machine is able to implement a set of functions f (X, w),
with w ∈ W, where X is an input, w is a parameter of the function, and W is a set
of abstract parameters used only to index the set of functions. In this formulation,
the set of functions implemented by the learning machine can be any set of
functions. Ideally, the choice of a set of approximating functions reflects a priori
knowledge about the system and its unknown dependencies. In practice, because
of the complex and often informal nature of a priori knowledge, specifying such
approximating functions may in many cases be difficult or impossible (D’Avanzo,
2005).
There are two common types of inductive-learning methods:
• supervised learning (or learning with a teacher)
• unsupervised learning (or learning without a teacher).
Supervised learning is used to estimate an unknown dependency from known
input-output samples. Supervised learning assumes the existence of a teacher,
fitness function or some other external method of estimating the proposed model.
The term “supervised” denotes that the output values for training samples are
known (i.e., provided by a “teacher”).
Conceptually speaking, the teacher has knowledge of the environment, with that
knowledge being represented by a set of input-output examples. The environment
with its characteristics and model is, however, unknown to the learning system.
KE algorithms discussed by D’Avanzo (2005) make use of a supervised learning
algorithm.
In the context of supervised learning, the KE task is treated as a classification
problem. In ML terms, this means that there exists a learning function that
classifies a data item into one of several predefined classes. The training stage
uses a set of training documents for which the authors’ keyphrases are known. For
each training document, candidate phrases are identified and their feature values
are calculated. Each phrase is then marked as a keyphrase or a non-keyphrase
using the actual keyphrases for that document. This binary feature is the class
feature used by the machine learning scheme.
In this approach D’Avanzo (2005) used Naïve Bayes as the learning device. It
learns two sets of numeric weights (TF × IDF and first occurrence) from the
discretized feature values, one set applying to positive examples (is a keyphrase)
and the other to negative instances (is not a keyphrase). In this way every new
sample, even without a known output (the class to which it belongs), may be
classified correctly.
The Bayesian method provides a principled way to incorporate external
information into the data-analysis process. This process starts with an already given
probability distribution for the analyzed data set. As this distribution is given
before any data is considered, it is called a prior distribution. The new data set
updates this prior distribution into a posterior distribution. The basic tool for this
updating is the Bayes Theorem. The Bayes Theorem is a theoretical model for a
statistical approach to inductive-inferencing classification problems.
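The prior-to-posterior update for one candidate phrase with discretized features can be sketched as follows (all probabilities below are invented for illustration, not taken from any trained model):

```python
from math import prod

def posterior(prior, likelihood, features):
    """Bayes update: P(class | features) is proportional to
    P(class) * product of P(feature | class), under the Naive Bayes
    conditional-independence assumption; results are normalized to sum to 1."""
    scores = {c: prior[c] * prod(likelihood[c][f] for f in features)
              for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical discretized feature values for one candidate phrase
prior = {"keyphrase": 0.05, "not_keyphrase": 0.95}
likelihood = {
    "keyphrase":     {"tfidf=high": 0.6, "first_occ=early": 0.7},
    "not_keyphrase": {"tfidf=high": 0.1, "first_occ=early": 0.3},
}
posterior(prior, likelihood, ["tfidf=high", "first_occ=early"])
```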
TF × IDF
This feature compares the frequency of a phrase’s use in a particular document
with the frequency of that phrase in general use. General usage is represented by
document frequency, the number of documents containing the phrase in some
large corpus. A phrase’s document frequency indicates how common it is (and
rarer phrases are more likely to be keyphrases). KEA, for example, is a KE
algorithm that builds a document frequency file for this purpose using a corpus of
about 100 documents. Stemmed candidate phrases are generated from all
documents in this corpus.
The document frequency file stores each phrase and a count of the number of
documents in which it appears. With this file in hand, the TF × IDF for phrase P in
document D is:

TF × IDF (P, D) = freq(P, D) / size(D) × ( −log2 ( df(P) / N ) )

where:
• freq (P,D) is the number of times P occurs in D
• size (D) is the number of words in D
• df (P) is the number of documents containing P in the global corpus
• N is the size of the global corpus
The second term in the equation is the log of the probability that this phrase
appears in any document of the corpus (negated because the probability is less than
one). If the document is not part of the global corpus, df (P) and N are both
incremented by one before the term is evaluated, to simulate its appearance in the
corpus.
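Following the definition above (including the out-of-corpus adjustment to df(P) and N), the feature can be computed as:

```python
from math import log2

def tf_idf(freq_pd, size_d, df_p, n_docs, in_corpus=True):
    """TF x IDF for phrase P in document D, following the definition above."""
    if not in_corpus:          # simulate D's appearance in the global corpus
        df_p += 1
        n_docs += 1
    return (freq_pd / size_d) * -log2(df_p / n_docs)

# P occurs 3 times in a 300-word document and in 2 of 100 corpus documents:
tf_idf(3, 300, 2, 100)
```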
First occurrence
First occurrence is calculated as the number of words that precede the phrase’s
first appearance, divided by the number of words in the document.
The result is a number between 0 and 1 that represents how much of the document
precedes the phrase’s first appearance.
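This feature amounts to a short scan over the token sequence (a sketch; the word lists below are invented examples):

```python
def first_occurrence(words, phrase_words):
    """Fraction of the document preceding the phrase's first appearance."""
    n = len(phrase_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            return i / len(words)
    return None  # phrase does not occur in the document

first_occurrence(["energy", "policy", "drives", "energy", "demand"],
                 ["energy", "demand"])  # 3 / 5 = 0.6
```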
Discretization
Both features are real numbers and must be converted to nominal data for the
machine-learning scheme. During the training process, a discretization table for
each feature is derived from the training data. This table gives a set of numeric
ranges for each feature, and values are replaced by the range into which the value
falls. Discretization is accomplished using the supervised discretization method
described in Fayyad et al. (1993).
2.3.2 Relevant Keyphrase Extraction Algorithms
GenEx
Turney (1999) has been the pioneer in using the methodology based on the
supervised learning, with GenEx, an algorithm for KE. GenEx has two
components, the Genitor genetic algorithm and the Extractor keyphrase extraction
algorithm. Genitor’s main function is the tuning of the features of the C4.5
decision tree algorithm (Quinlan, 1986, 1983).
Extractor takes a document as input and produces a list of keyphrases as output.
Extractor has twelve parameters that determine how it processes the input text. In
GenEx, the parameters of Extractor are tuned by the Genitor genetic algorithm to
maximize performance (fitness) on training data. Genitor is used to tune Extractor,
but Genitor is no longer needed once the training process is over.
When the parameter values are known, Genitor is discarded. Thus the learning
system is called GenEx (Genitor plus Extractor) and the trained system is called
Extractor (GenEx minus Genitor).
The performance is measured by the number of matches between the machine-
generated phrases and the human-generated phrases. In particular a precision
measure is employed (the number of matches divided by the number of machine-
generated keyphrases), using a variety of cut-offs for the number of machine-
generated keyphrases. A human-generated keyphrase matches a machine-generated
keyphrase when they correspond to the same sequence of stems. A stem is what
remains when we remove the suffix from a word. By this definition, neural
networks matches neural network, but it does not match networks. The order in the
sequence is important, so helicopter skiing does not match skiing helicopter.
The experiments are based on five different document collections. For each
document, there is a target set of keyphrases, generated by hand. The average
precision obtained by Extractor on the five document collections is 29%. Turney
(1999) supported the usefulness of a human evaluation: “It is not obvious whether a
precision of, say, 29% for five phrases is good or bad. We believe that it is useful
to know that one algorithm has a precision of 29% (for a given corpus and a given
desired number of phrases) while another algorithm has a precision of 15% (for the
same corpus and the same number of phrases), but a precision of 29% has no
significance by itself. What we would really like to know is, what percentage of the
keyphrases generated by GenEx are acceptable to a human reader?”.
To this end, an on-line demonstration of GenEx has been created on the web. The
demonstration allows the user to enter any URL for processing. The software then
downloads the HTML at the given URL and sends it to the Extractor. The
keyphrases are shown to the user, who may then rate each keyphrase as good or
bad. Some or all keyphrases may be left unrated (in Turney, 1999 these are called
no opinion). The number of keyphrases is fixed at seven, to keep the interface
simple. Turney interprets the data as supporting the hypothesis that about 80% of
the keyphrases are acceptable (acceptable meaning not bad).
KEA
Witten et al. (1999) implemented their methodology in KEA, an algorithm for
automatically extracting keyphrases from text. KEA identifies candidate
keyphrases using the pre-processing as described above, calculates feature value
for each candidate, and uses a machine learning algorithm to predict which
candidates are good keyphrases.
The Naïve Bayes machine learning scheme first builds a prediction model using
training document with known keyphrases, and then uses the model to find
keyphrases in new documents. Two features are calculated for each candidate
phrase and used in training and extraction. They are: TF × IDF, a measure of
phrase’s frequency in a document compared to its rarity in general use; and first
occurrence, which is the distance into the document of the phrase’s first
appearance. KEA’s effectiveness has been assessed by counting the keyphrases that
were also chosen by the document’s author, when a fixed number of keyphrases are
extracted (the same measure used by Turney, 1999). The average precision is about
20%.
KPSpotter
KPSpotter is an algorithm implementing the methodology proposed by Song et al.
(2003). The algorithm employs Information Gain, a data mining measure
introduced in the ID3 algorithm (Quinlan, 1993), after classical pre-processing has
been applied.
In this sense KPSpotter presents some resemblances with Extractor; both
algorithms, in fact, use a learner belonging to the same family, that is, the decision
trees (Quinlan, 1996, 1986, 1993).
The outcomes of both the learning and extraction stages performed by KPSpotter
are stored in XML form. The same two features were selected for training and
extracting keyphrases as in KEA: TF × IDF and first occurrence. Moreover,
KPSpotter is able to process various types of input data, such as XML, HTML and
unstructured text, and to generate XML as output. For efficiency and performance
reasons, KPSpotter stores candidate keyphrases and their related information,
such as frequency and stemmed form, in an embedded database management
system. The performance of KPSpotter, like that of Extractor and KEA, was
measured by comparing the extracted keyphrases with those the author assigned.
To this end, the same training and test data employed in KEA’s assessment have
been used. Designers of KPSpotter argue that, according to their
preliminary experiments the quality of keyphrases extracted by their algorithm is
equivalent to KEA’s.
Hulth’s Approach
The algorithms analyzed so far share both the learning approach and the
pre-processing (with negligible differences). All systems use only a “shallow”
linguistic analysis (e.g. tokenization, stemming) (Salton, 1988). In the following,
D’Avanzo (2005) discusses the approach proposed by Hulth (2003), a keyphrase
extraction algorithm that exploits a supervised learning algorithm improved by
embedding more linguistic knowledge.
In her work Hulth (2003) tested three methodologies to find the candidate
phrases:
1. n-gram approach: in a manner similar to Turney (2000) and Frank et al.
(1999), all unigrams, bigrams, and trigrams were extracted. Thereafter a
stoplist was used, and all terms beginning or ending with a stopword were
removed. Finally all remaining tokens were stemmed using Porter’s
stemmer.
2. Chunking approach: a partial parser4 was used to select all NP-chunks from
the documents. This choice was motivated by inspecting manually assigned
keywords and observing that the vast majority turn out to be nouns or noun
phrases with adjectives. Hulth (2003) argued that this setting seems to
better capture the idea of keywords having a certain linguistic property.
3. Pattern approach: a set of Part of Speech tag patterns, 56 in total, was
defined, and all (part-of-speech tagged) words or sequences of words that
matched any of these were extracted. The patterns were the tag sequences
of the manually assigned keywords present in the training data that
occurred ten or more times.
Four features have been used in the experiments:
• Within-document frequency.
• Collection frequency.
• Relative position of the first occurrence (the proportion of the document
preceding the first occurrence).
• POS tag or tags assigned to the term by the same partial parser used for
finding the chunks and the tag patterns. When a term consists of several
tokens, the tags are treated like a sequence (e.g. an extracted phrase like
random JJ excitations NNS gets the atomic feature value JJ_NNS).
4 LT CHUNK, available at http://www.ltg.ed.c.uk/software/pos/index.html.
The learner used for the experiments is a rule induction5 algorithm, i.e. the model
constructed from the given examples consists of a set of rules. The measure used
to evaluate the results on the validation set was the F-score, which combines
precision (P) and recall (R):

F = ((β² + 1) × P × R) / (β² × P + R)

In this study, the main concern is the precision and the recall for the examples to
which the positive class has been assigned, i.e. how many of the suggested
keyphrases are correct (precision), and how many of the manually assigned
keyphrases are found (recall). As the proportion of correctly suggested
keyphrases is considered equally important as the proportion of terms assigned by
a professional indexer that were detected, β was assigned the value 1, thus giving
precision and recall equal weights.
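The F-score combining precision and recall, with β = 1 weighting them equally, can be sketched as:

```python
def f_score(precision, recall, beta=1.0):
    """F-score; beta = 1 weights precision and recall equally (F1)."""
    if precision == 0 and recall == 0:
        return 0.0               # avoid division by zero for empty results
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

f_score(0.25, 0.5)  # 2 * 0.25 * 0.5 / 0.75 = 1/3
```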
Considering first the term selection approaches, extracting NP-chunks gives better
precision, while extracting all words or sequences of words matching any of a set
of POS tag patterns gives higher recall compared to extracting n-grams
(D’Avanzo, Elia et al. 2007). Table 1 shows the results. The highest F-score is
obtained by one of the n-gram runs. The largest amount of assigned terms present
in the abstracts is assigned by the pattern approach without the tag feature. When
syntactic information is included as a feature (in the form of the POS tag(s)
assigned to the term), it is evident from the results discussed above that linguistic
information is beneficial, in this particular evaluation framework, for assigning an
acceptable number of terms per document, independent of what term selection
strategy is chosen (D’Avanzo, 2005).
5 http://www.compumine.com.
Table 1: Results obtained using different approaches. Adapted from Hulth (2003)
LinkIT
Finding potential terms by means of a PoS tagger is not a new approach. Evans et al.
(2000) used a linguistically-motivated technique for the recognition and grouping
of simplex noun phrases (SNPs) called LinkIT. The system has two key features:
1. it gathers minimal NPs, i.e. SNPs, as precisely and linguistically defined
and motivated;
2. it applies a refined set of postprocessing rules to these SNPs to link them
within a document.
The identification of SNPs is performed using a finite state machine compiled from
a regular expression grammar, and the process of ranking the candidate significant
topics uses frequency information that is gathered in a single pass through the
document. In evaluating the NP identification component of LinkIT it has been
found that it outperformed other NP chunkers in precision and recall. The system is
currently used in several applications, such as web page characterization and multi-
document summarization.
Barker’s Approach
Barker (2000) describes a system for choosing noun phrases from a document as
keyphrases. A noun phrase is chosen based on its length, its frequency and the
frequency of its head noun. Noun phrases are extracted from a text using a base
noun phrase skimmer and an off-the-shelf online dictionary. Experiments involving
human judges revealed the following results: the simple noun phrase-based system
performs roughly as well as a state-of-the-art, corpus-trained keyphrase Extractor;
ratings for individual keyphrases do not necessarily correlate with ratings for sets
of keyphrases for a document; agreement among unbiased judges on the keyphrase
rating task is poor.
LAKE
Linguistic Analysis based Knowledge Extractor (LAKE) (D’Avanzo, 2005) is a
keyphrase extraction system. LAKE is based on a supervised learning approach
that makes use of the linguistic processing of the documents. The system works as
follows: first, a set of linguistically motivated candidate phrases is identified. Then,
a learning device chooses the best phrases. Finally, keyphrases at the top of the
ranking are merged to form a summary, or, in general, an index, depending on the
application. Treating the automatic keyphrase extraction as a supervised machine
learning task means that a classifier is trained by using documents annotated with
known keyphrases. The trained model is subsequently applied to documents for
which no keyphrases are assigned: each defined term from these documents is
classified either as a keyphrase or as a non-keyphrase. Both processes choose a set
of candidate keyphrases (i.e. potential terms) from their input document, and then
calculate the values of the features for each candidate. Two important issues are
how to define the potential terms, and which features of these terms are considered
discriminative. Like KEA, it uses Naïve Bayes as the learning algorithm, and TF
× IDF term weighting scheme and the first occurrence of a phrase as features.
Unlike KEA and Extractor, LAKE chooses the candidate phrases using linguistic
knowledge (D’Avanzo, 2004, 2005).
The candidate phrases generated by LAKE are sequences of Part of Speech tags
containing Multiword expressions and Named Entities. These sequences are
defined as “patterns” and stored in a pattern database6; from there on, the main
work is done by the learning device. The linguistic database makes LAKE unique
in its category.
LAKE has three main components: Linguistic Pre-Processor, Candidate Phrase
Extractor and Candidate Phrase Scorer. LAKE accepts a document as input7.
The document is processed first by the Linguistic Pre-Processor, which tags the
whole document, identifying Named Entities and Multiwords as well. Then
candidate phrases are identified based on the pattern database (Candidate Phrase
Extractor module in figure 2). Table 2 contains examples of candidate phrases
identified with this procedure in the energy domain.
Table 2: Examples of types of phrases and their patterns
6 Patterns consist of sequences of Part of Speech tags containing Named Entities and Multiwords.
7 The system has been designed with different pre-processing modules allowing it to process
different formats: txt, html, xml.
Type of Phrase | Pattern | Example | Head unit
Bi-Gram | AN | affordable energy | N (energy)
Bi-Gram | NN | energy conservation | N (conservation)
Tri-Gram | NPN | demand for energy | N (demand)
Tri-Gram | ANN | fossil fuel dependency | N (dependency)
Tri-Gram | VPN | to invest in alternatives | V (to invest)
Tri-Gram | APN | vulnerable to shortages | A (vulnerable)
Tri-Gram | NCN | disruptions & vulnerability | N (distr. – vuln.)
Four-Gram | ANPN | wasteful use of resources | N (use)
Four-Gram | NPAN | dependence on foreign oil | N (dependence)
Four-Gram | ANNN | liquid transportation fuels crisis | N (crisis)
Four-Gram | AANN | unstable foreign oil supplier | N (supplier)
Four-Gram | NPNN | cost of oil dependence | N (cost)
Fifth-Gram | ANPNN | economic cost of oil dependence | N (cost)
Sixth-Gram | ANPNNN | macroeconomic cost of oil market disruptions | N (cost)
Seventh-Gram | NPNPDNN | speed of changes in the climate system | N (speed)
Tenth-Gram | DNPAANPNNN | The world’s second largest emitter of greenhouse gas emissions | N (emitter)
Up to this point the process is the same for the training and extraction stages. In
the training stage, however, the system is furnished with annotated documents8.
The Candidate Phrase Scorer module is equipped with a procedure which looks,
for each author-supplied keyphrase, for a candidate phrase that can be matched,
identifying positive and negative examples. The model that comes out of this step
is then used in the extraction stage. LAKE has been extended for multi-document
summarization purposes.
To this end the KE ability of the system has been exploited again, adding a
sentence extraction module able to extract a 250-word summary from a cluster of
documents. Once keyphrases have been extracted for each document, the module
uses a scoring mechanism to select the most representative keyphrases for the
whole cluster. Once this list is identified, the module selects the sentences which
contain these keyphrases.
8 For which keyphrases are supplied by the author of the document.
LAKE is based on three main components: the Linguistic Pre-Processor, the
Candidate Phrase Extractor and the Candidate Phrase Scorer. In the following
sections there is a detailed description of the system.
Linguistic Pre-Processor
Every document is analyzed by the Linguistic Pre-Processor in the following three
consecutive steps: Part of speech analysis, Multiword recognition and Named
Entity Recognition.
1. Part of Speech Tagger
Linguists usually group the words of a language into classes which show similar
syntactic behavior. These classes represent the parts of speech (POS), also known
as syntactic or grammatical categories. Three important parts of speech are noun
(N), verb (V), and adjective (A). Nouns typically refer to people, animals, concepts
and things. The verb is used to express the action in a sentence. Adjectives describe
properties of nouns. In the following there are two possible tagged sentences
associated with the ambiguous sentence about visiting aunts.
Visiting/ADJ aunts/N-Pl can/AUX be/V-inf a/DET-Indef nuisance/N-Sg
Visiting/V-Prog aunts/N-Pl can/AUX be/V-inf a/DET-Indef nuisance/N-Sg
In the first sentence, “visiting” is an adjective that modifies the subject “aunts”. In
the second sentence, it is a gerund that takes “aunts” as an object.
The example shows that words may be assigned multiple POS tags, and the role of
the tagger is to choose the correct one. In the “aunts” example there is not enough
information in the sentence to decide between the two tags.
There are two main approaches to POS tagging: rule-based and stochastic.
A rule-based tagger tries to apply some linguistic knowledge to rule out sequences
of tags that are syntactically incorrect.
A rule, for instance, may be like this:
If an unknown term is preceded by a determiner and followed by a noun, then label
it an adjective.
While some rule-based taggers are entirely hand-coded, others leverage training
procedures on tagged corpora (Brill, 1994).
Stochastic taggers, on the other hand, also rely on training data, but use frequency
information or probability to disambiguate tag assignments. In its simplest version
a stochastic tagger, for instance, disambiguates words based on the probability
that a word occurs with a particular tag. This probability is typically computed
from a training set, in which words and tags have already been matched by hand.
Taggers based on Hidden Markov Models or Maximum Entropy represent more
advanced stochastic versions (Charniak, 1993).
Figure 2. The system architecture
The POS tagger adopted by LAKE is the TreeTagger, developed at the University
of Stuttgart (Schmid, 1994). The TreeTagger uses a decision tree to obtain reliable
estimates of transition probabilities. It determines the appropriate size of the
context (number of words) used to estimate the transition probabilities.
For example, to find the probability of a noun appearing after a determiner
followed by an adjective, we first check whether the previous tag is ADJ; if yes,
we follow the “yes” branch and check whether the tag before that was a
determiner; if “yes” again, we reach a probability for this occurrence.
2. Multiwords Recognition
The task of Multiwords Recognition (MR) is strictly related to Automatic Term
Recognition (ATR). MR based on MultiWordNet is a linguistic approach;
however, statistical approaches coming from the areas of collocation extraction
and IR also exist. Researchers on MR and ATR seem to agree that multiword
terms are mainly noun phrases, but their opinions differ on the type of noun
phrases they actually extract. Most systems rely on syntactic criteria and do not
use any morphological processes. An exception is Damerau’s work (1993). Justeson et al.
(1995) work on noun phrases, mostly noun compounds, including compound
adjectives and verbs albeit in very small proportions. They use the following
regular expression for the extraction of noun phrases
((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun
They incorporate the preposition of, showing however, that when of is included in
the regular expression, there is a significant drop on precision (this drop is too high
to justify the possible gains in recall). Their system does not allow any term
modification. Daille et al. (1994) also concentrate on noun phrases. Term
formation patterns for base Multi-Word Units (base-MWUs) consist mainly of
two elements (nouns, adjectives, verbs or adverbs).
The patterns for English are:
• Adj Noun
• Noun Noun
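Matching such term-formation patterns over a tagged token sequence can be sketched as follows (a simplified illustration assuming one-letter tags; the words and tags in the example are ours):

```python
import re

PATTERNS = ["AN", "NN"]  # Daille et al.'s base-MWU patterns for English

def match_patterns(tagged, patterns=PATTERNS):
    """Extract word sequences whose tag string matches a term-formation
    pattern; `tagged` is a list of (word, tag) pairs with one-letter tags."""
    tags = "".join(tag for _, tag in tagged)
    found = []
    for pat in patterns:
        # zero-width lookahead so that overlapping matches are found too
        for m in re.finditer(f"(?={re.escape(pat)})", tags):
            i = m.start()
            found.append(" ".join(w for w, _ in tagged[i:i + len(pat)]))
    return found

tagged = [("affordable", "A"), ("energy", "N"), ("conservation", "N")]
match_patterns(tagged)  # ['affordable energy', 'energy conservation']
```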
Bourigault (1992) also deals with noun phrases mainly consisting of adjectives and
nouns that can contain prepositions, and hardly any conjugated verbs. He argues
that terminological units obey specific rules of syntactic formation. His system
does not extract only terms. Dagan et al. (1994) claim that the extracted noun
phrases consist of one or more nouns that do not belong to a stoplist.
The frequency of occurrence of a potential multiword term is the most commonly
used statistic.
In LAKE, sequences of words that are considered single lexical units are detected
in the input document according to their presence in WordNet (Figenbaum et al.,
1963, Pianta et al., 2002). For instance, in the energy domain, the sequence fossil
fuels is transformed into the single token fossil fuel and the PoS tag found in
WordNet is assigned to it.
3. Named Entities Recognition
The task of Named Entity Recognition (NER) requires a program to process a text
and identify expressions that refer to people, places, companies, organizations,
products, and so forth. Thus the program should not merely identify the boundaries
of a naming expression, but also classify the expression, e.g., so that one knows
that Rome refers to a city and not a person.
NER is a subtask of Information Extraction. Different NER systems were
evaluated as part of the Sixth Message Understanding Conference in 19959. The
target language was English. The participating systems performed well. However,
many of them used language-specific resources for performing the task and it is
unknown how they would have performed on a language other than English
(Palmer et al., 1997). Since 1995 NER systems have been developed for some
European languages and a few Asian languages. Palmer et al. (1997) and
Cucerzan et al. (1999) are two seminal works that applied one NER system to
different languages.
9 http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
CoNLL-2002 [10] hosted a language-independent named entity recognition shared
task, where attention was concentrated on four types of named entities:
persons, locations, organizations, and names of miscellaneous entities that do not
belong to the previous three groups.

The participants, who were offered training and test data for at least two languages,
were asked to develop named-entity recognition systems including a machine
learning component. Among the twelve systems that took part in the
competition, many used a variety of machine learning techniques.
The system of Carreras et al. (2002), which used AdaBoost applied to
decision trees, outperformed all the other systems by a significant margin on both
the Spanish and the Dutch test data. Sixteen systems participated,
instead, in the CoNLL-2003 shared task. For English, the combined classifier of
Florian et al. (2003) achieved the highest overall F1 score.

An important feature of the best system, which other participants did not use, was
the inclusion of the output of two externally trained named entity recognizers in the
combination process. Florian et al. also obtained the highest F1 score on the
German data.

[10] CoNLL is the acronym of Conference on Natural Language Learning. See
http://www.cnts.ua.ac.be/conll2002/ for details.
For named entity recognition, D'Avanzo (2005) used LingPipe [11], a suite of Java
tools designed to perform linguistic analysis on natural language data. The tool
includes a statistical named-entity detector, a heuristic sentence boundary detector,
and a heuristic within-document co-reference resolution engine. Named entity
extraction models are included for English news and can be trained for other
languages and genres.

LingPipe was chosen, first of all, because its implementation allows it to be
easily integrated into the LAKE system. Besides, training named entity detection is
very fast: for example, training on CoNLL 2002 Spanish, consisting of 370K tokens
in a line-based format, took 23 seconds. The named entity recognizer, using the
English news model and given an array of tokens, produces the array of tags at
100K tokens/second.
Head Extraction
The linguistic principle of headedness (Arampatzis et al., 2000) claims that any
phrase has a single word as its head: the main verb in verb phrases, and the noun
in noun phrases. The other components of a phrase are modifiers. Following the
approach of Arampatzis et al. (2000), every phrase can be represented by a phrase
frame

PF = [h,m]

where the head h gives the central concept of the phrase and the modifiers m serve
to make it more precise. Following this principle, the system extracted the head of
each candidate phrase using the POS information.
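A head extractor in the PF = [h,m] spirit can be sketched as below; since the text only states that the head is found using POS information, the exact rule (last noun for noun phrases, first verb for verb phrases) is an illustrative simplification:

```python
def phrase_frame(tagged_phrase):
    """Split a POS-tagged phrase into PF = [h, m]: head and modifiers.
    Simplifying assumption: the head is the first verb (verb phrases)
    or, failing that, the last noun (noun phrases)."""
    nouns = [w for w, t in tagged_phrase if t.startswith("N")]
    verbs = [w for w, t in tagged_phrase if t.startswith("V")]
    if verbs:
        head = verbs[0]
    elif nouns:
        head = nouns[-1]
    else:
        head = tagged_phrase[-1][0]
    # Everything that is not the head is treated as a modifier.
    modifiers = [w for w, _ in tagged_phrase if w != head]
    return head, modifiers
```

For "fossil fuel dependency" this yields the head "dependency" with modifiers "fossil" and "fuel".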
[11] LingPipe is free, available at http://www.alias-i.com/lingpipe/index.html
Then, for each extracted head, the two features TF × IDF and first occurrence are
calculated. Afterward, the next step consists of either training or extraction,
depending on the task.
Candidate Phrase Extractor
Syntactic patterns that describe either a precise, well-defined entity or a concise
event/situation were selected as candidate phrases. Once all uni-grams, bi-grams,
tri-grams, and four-grams have been extracted by the linguistic pre-processor,
they are filtered with the patterns defined above.
Candidate Phrase Scorer
In this phase a score is assigned to each candidate phrase in order to rank it and to
allow the selection of the most appropriate phrases as representative of the original
text.
The score is based on a combination of TF × IDF and first occurrence. However,
since candidate phrases do not appear frequently enough in the collection, the
frequency of a candidate phrase in the whole collection is not, by itself,
significant. As learning algorithm, the system uses the Naïve Bayes classifier
provided by the WEKA package (Witten and Frank, 1999).
The model obtained is reused in the subsequent steps. When a new document or
corpus is ready, we use the pre-processor module to prepare the candidate phrases;
the model obtained during training is then used to score them. The pre-processing
part is the same in this case: using the trained model, we extract nouns and verbs
from the documents, and then we keep the candidate phrases containing them.
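The two features can be sketched as follows; the functions are illustrative and do not reproduce LAKE's exact formulas, which feed these features to a Naïve Bayes classifier rather than combining them by hand:

```python
import math

def tf_idf(phrase, document, collection):
    """TF x IDF of a candidate phrase: its frequency in the document,
    discounted by how many documents of the collection contain it."""
    tf = document.count(phrase) / max(len(document), 1)
    df = sum(1 for doc in collection if phrase in doc)
    return tf * math.log(len(collection) / df) if df else 0.0

def first_occurrence(phrase, document):
    """Relative position of the first occurrence (0 = document start)."""
    return document.index(phrase) / max(len(document), 1)
```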
2.4. Linguistic and Statistical Phrases: some remarks
In the past, many attempts have been made to use semantically richer features in
Information Retrieval tasks. In particular, a number of authors have investigated
the use of phrases, instead of or in addition to individual words, as features.
Caropreso et al. (2001) identified a number of advantages of using
statistical phrases with respect to syntactic ones:
• they may be recognized by means of more robust and less computationally
demanding algorithms;
• the effect of irrelevant syntactic variants can be factored out;
• uninteresting phrases (e.g. tall professor) tend to be filtered out from
interesting ones (e.g. associate professor);
• the use of lexical atoms, such as “hot dog”, to replace single words for
indexing would increase both precision and recall;
• the use of syntactic phrases, such as “junior college” to supplement single
words would increase precision without hurting recall;
• with phrases as index terms, a document that contains a phrase would
be ranked higher than a document that contains just its constituent words in
unrelated contexts.
Arampatzis et al. (2000) have performed an evaluation of a linguistically motivated
indexing model. The approach taken by the authors is based on part-of-speech
(POS) tagging and syntactic pattern matching. Different experiments have been
performed with a representation based on combinations of different POS
categories. These representations combine elements belonging to the category of
nouns with those of adjectives, verbs, and adverbs.
Table 3: Results of the experiments performed with part-of-speech tagging by Arampatzis et al.
(2000). In the last column the percentage of feature reduction with respect to the baseline is reported.
The different representational choices are compared to the baseline (i.e. using all
single words as index terms).
To evaluate the different indexing schemes the authors have measured the
performance in a text categorization task. The experimental system is based on the
vector space model, where terms are weighted in a TF × IDF fashion, and
classifiers are constructed automatically using Rocchio’s relevance feedback
method (Salton, 1988). The experiments have been performed on the Reuters-
21578 text categorization collection (Sanderson, 1994) using 90 out of 135
categories and all stemmed words as baseline.
Table 3 summarizes the results discussed below. The experiments with unstemmed,
stemmed and lemmatized words as index terms showed improvements in average
precision of less than 5%. The experiments based on indexing sets derived from
combinations of part-of-speech categories also presented improvements over
the baseline; moreover, a 20.8% reduction of the feature space was obtained.
The authors concluded that the union of nouns and adjectives performs best,
the addition of verbs worsens the performance, while adverbs do not make any
difference.
N-grams are another kind of index term. In their work, Meng et al. (2002) report
an attempt to improve categorization performance by automatically extracting and
using phrases, especially two-word phrases (hereafter bigrams). The extracted
bigrams have been used in addition to (and not in place of) single words.

The experiments have been performed on two test corpora: a collection of web
pages pointed to by the Yahoo! Science hierarchy and the Reuters-21578 corpus. In
the following we will focus only on the results regarding the Reuters-21578 corpus,
which is more similar to our experimental corpus.
To measure the performance, the authors used recall and precision. In
particular, they used the break-even point (BEP), the point where recall equals
precision, which is often used as a single summarizing measure for
comparing results. Moreover, the F1 measure was used for evaluating the
performance.
BEP increases in all categories under examination (12 out of 135), with the highest
value at 21.4%. However, the performance as measured by F1 was mixed: while
the largest improvement reached 27.1%, 5 out of 12 categories showed a drop.
Table 4 shows the recall and precision rates before and after adding bi-grams. The
number of bi-grams the algorithm finds is no more than 2% of the number of single
words, thus avoiding the problem of high dimensionality. Note that when bigrams
alone were used, precision decreased drastically, while recall increased
substantially.
Table 4: Recall and precision reported by Meng et al. (2002)
When both uni-grams and bi-grams were used, recall improved without a significant
decrease in precision. This means (D'Avanzo, 2005) that bigrams are very good at
identifying correct positives but, at the same time, introduce a large number of
false positives, suggesting that bigrams should be used as complements to the
unigrams.
Chapter 3
The Energy domain case study
The energy domain is a large and rather unstructured domain. Many geopolitical
and economic problems are caused by energy, and many authorities and
organizations treat energy as a problem, but the domain knowledge is very
fragmentary. It can be very difficult to model, and for this purpose we built an
ontology to support better knowledge management. We started by retrieving web
documents about "Energy", first using search engines and then using crawlers.
These documents allowed us to extract keyphrases used as concepts in order to
build the ontology, both manually and semi-automatically.
There are two main approaches to domain modeling (Poesio, 2005):
• Top down (Formalist)
One starts from a pre-existing class hierarchy in order to deduce the concepts
and relations belonging to the ontology, down to the lowest level of subclasses,
represented by instances, i.e. the simplest, non-complex and non-articulated
concepts.
• Bottom up (Empiricist, or corpus-based)
One starts from the bottom, analyzing a corpus (a collection of documents) and
extracting the concepts, relations and lexical resources useful to populate the
ontology, in order subsequently to induce the most generic superclasses.
Energy Domain Modeling is a "middle up-down" one because, on the one hand, we
were supported by an energy domain expert, Dr. Brenda Shaffer [12], who suggested
the most relevant superclasses on which to focus our subsequent research, and, on
the other hand, we extended the ontology taxonomy by collecting a large corpus of
texts and extracting concepts from them in order to populate the ontology.

[12] University of Haifa.
The statistics about the Energy Ontology are the following:
• Corpus: 200 documents
• Classes: 53 concepts
• Instances: 121 concepts
• Properties: 30 relations (inverse functional ones included)
The Energy Ontology is more developed in width than in depth, as shown in the
following figure.
Figure 3. Energy Ontology Classes
For instance (figure 4), in our ontology we can describe "Saudi Arabia" as an
instance of the "Country" class and "Oil" as an instance of the "Fossil Fuel" class.
We can then link these two instances using the relation "to be the major producer
of" in order to infer that "Saudi Arabia is one of the major producers of Oil".

In a preliminary phase we built the ontology manually; then, in a successive step,
we enriched it with new concepts, relations and lexical resources using automatic
systems. Below we describe the manual stage.
We used Protégé 3.3.1, an ontology editor developed at Stanford University, to
build the Energy Ontology. "Energy_Domain" is our superclass, a sort of root node
that contains all the other subclasses, but on the same hierarchy level there are
also two classes:
• Risks
• Solutions
They have been included at the same level because they are transversal across the
ontology. In this way we could create horizontal links between classes and
instances in the hierarchy tree: these links are properties, binary relations
expressed in natural language, like the one described in the previous example,
"to be one of the major producers of".
Figure 4. Example of relation between two instances.
CLASS: COUNTRY (instances: Saudi Arabia, Russia, etc.) --[to be one of the major producer of]--> CLASS: FOSSIL FUELS (instances: Oil, Natural Gas, etc.)
There are different concepts in Risks within the ontology, such as economic
problems, e.g. "Market Instabilities" or "Economic Cost of Oil"; geopolitical and
technical problems, e.g. "Dispute between Russia and Belarus about Natural Gas"
or "Dispute between Venezuela and U.S. about Oil"; or related topics like
"Terrorism", "Sabotage" and "Natural Disasters".
Regarding Solutions, instead, there are two subclasses containing "Long Term
Actions" and "Short Term Actions". In the latter, for example, we inserted
"Infrastructure Improvement", "Use of Renewable Sources", etc.
Every class or instance is linked to the others using properties (including inverse
functional ones, figure 5) in order to obtain a very dense network (figure 6),
useful to model the knowledge domain of energy and to allow us to state, for
example, that "Infrastructure Improvement involves Russia".
Figure 5. Example of inverse functional properties
INSTANCE: Infrastructure Improvement (CLASS: SOLUTIONS > LONG TERM ACTIONS) --[involves]--> INSTANCE: Russia (CLASS: COUNTRY); inverse property: is involved in
Figure 6. Energy Ontology Network
The Energy Domain Class consists of six classes (as illustrated by figure 7):
Figure 7.
1. Country
In this class we created two subclasses, OPEC Countries and NON-OPEC
Countries, for example:
• OPEC: Algeria, Angola, Indonesia, Iran, Iraq, Kuwait, Libya,
Nigeria, Qatar, Saudi Arabia, United Arab Emirates, Venezuela.
• NON-OPEC: Belarus, Brazil, Canada, China, Georgia, India,
Malaysia, Mexico, Norway, Russia, Turkmenistan, Ukraine, United
Kingdom, United States, Uzbekistan.
2. Energy Security
It has four main subclasses:
• Friendliness To The Environment;
• Reliability of Supply, which has two instances ("Disruption" and
"Vulnerability") and other subclasses related to "Cut-off in supply";
• Resources Affordability, which has two subclasses:
  - "Emergency Stocks", with "Strategic Petroleum Reserve" (SPR) as instance;
  - "Scarce Resource Affordability";
• Scarce Resources Dependency, which has a subclass (Fossil Fuel
Dependency) with two instances:
  - Natural Gas Dependency;
  - Oil Dependency.
Energy Security is the class on which we focused our experiments because it treats
important economic and geopolitical problems.
3. Energy Sources
We classified all energy types in this class, distinguishing three types of
sources: Primary and Secondary, subdivided in turn into Renewable/Non-
Renewable, and Nuclear. We chose this distinction because it is the most
widely used.
A primary source of energy is one that already exists in nature and can be
used directly, or converted or re-directed into a form of energy that satisfies
our needs. Inside Non-Renewable, for example, there are all the Fossil Fuels,
such as Anthracite, Bituminous Coal, Coal, Graphite, Lignite, Liquefied
Petroleum Gas, Natural Gas, Oil, Peat, Propane.
Inside Renewable, instead: Biomass, Corn, Geothermal, Photovoltaic,
Solar, Solid Waste, Sugar Beet, Sugar Cane, Thermal, Waste, Water, Wind,
Wood.
Secondary energy sources, such as electric power or refined fuels, do not
exist in nature but can be produced from the primary energy sources.
Secondary sources are important because they are frequently easier to use
than the primary sources from which they are derived. In this class there are
instances such as:
• Non-Renewable: Coke, Diesel Fuel, Gasoline, Hydrogen by Coal,
Hydrogen by Fossil Fuel, Hydrogen by Natural Gas;
• Renewable: Alcohol Fuel, Biofuel, Ethanol, Hydrogen by Biomass,
Hydrogen by Solar, Hydrogen by Water.
Within the Nuclear class we identified two subclasses:
• Nuclear Energy Proliferation;
• Nuclear Weapons Proliferation.
Electricity, for example, is the flow of electrical power or charge. It is a
secondary energy source, which means that we get it from the conversion of
other sources of energy, like Coal, Natural Gas, Oil, Nuclear Power and
other natural sources, which are called primary sources. The energy sources
we use to make electricity can be renewable or non-renewable, but
electricity itself is neither renewable nor non-renewable; it depends on the
transformation process (EIA, Energy Information Administration) [13].
A similar case of difficult classification concerns Hydrogen: it can be
classified as Non-Renewable when it is extracted from hydrocarbons and as
Renewable when it is obtained from renewable elements (water, solar…).
This is the largest class; in fact it has many properties useful to describe
concepts and their relations, the most relevant being, for example:
Coal (Sources class) causes (property) Changes in Climate (Environment
class) AND is used for (property) Cooking OR Heating (Residential class).
We can use the Boolean operators "AND" and "OR", for example, in order
to add more information or to create more horizontal links.
4. Energy Use
It contains five subclasses corresponding to the sectors into which the
domain literature divides society: Commercial, Electric, Industrial,
Residential, Transportation. For instance, in "Residential" we can have
"Cooking" and "Heating" as instances.

[13] http://www.eia.doe.gov/kids/energyfacts/sources/electricity.html
5. Environmental Consequences
We identified one subclass, "Changes in Climate", which includes many
aspects of environmental impact, each linked to the Energy Sources class
through properties such as "is caused by".
In general we can say that "Changes in Climate" "includes" (property)
many individuals: Flooding, Increased Rain, Global Warming, etc.; in this
manner we have linked a class with its instances. But an important
inference concerns the horizontal links between this class and that of
Energy Sources:

Changes in Climate is caused by: Natural Gas, Biomass, Coal, …

There are many other environmental aspects, for example:
Bird Flight Patterns, CO2 Emissions, Damage to Views, Deforestation,
Desertification, Droughts, Flooding, Global Warming, Greenhouse Gas
Emissions, Higher Global Temperatures, Increased Rains, Large Land Use,
Pollution, Radioactive Off-Scouring.
6. Infrastructure
It includes ten instances that represent different means of energy transport
or supply structures: Barge, Drilling Equipment, Heliostats, Methane
Pipelines, Oil Tanker, Pipelines, Refinery, Ship, Turbine, Train.
3.1 Lexical Acquisition for Ontology population
3.1.1 Ontological organization
There are several methods to organize words within a lexicon. In a dictionary, for
example, words are organized alphabetically to provide easy access, but the
alphabetical order is not very helpful for processing semantic properties; a more
intuitive way is to organize words according to their meaning. A classification
certainly better represents the structure of our knowledge in a specific domain and
is more adequate for semantic processing. Grouping and classifying words in an
ontology means identifying hyponyms and hypernyms (inverse relationships), i.e.,
respectively, more specific and more general terms. For example, in the energy
domain "oil" is a hyponym of "energy source" (its hypernym); there is an "IS A"
relation: "oil IS A type of energy source". Another important relation is called
meronymy; its opposite is holonymy. For each word (concept) we can link parts to
the whole using the relation "HAS A". In our case "economic cost of oil" means
"oil HAS AN economic cost". But we can create a double relation using the same
keyphrase, depending on the purpose: "cost of oil IS AN economic cost". The
relations explained above help us build a hierarchy tree, but we can also enrich a
domain ontology using relations that link concepts across different levels of the
hierarchy. Figure 8 shows hierarchy links (full arrows) and relation links (dashed
arrows), which together represent the semantic network.
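The IS-A and HAS-A organization can be sketched with two small relation tables; the entries are illustrative, drawn from the examples above:

```python
# Minimal taxonomy store: IS-A links child -> parent, HAS-A links whole -> parts.
ISA = {"oil": "fossil fuel", "fossil fuel": "energy source"}
HASA = {"oil": ["economic cost"]}

def hypernyms(concept):
    """Walk the IS-A links upward and return all hypernyms, nearest first."""
    chain = []
    while concept in ISA:
        concept = ISA[concept]
        chain.append(concept)
    return chain

def is_a(concept, ancestor):
    """True if `ancestor` is reachable from `concept` through IS-A links."""
    return ancestor in hypernyms(concept)
```

With these tables, "oil IS A type of energy source" follows transitively from "oil IS A fossil fuel" and "fossil fuel IS AN energy source".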
Figure 8. Semantic Network: Hierarchy (full arrows) and Relations (dashed arrows).
Ex. 1
“Russia IS A Country of Energy Domain and EXPORTS Gas that IS A type of
Energy Source”
Ex. 2
“Saudi Arabia IS A Country of Energy Domain and EXPORTS Oil that IS A type
of Energy Source”
The main units in these sentences are called head units, consisting of the verbs
"IS" and "EXPORTS", because they represent fixed elements, as shown below:
“Saudi Arabia/Russia EXPORTS Oil/Gas”
(Figure 8 depicts the tree Energy Domain > Countries (Russia, China, Saudi Arabia) and Energy Domain > Energy Sources (Oil, Gas), with "is a" hierarchy links and "exports" relation links from countries to sources.)
"EXPORTS" selects "Saudi Arabia" or "Russia" (as subject) on the left and "Oil" or
"Gas" (as object) on the right. Subject and object are grammatical categories, but
they can also be called semantic roles, respectively "Exporter" and
"Thing Exported". In our examples the role of "Exporter" is always filled by
a "Country", while that of "Thing Exported" by an "Energy Source". In this
way we obtain fixed linguistic distributions, with reference to the linguistic theory
of Operators and Arguments (Harris, 1970, 1976) and in accordance with the
selectional restrictions principle, which helps us reduce syntactic and semantic
ambiguity during parsing.
An example of exhaustive lexical organization: WordNet
An example of exhaustive lexical organization is WordNet (Miller, 1995;
Fellbaum, 1998), a lexical database of English. WordNet is built around a lexical
matrix (figure 9), with the word forms in the columns and the word meanings in
the rows: the forms that share a meaning define a synset, a set of synonyms, while
a form that appears under several meanings illustrates the polysemy problem.
                    Word forms
                F1      F2      …      Fn
Word meanings
   M1          E1,1    E1,2
   M2                  E2,2
   …
   Mm                                 Em,n
Figure 9. The lexical matrix (Miller et al. 1993)
F1 and F2 are synonyms (both have meaning M1) and F2 is polysemous (it has
meaning M1 and M2).
3.1.2 Parsing with Case Grammar
Some languages, like Latin, Russian and German, indicate grammatical functions in
a sentence by a set of inflections, the cases, also related to semantic or thematic
roles (Fillmore, 1968). Parsing with the case grammar formalism transforms a
sentence into a kind of logical form: the frame (Nugues, 2006). The predicate is the
main verb, and its arguments are the roles. The parsing process identifies nouns,
tenses, adverbs, etc. and then, according to the verb, distinguishes obligatory cases
from optional cases. Each case is assigned at most one value and is constrained by
an ontological type. Figure 10 shows an example using the following sentence:
“Saudi Arabia exports oil in Italy with pipelines”
Role                          Type          Value
Agent (exporter)              Obligatory    Saudi Arabia
Objective (thing exported)    Obligatory    oil
Where                         Optional      in Italy
How                           Optional      with pipelines

Figure 10.
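The frame of figure 10 can be derived from the sentence with a toy filler; the sketch below hard-codes the verb "exports" and recognizes the optional cases by their introducing prepositions, so the function, the role names and the preposition list are illustrative assumptions, not an actual case-grammar parser:

```python
def parse_export(tokens):
    """Fill the case frame of 'exports' from a flat token list:
    agent (obligatory, subject), objective (obligatory, object),
    where / how (optional, introduced by 'in' / 'with')."""
    v = tokens.index("exports")
    frame = {"agent": " ".join(tokens[:v])}
    rest = tokens[v + 1:]
    # Cut the remainder at the prepositions that introduce optional cases.
    cuts = [i for i, t in enumerate(rest) if t in ("in", "with")] + [len(rest)]
    frame["objective"] = " ".join(rest[:cuts[0]])
    for i in cuts[:-1]:
        role = "where" if rest[i] == "in" else "how"
        frame[role] = " ".join(rest[i + 1:min(c for c in cuts if c > i)])
    return frame
```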
An example of Case Grammar application: EVAR
The EVAR application (Mast et al., 1994) is a spoken dialogue system that answers
inquiries about train connections in Germany using a dedicated ontology. It uses a
partial parser to detect syntactic groups such as noun groups, preposition groups,
infinitive groups, verb groups, and time and date expressions.
EVAR Hello, this is the automatic Intercity information system
USER I want to go to Hamburg
EVAR You want to go to Hamburg
Where do you want to start?
USER From Erlangen
EVAR You want to start from Erlangen
Which day?
USER Tomorrow morning
EVAR You want to travel on Thursday 26th of June between 6 and 12 a.m.
USER Yes
EVAR To Hamburg you can start from Erlangen at 11.01, arrival in
Hamburg…
Shall I repeat?
USER No
EVAR Would you like more information?
USER No
EVAR Good bye
3.1.3 Lexical Acquisition Fundamentals
Collocations
A collocation is an expression consisting of two or more words that corresponds to
some conventional way of saying things (Manning and Schütze, 1999); it has the
characteristics of a syntactic and semantic unit, and its exact and unambiguous
meaning or connotation cannot be derived from the meaning or connotation of its
components (Choueka, 1988). Collocations include noun phrases ("fossil fuel"),
phrasal verbs ("to cut off (supply)") and other stock phrases ("disruption &
vulnerability").
Collocations are characterized by:
• Limited compositionality
An expression is compositional if its meaning can be predicted from the meaning
of its parts. For example, "acid rain" is a type of rain, but we must consider the
expression as a single word.
• Limited substitutability
We cannot substitute other words for the components of a collocation, even if
they are synonyms: "acid rain" does not mean the same as "sour rain"; the latter
is very unusual.
• Limited modifiability
Many collocations cannot be freely modified with additional lexical material or
grammatical transformations. "Strategic Petroleum Reserve" (also called by the
acronym SPR) is a strict collocation because we cannot modify, for example, the
adjective "strategic" into a relative clause ("Petroleum Reserve that is
Strategic"): the result is unusual even if its meaning does not change in the
context, and the existence of the acronym is a confirmation of this.
The rules shown above hold all the more strongly in a technical domain: during
the terminology extraction process, technical terms are more specific than in a
general language context, so every compound term is a collocation.
To identify collocations in a text or in a corpus, one needs to calculate the
frequency, counting how many times two (or more) words occur together.
In this way we obtain a list of Part of Speech (POS) patterns, ordered by
frequency (Table 5).
Table 5. Part of Speech patterns and frequencies

Type of phrase   Pattern       Example                                                          Frequency
Bi-gram          AN            affordable energy                                                10
                 NN            energy conservation                                              5
Tri-gram         NPN           demand for energy                                                7
                 ANN           fossil fuel dependency                                           13
                 VPN           to invest in alternatives                                        4
                 APN           vulnerable to shortages                                          3
                 NCN           disruptions & vulnerability                                      17
Four-gram        ANPN          wasteful use of resources                                        4
                 NPAN          dependence on foreign oil                                        8
                 NPNN          cost of oil dependence                                           5
Five-gram        ANPNN         economic cost of oil dependence                                  3
Six-gram         ANPNNN        macroeconomic cost of oil market disruptions                     3
Seven-gram       NPNPDNN       speed of changes in the climate system                           2
Ten-gram         DNPAANPNNN    the world's second largest emitter of greenhouse gas emissions   1
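The frequency count behind Table 5 can be sketched for bigrams as follows, assuming tokens tagged with one-letter POS codes (A = adjective, N = noun); the function name and default patterns are illustrative:

```python
from collections import Counter

def pattern_bigrams(tagged_tokens, patterns=("AN", "NN")):
    """Count adjacent word pairs whose POS-tag pattern is in `patterns`."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 + t2 in patterns:
            counts[f"{w1} {w2}"] += 1
    return counts
```

Run over a tagged corpus, this yields exactly the kind of frequency-ordered pattern list shown in the table; longer patterns (tri-grams and beyond) follow the same scheme with a wider window.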
The general goal of lexical acquisition is to develop algorithms and statistical
techniques for filling the holes in existing machine-readable dictionaries by looking
at the occurrence patterns of words in large text corpora.
First we have to define what a lexicon is: the part of the grammar of a language
which includes the lexical entries for all the words in the language and which may
also include other information (Manning and Schütze, 1999).
A lexicon is a kind of expanded dictionary formatted in a machine-readable format.
Traditional dictionaries are written for human usage, so quantitative information is
completely absent. An important task of lexical acquisition for Statistical Natural
Language Processing (NLP) is to augment dictionaries with quantitative
information (Manning and Schütze, 1999).
In the following pages we will illustrate several lexical acquisition problems besides
collocations: selectional preferences, subcategorization, and semantic
categorization.
1. Selectional preferences
Most verbs prefer specific arguments; for example, "export" takes an "exporter" as
subject and a "thing exported" as object (in our specific energy domain, exporter =
Country, thing exported = energy source). These semantic constraints are called
selectional preferences or restrictions. The acquisition of selectional preferences is
important in Statistical NLP for several reasons:
• If a new word is missing from our machine-readable dictionary, we can infer
part of its meaning from selectional restrictions.
For example:
"Saudi Arabia exports oil"
"Country A exports lignite"
if "lignite" is a new and unknown word but it occurs with the verb "export" in the
same context as "oil", then we can infer that "lignite" is an energy source just
like "oil".
• Selectional restrictions are very useful to rank the possible parses of a
sentence: we can give higher scores to parses where the verb has its usual
arguments than to atypical ones. Semantic regularities captured in
selectional preferences are often quite strong and can be acquired more
easily from corpora than other types of semantic information (such as word
meaning).
The selectional preference model proposed by Resnik (1996) uses two notions:
1. selectional preference strength,
which measures how strongly the verb constrains its argument;
2. selectional association,
which measures the association between a verb and a class of
correlated arguments. For example:
"Saudi Arabia exports oil"
"Country A exports competences"
"exports oil" is a different use with respect to "exports competences": in a
disambiguation process we associate "competence" with the class containing similar
nouns, in order to verify that "competence" does not belong to the same class as "oil"
(the energy source class).
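The association measure can be approximated as below; a simplified sketch that replaces Resnik's information-theoretic score with a plain conditional-probability ratio, over invented (verb, object-class) counts:

```python
from collections import Counter

# Observed (verb, object-class) pairs; counts and classes are invented.
OBSERVATIONS = ([("export", "energy_source")] * 8
                + [("export", "skill")] * 1
                + [("teach", "skill")] * 6)

def selectional_association(verb, cls, observations):
    """P(class | verb) / P(class): values above 1 mean the verb prefers
    the class. A crude stand-in for Resnik's KL-based association."""
    verb_counts = Counter(c for v, c in observations if v == verb)
    class_counts = Counter(c for _, c in observations)
    p_cls_given_verb = verb_counts[cls] / sum(verb_counts.values())
    p_cls = class_counts[cls] / len(observations)
    return p_cls_given_verb / p_cls
```

On these toy counts, "export" prefers the energy-source class over the skill class, mirroring the "exports oil" vs. "exports competences" contrast above.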
2. Verb subcategorization
The verb "export" has two arguments, "exporter" (subject) and "thing exported"
(object), which define the phrase structure Noun Phrase (NP) + Verb + Noun Phrase,
as in "Saudi Arabia exports oil".
A verb expresses its semantic arguments with different syntactic means. The set of
syntactic categories that a verb can appear with is called its subcategorization frame,
and it defines a phrase structure. For example:

Example                           Frame    Functions               Verb
"Saudi Arabia exports oil"        NP NP    subject, object         export
"China invests in alternatives"   NP NP    subject, indirect obj.  invest
…

Most dictionaries do not contain information on subcategorization frames, and the
information on most verbs is incomplete. A simple and effective algorithm for
learning some subcategorization frames was proposed by Brent (1993) and
implemented in a system called Lerner.
The system decides whether a verb V takes a frame F in a corpus in the
following two steps:
a. Cues
Define a regular pattern of words and syntactic categories which indicates the
presence of the frame with high certainty. For example, we can select the frame
"NP NP" (as shown above).
b. Hypothesis testing
Initially we assume the null hypothesis that the frame is not appropriate for the
verb; we reject it if the cue defined previously is identified in the corpus. The
system therefore analyzes the corpus, counting how many times the cue for
the frame occurs with the verb.
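The two Lerner steps can be sketched as follows, with the cue reduced to a tag sequence right after the verb and a simple count threshold standing in for Brent's statistical hypothesis test; this is an illustrative simplification, not Brent's actual procedure:

```python
def learns_frame(verb, cue, tagged_sentences, min_cues=2):
    """Null hypothesis: the verb does not take the frame. Reject it once
    the cue (a tuple of POS tags right after the verb) has been observed
    at least `min_cues` times in the corpus."""
    hits = 0
    for sent in tagged_sentences:
        for i, (word, _) in enumerate(sent):
            if word == verb:
                tags = tuple(t for _, t in sent[i + 1:i + 1 + len(cue)])
                hits += tags == cue
    return hits >= min_cues
```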
3. Semantic Categorization
Attachment Ambiguity
Often there are keyphrases that can be attached to two or more different elements in
the structure. A keyphrase, a sequence of two or more words, can have the left-
branching structure [(N N) N], as in [(transportation fuel) crisis], i.e. a "crisis"
concerning "transportation fuel", or the right-branching structure [N (N N)], as in
[woman (aid worker)], i.e. an "aid worker" who is a "woman".
Semantic Similarity
The greatest achievement of lexical acquisition would be to acquire meaning
automatically. In practice, however, much current work focuses on semantic
similarity, measuring how similar a new word is to a known word.
It is mostly used for generalization, under the assumption that semantically similar
words behave similarly (as shown above for the selectional preferences “Saudi Arabia
exports oil” and “Country A exports lignite”).
Another use of semantic similarity is class-based generalization, which considers the
whole class of elements of which the word of interest is most likely a member.
K nearest neighbors (KNN) classification, instead, uses semantic similarity in
order to classify a new element into a defined category. We first need a training set
of elements, each assigned to a category; the system then assigns a new element to
the category that is most prevalent among its k nearest neighbors.
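A minimal sketch of KNN classification driven by a semantic similarity measure; cosine similarity over context-count vectors is one common choice, and the training data below are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(new_vec, training, k=3):
    """Assign new_vec the category most prevalent among its k nearest
    neighbors, where nearness is semantic similarity (here: cosine)."""
    ranked = sorted(training, key=lambda ex: cosine(new_vec, ex[0]), reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy context vectors (word -> co-occurrence count); the data are invented:
training = [
    ({"export": 3, "barrel": 2}, "fossil fuel"),
    ({"export": 2, "pipeline": 1}, "fossil fuel"),
    ({"solar": 4, "turbine": 1}, "renewable"),
    ({"wind": 2, "turbine": 3}, "renewable"),
]
print(knn_classify({"export": 1, "barrel": 1}, training, k=3))  # fossil fuel
```

A new word whose contexts resemble those of “export”/“barrel” is assigned to the category of its most similar known neighbors.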
3.1.4 Lexical Acquisition Evaluation
An important recent development in NLP has been the use of rigorous standards for
the evaluation of NLP systems.
An evaluation typically uses two measures, precision and recall, computed against a
set of targets (documents for which the correct keyphrases are known).
The system then accepts or rejects the keyphrases it extracts from a selected set of
documents, matching the results obtained against the training set.
Precision measures the number of automatically extracted keyphrases that match
manually extracted ones, over the total number of automatically extracted
keyphrases: here 3/7. Recall measures the number of automatically extracted
keyphrases that match manually extracted ones, over the total number of manually
extracted keyphrases: here 3/10.
Figure 11 shows the process:
Figure 11.

Manually extracted (training set): affordable energy, demand for energy, energy
conservation, energy needs, environmental issues, fossil fuel dependency, fossil
fuel depletion, to invest in alternatives, oil depletion, peak oil

Automatically extracted: cut off in supply, demand for energy, economic cost of
oil, environmental issues, fossil fuel depletion, oil consumption, renewable energy
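Using the two keyphrase lists of Figure 11, the 3/7 and 3/10 figures can be checked directly:

```python
manual = {"affordable energy", "demand for energy", "energy conservation",
          "energy needs", "environmental issues", "fossil fuel dependency",
          "fossil fuel depletion", "to invest in alternatives",
          "oil depletion", "peak oil"}
automatic = {"cut off in supply", "demand for energy", "economic cost of oil",
             "environmental issues", "fossil fuel depletion",
             "oil consumption", "renewable energy"}

matches = manual & automatic                # keyphrases found by both
precision = len(matches) / len(automatic)   # 3/7
recall = len(matches) / len(manual)         # 3/10
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.43, recall = 0.30
```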
3.1.5 The role of lexical acquisition in statistical NLP
Lexical acquisition is very important in Statistical NLP for the following reasons:
1. cost of manually building lexical resources
Manual lexical analyses are more accurate, but they are expensive, and humans are
often bad at collecting quantitative information.
2. productivity of language
Natural languages are in a constant state of flux, adapting to a changing world
by creating names and words to refer to new things; lexical resources have to be
updated with these changes.
3. lexical coverage
Many dictionaries are incomplete because some categories of lexical entries, such
as proper nouns, foreign nouns, codes, abbreviations, etc., have poor coverage in
a common dictionary, so automatic acquisition is useful for updating dictionaries.
3.2 Experiments with Energy domain ontology
3.2.1 Methods and tools
The experiment described in this work consists of the following steps. We started
from a corpus of 200 documents about the Energy domain in order to manually build
a domain ontology. To be useful, an ontology needs to be continuously updated with
new concepts, relations and lexical resources, which are represented by keywords in
a document. Manual ontology building is more accurate than automatic building,
but it is also more expensive and time consuming, because for a large corpus of
texts it takes too much time to read all the documents and select keywords for
each of them.
We focused on one of the classes of the Energy Ontology, Energy Security,
selecting a subset of 50 documents; we then chose the 25 documents most relevant
to this topic, on which we based our experiment (figure 12).
Figure 12
The ontology population process consisted of three main steps:
1. Documents Retrieval
Generally, documents are recovered manually from the web by submitting queries to
a search engine, but in our case we used a crawler, an automatic agent that can be
configured with special queries and that searches for documents on the web
automatically. We compared three different crawlers, Infospiders, Best First and
SS Spider, on the Energy Security class, using queries of the form “class concept
+ subclass concepts” (for instance “Energy Security + Reliability of Supply”);
Infospiders was the best one, with the highest precision and recall.
2. Keywords Extraction from documents.
We read every document and manually extracted the significant keyphrases for
each of them. We selected only the compound words that are most representative
in a text and that allowed us to describe a whole document using only a
small collection of terms.
Figure 13 shows an example of a text about Energy Security from which we
extracted a keyphrases list.
Figure 13. Example of Keyphrases Extraction.
Reliance on foreign sources of energy and geopolitics There has certainly been a
recognition in recent months and years that energy security is a concern. Even US president
George Bush admitted during his 2006 State of the Union speech that “Keeping America
competitive requires affordable energy. And here we have a serious problem: America is
addicted to oil, which is often imported from unstable parts of the world. The best way to
break this addiction is through technology.”
…
The higher prices at petrol pumps in recent months may be a blessing in disguise if it makes
consumers also think more about energy conservation and alternatives, for the market may
respond to that.
…
Oil and other fossil fuel depletion Reliance on foreign sources of energy and geopolitics.
Energy needs and demands of growing countries such as China and India Economic
efficiency versus population growth Need to invest in alternatives to fossil fuels…
…
Oil and other fossil fuel depletion. Many fear that the world is quickly using up the vast but
finite amount of fossil fuels. Some fear we may have already peaked in fossil fuel extraction
and production. So much of the world relies on oil, for example, that if there has been a
peak, or if a peak is imminent, or even if a peak is some way off, it is surely environmentally,
geopolitically and economically sensible to be efficient in use and invest in alternatives.
Keyphrases: affordable energy, energy conservation, energy needs, to invest in
alternatives, fossil fuel depletion
We then tagged the obtained keyphrases, annotating them with semantic and
linguistic information such as parts of speech. This manual process is supervised
by a human, who applies his lexicographic expertise in a specific domain. He can
obviously extract a more accurate list of keywords than an automatic system, but
he cannot be as fast.
Manual extraction therefore presents some drawbacks:
• time consumption
• expensiveness
Figure 14 shows an example of PoS annotation using syntactic labels.
Figure 14.
In this stage we identified 322 keyphrases (see Appendix), which represent our
gold standard in the experiment because they were manually extracted; we then
trained LAKE with them in order to compare the manual results with the automatic
results.
3. Classification of Keyphrases as concepts in the ontology.
In the last stage we used the automatically extracted keywords to populate
the ontology. In this way we could add new concepts, to which we assigned
representative keywords, using two approaches with two different tools:
• Manual
Putting the extracted keyphrases into an ontology editor, Protégé in our
case, and creating classes, instances and properties in order to infer
hyponymy and hyperonymy relations and subsumption relations, and
linking these elements to form a semantic network.
• Semi-automatic
Using a tool like ONTOGEN, based on a machine learning approach, which
automatically extracts keywords from a large corpus of documents and is
trained to classify these keywords as concepts of the ontology. Ontogen
allowed us to create subsumption relations between the extracted concepts.
Lexical acquisition is a methodology that allows us to obtain lexical units in order
to describe a concept using lexical knowledge. In our experiment we considered
Keyphrase Extraction (KE) as a lexical acquisition technique. It is an automatic
method for extracting relevant keyphrases from a text, and we used it because it
allowed us to populate the ontology with new concepts, relations and lexical
resources.
This work focused on the use of a linguistically motivated methodology to identify
candidate keyphrases. After the identification of the linguistically motivated
candidate keyphrases, making use of named entities, multiwords and sequences of
PoS tags (referred to as patterns), the process continues by selecting the best
candidates by means of a learning device.
The methodology has been implemented in LAKE, a keyphrase extractor based on
a supervised learning approach (see the previous sections for more details).
In the former case, the focus was on bi-grams (for instance Named Entity + noun,
sequences of adjective+noun, etc.), while in the latter longer sequences of parts
of speech are considered, often containing verbal forms (for instance
noun+verb+adjective+noun). Sequences such as noun+adjective, which are not
allowed in English, were not taken into consideration, and patterns containing
punctuation were eliminated. A restricted number of PoS sequences were manually
selected as potentially significant for describing the setting, the protagonists
and the main events of a document. To this end, particular emphasis was given to
named entities and to proper and common names.
In a keyphrase the head unit is the most important element, because it represents
the concept on which the other components depend. For example, in “affordable
energy”, “energy” is the head unit because “affordable energy” is a type of
“energy”; likewise “energy conservation” denotes a type of “conservation”, and so
on. The head unit helps us to identify the possible co-occurrences and
transformations (Harris 1970) that can be used in a sentence starting from that
specific keyphrase.
Type of Phrase   Pattern   Example               Head unit
Bi-Gram          AN        affordable energy     N (energy)
                 NN        energy conservation   N (conservation)
It was decided to estimate the TF*IDF values using the head of the candidate
phrase, instead of the phrase itself, according to the principle of headedness
(Arampatzis, 2000): any phrase has a single word as its head. The head is the
main verb in the case of verb phrases, and a noun (the last noun before any
postmodifiers) in noun phrases. As the learning algorithm, we used the Naïve
Bayes classifier provided by the WEKA package (Witten and Frank, 1999).
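The headedness rule just stated (main verb for verb phrases, last noun before any postmodifiers for noun phrases) can be sketched as follows; the word/tag encoding is a simplified assumption for illustration, not LAKE's internal representation.

```python
def head_unit(words, pattern):
    """Return the head of a keyphrase from its PoS pattern string.

    Heuristic from the text: the head is the main verb in a verb phrase,
    otherwise the last noun before any postmodifier (here approximated as
    the part of the pattern before the first preposition P or conjunction C).
    """
    if "V" in pattern:                       # verbal keyphrase
        return words[pattern.index("V")]
    # cut the pattern at the first postmodifier marker, if any
    cut = min((pattern.index(c) for c in "PC" if c in pattern),
              default=len(pattern))
    nouns = [w for w, tag in zip(words[:cut], pattern[:cut]) if tag == "N"]
    return nouns[-1]

print(head_unit(["affordable", "energy"], "AN"))              # energy
print(head_unit(["demand", "for", "energy"], "NPN"))          # demand
print(head_unit(["to invest", "in", "alternatives"], "VPN"))  # to invest
```

The three examples reproduce the head units listed in the pattern tables (“energy”, “demand”, “to invest”).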
The classifier was trained on a corpus with the available keyphrases. Each of them
was marked as a positive example of a relevant keyphrase for a certain document if
it was present in the assessor’s judgment of that document; otherwise it was
marked as a negative example.
Then the two features (i.e. TF*IDF and first occurrence) were calculated for each
word. The classifier was trained on this material and returned a ranked word list
(e.g. energy, oil, exporter, etc.). The system automatically looks among the
candidate phrases for those containing these words, in our case affordable
energy, Arabian oil, main exporter, etc. The top candidate phrases matching the
word output of the classifier are kept.
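The final matching step can be sketched as below; the TF*IDF helper illustrates one of the two features, and all names and data here are illustrative assumptions, not LAKE's actual code.

```python
from math import log

def tf_idf(term, doc, corpus):
    """TF*IDF of a head word: one of the two features fed to the classifier.
    doc is a list of tokens; corpus is a list of such documents."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # document frequency
    return tf * log(len(corpus) / df)

def select_candidates(candidates, ranked_words, heads, top_n=3):
    """Keep the candidate phrases whose head appears among the top words
    of the ranked list returned by the classifier."""
    best = set(ranked_words[:top_n])
    return [c for c in candidates if heads[c] in best]

# Hypothetical classifier output and candidate phrases:
ranked = ["energy", "oil", "exporter", "market"]
candidates = ["affordable energy", "Arabian oil", "main exporter", "weak dollar"]
heads = {"affordable energy": "energy", "Arabian oil": "oil",
         "main exporter": "exporter", "weak dollar": "dollar"}
print(select_candidates(candidates, ranked, heads))
# ['affordable energy', 'Arabian oil', 'main exporter']
```

“weak dollar” is discarded because its head, “dollar”, is not among the top-ranked words.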
3.2.2 Results
In the experiment we used 25 documents for which the manually extracted
keyphrases were known, and we calculated the average precision and recall.
First, we trained the LAKE system on 20 documents, and then we used the
remaining 5 documents to test the system. The results were quite encouraging:
we obtained about 56% precision and 40% recall (figure 15).
Figure 15. Average Precision and Average Recall
In the experiment on the Energy Security class we obtained patterns ranging from
bi-grams to four-grams, as shown below:

Type of Phrase   Pattern   Example                              Head unit
Bi-Gram          AN        affordable energy                    N (energy)
                 NN        energy conservation                  N (conservation)
Tri-Gram         NPN       demand for energy                    N (demand)
                 ANN       fossil fuel dependency               N (dependency)
                 VPN       to invest in alternatives            V (to invest)
                 APN       vulnerable to shortages              A (vulnerable)
                 NCN       disruptions & vulnerability          N (distr. – vuln.)
Four-Gram        ANPN      wasteful use of resources            N (use)
                 NPAN      dependence on foreign oil            N (dependence)
                 ANNN      liquid transportation fuels crisis   N (crisis)
                 AANN      unstable foreign oil supplier        N (supplier)
                 NPNN      cost of oil dependence               N (cost)
Syntactic patterns have a twofold objective:
• focusing on bi-grams (for instance Named Entity + noun, sequences of
adjective+noun, etc.) to describe a precise and well-defined entity;
• considering longer sequences of PoS, often containing verbal forms (for
instance noun+verb+adjective+noun), to describe concise events/situations.
Once all the bi-grams, tri-grams and four-grams are extracted by the linguistic
pre-processor, they are filtered with the patterns defined above. The result of
this process is a set of keyphrases that may represent the current document.
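The filtering step can be sketched as follows, using a subset of the patterns from the tables above:

```python
# Allowed PoS patterns from the experiment's tables (a subset):
PATTERNS = {"AN", "NN",                      # bi-grams
            "NPN", "ANN", "VPN", "APN",      # tri-grams
            "ANPN", "NPAN", "ANNN", "AANN"}  # four-grams

def filter_ngrams(tagged_ngrams):
    """Keep only the n-grams whose PoS sequence matches a selected pattern."""
    kept = []
    for words, tags in tagged_ngrams:
        if "".join(tags) in PATTERNS:
            kept.append(" ".join(words))
    return kept

ngrams = [(("affordable", "energy"), ("A", "N")),
          (("demand", "for", "energy"), ("N", "P", "N")),
          (("energy", "affordable"), ("N", "A")),   # noun+adjective: rejected
          (("wasteful", "use", "of", "resources"), ("A", "N", "P", "N"))]
print(filter_ngrams(ngrams))
# ['affordable energy', 'demand for energy', 'wasteful use of resources']
```

The noun+adjective sequence is discarded because its pattern is not in the allowed set, mirroring the constraint described above.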
3.3 Discussion
In this study we presented a linguistically motivated text mining approach using,
mainly, lexical acquisition and keyphrase extraction methodologies in order to
populate the energy ontology, first manually and then automatically. In the manual
stage, we extracted 322 keyphrases from 25 documents; these are well defined
because supervised by the lexicographer's expertise. In the automatic stage we
used LAKE to extract keyphrases from the same documents. Finally, we compared the
two approaches, matching their results and calculating the average precision and
recall. The results are quite encouraging: precision measures 56% and recall 40%.
In more than half of the cases LAKE extracted relevant keywords, many of which
correspond to those chosen by the human. Furthermore, the most interesting results
concern some isolated keywords that were automatically extracted by the system and
deemed relevant, but were not covered by the lexicographer during the first manual
extraction step; they mainly concern the Energy domain in general rather than
Energy Security specifically. Since we manually extracted keyphrases from those 25
documents only for Energy Security, we can suppose that higher results could be
obtained by comparing the automatically extracted keyphrases with manually
extracted ones covering the general Energy domain.
Chapter 4
Conclusions
Ontology population is a serious problem, which we addressed using two main
approaches, manual and automatic. First we retrieved 200 documents from the web
using search engines and crawlers, and manually built the most important part of
the energy domain ontology. We compared three different crawlers, Infospiders,
Best First and SS Spider, on the Energy Security class, using queries of the form
“class concept + subclass concepts” (for instance “Energy Security + Reliability
of Supply”); Infospiders was the best one, with the highest precision and recall.
In a subsequent step we read most of the documents in order to manually build the
energy domain ontology. We then considered several automatic or semi-automatic
techniques for populating the ontology; in our experiment we used lexical
acquisition, training LAKE, a keyphrase extraction algorithm, with 25
linguistically annotated documents. Finally we compared manual versus automatic
ontology building, and the results were quite encouraging: the averages of the two
measures, precision and recall, were 56% and 40% respectively. Manual ontology
building is obviously more accurate, because the human is an expert lexicographer,
but it is also more time consuming and expensive, so an automatic or
semi-automatic approach can be very useful in some steps.
4.1 Future Works
In future work we will enlarge the corpus for the experiment to 200 or more
documents.
We then want to generate a Local Grammar for the energy domain ontology.
This is a bottom-up approach that could help us describe how language is used in a
specific domain. A statement in a Local Grammar is composed of lexical resources
and part-of-speech patterns (figure 16); for example, for Energy Security we have
keyphrases and their types of phrase (affordable energy, AN =
adjective+noun; reliability of supply, NPN = noun+preposition+noun; etc.), so we
can build not only a controlled vocabulary but also a better description of these
local syntactic constraints.
Figure 16. Example of Local Grammar
Appendix
Keyphrases manually extracted
affordable/energy,.AN:s+/N
air/pollution/stemming,.NNN:s-/N
Alaska's/Arctic/National/Wildlife/Refuge,.NPANAN:s-/N
alternative/energy/source,.ANN:s+/N
alternative/fuel,.AN:s+/N
alternative/supply,.AN:s+/N
American-made/car,.AAN:s+/N
atmospheric/GHG/concentration,.ANN:s+/N
atmospheric/greenhouse/gas/concentration,.ANNN:s+/N
availability/of/reliable/and/affordable/energy/supplies,.NPACANN:s+/N
availability/of/water,.APN:s-/N
cartel’s/market/share,.NPNN:s+/N
CECP/label,.NN:s+/N
Certification/Center/for/Energy/Conservation/Products,.NNPNNN:s-/N
China’s/crude/oil/import,.NPANN:s+/N
China’s/energy/consumption,.NPNN:s+/N
China’s/energy/demand,.NPNN:s+/N
China’s/expanding/search/for/oil/resources,.NPNNPNN:s-/N
China’s/growing/energy/security,.NPNNN:s-/N
China’s/growing/LNG/demand,.NPNNN:s-/N
China’s/impact/on/the/global/oil/markets,.NPNPDANN:s+/N
China’s/oil/demand/and/supply,.NPNNCN:s+/N
China’s/thirst/for/foreign/oil,.NPNPAN:s-/N
Chinese/economy/expansion,.ANN:s-/N
Chinese/energy/investment,.ANN:s+/N
clean/coal,.AN:s+/N
clean/energy/and/water/protection/in/China,.ANCNNPN:s+/N
clean/technology,.AN:s+/N
climate/change,.NN:s+/N
climate/change/strategy,.NNN:s+/N
climate/polluter,.NN:s+/N
coal/burning/for/energy,.NNPN:s-/N
commercial/stocks/decline,.ANN:s+/N
competition/for/fresh/water,.NPAN:s-/N
competition/for/resources,.NPN:s+/N
construction/of/a/heavywater/production/plant,.NPDNNN:s+/N
conventional/use/of/coal,.ANPN:s+/N
cooperation/or/competition/for/energy,.NCNPN:s-/N
cooperative/security,.AN:s-/N
cost/of/oil/dependence,.NPNN:s+/N
cut/in/oil/supply,.NPNN:s+/N
degree/of/import/concentration,.APNN:s+/N
demand/for/energy,.NPN:s-/N
demand/for/water,.NPN:s+/N
demand/growth,.NN:s-/N
dependence/on/foreign/oil,.NPAN:s-/N
dependence/on/gas,.NPN:s-/N
dependence/on/imports,.NPN:s-/N
dependence/on/oil,.NPN:s-/N
dependency/on/oil/imports,.NPNN:s-/N
dependent/on/the/Middle/East/for/oil,.NPDANPN:s+/A
destruction/of/water/ecosystems,.NPNN:s+/N
development/of/nuclear/technology,.NPAN:s+/N
disruption/in/gas/supplies,.NPNN:s+/N
disruption/in/oil/supplies,.NPNN:s+/N
disruption/of/energy/supplies,.NPNN:s+/N
disruption/of/Venezeluan/oil/supplies,.NPNNN:s+/N
disruptions/&/vulnerability,.NCN:s-/N
disruptions/and/vulnerability,.NCN:s-/N
distribution/&/transmission/annual/mileage,.NCNAN:s+/N
distribution/and/transmission/annual/mileage,.NCNAN:s+/N
diverse/supply/of/reliable,.ANPN:s+/N
diversification/of/energy,.NPN:s+/N
diversification/of/energy/sources,.NPNN:s-/N
diversification/project,.NN:s+/N
diversification/source,.AN:s+/N
domestic/alter/supply,.AAN:s+/N
domestic/alternative/to/LNG,.ANPN:s+/N
domestic/oil/production,.ANN:s+/N
draw/rate/capability,.VNN:s+/N
drawdown/coordination,.NN:s-/N
echologic/vehicle,.AN:s+/N
ecological/problem,.AN:s+/N
economic/cost/of/oil/dependence,.ANPNN:s+/N
economic/dependence,.AN:s-/N
economy/stimulation,.NN:s+/N
efficient/fuel,.AN:s+/N
Efficient/Lighting/Initiative,.ANN:s-/N
electricity/shortage,.NN:s+/N
electricity/supply,.NN:s+/N
emergency/preparedness,.NN:s-/N
emergency/stock,.NN:s+/N
emergency/system,.NN:s+/N
energy/conservation,.NN:s-/N
energy/consumption,.NN:s+/N
energy/cooperation,.NN:s-/N
energy/crisis,.NN:s+/N
energy/diversification/and/development,.NNCN:s+/N
energy/efficiency,.NN:s-/N
energy/efficiency/labeling,.NNN:s-/N
energy/efficiency/product,.NNN:s+/N
energy/independence,.NN:s-/N
energy/independence,.NN:s-/N
energy/independence,.NN:s-/N
energy/indipendence,.NN:s-/N
energy/insecurity,.NN:s-/N
energy/interdependence,.NN:s-/N
energy/interest,.NN:s+/N
energy/policy,.NN:s+/N
energy/price,.NN:s+/N
energy/price/fluctuation,.NNN:s+/N
energy/security,.NN:s-/N
energy/security/measure,.NNN:s+/N
energy/security/solution,.NNN:s+/N
energy/shortage,.NN:s+/N
energy/strategy,.NN:s+/N
energy/supplier,.NN:s+/N
energy/supply,.NN:s+/N
energy/technology,.NN:s+/N
energy-dependent/state,.NAN:s+/N
energy-hungry/economy,.NAN:s+/N
energy-transit/monopoly,.NNN:s+/N
environment/preservation,.NN:s-/N
environmental/and/social/impact,.ACAN:s+/N
environmental/degradation,.AN:s-/N
environmental/degradation,.AN:s-/N
environmental/impact/of/electricity/generation,.ANPNN:s+/N
environmental/issue,.AN:s+/N
environmental/regulation,.AN:s+/N
environmental/repercussion,.AN:s+/N
environmentally/energy,.AN:s+/N
foreign/policy,.AN:s+/N
fossil/fuel/dependency,.ANN:s-/N
fossil/fuel/depletion,.ANN:s-/N
fuel/efficiency,.NN:s+/N
fuel-efficient/motor/oil,.NANN:s+/N
fuels/crisis,.NN:s+/N
fuels/independence,.NN:s-/N
gas/distribution,.NN:s-/N
gas/gathering,.NN:s-/N
gas/transmission/offshore,.NNA:s-/N
gas/transmission/onshore,.NNA:s-/N
geopolitical/change,.AN:s+/N
geopolitical/crisis,.AN:s+/N
geopolitical/leverage,.AN:s+/N
geopolitical/risk,.AN:s+/N
geopolitical/security,.AN:s-/N
global/environment,.AN:s+/N
Global/Environment/Facility,.ANN:s-/N
global/insecurity,.AN:s-/N
global/oil/consumption,.ANN:s-/N
global/oil/demand/growth,.ANNN:s-/N
global/oil/trade,.ANN:s+/N
global/supply/of/oil,.ANPN:s-/N
global/temperature/rise,.ANN:s+/N
global/warming,.AN:s-/N
global/warming,.AN:s-/N
global/warming/emission,.ANN:s+/N
green/labeling,.AN:s-/N
growing/demand/for/energy,.ANPN:s+/N
hazardous/liquid/offshore,.ANA:s+/N
hazardous/liquid/onshore,.ANA:s+/N
hight/energy/price,.ANN:s+/N
hight/energy/price,.NNN:s+/N
hydrogen/fuel/cell,.NNN:s+/N
importation/of/nuclear/material,.NPAN:s+/N
improving/fuel/economy,.ANN:s+/N
invest/in/alternatives,.VPN/V
investment/strategy,.NN:s+/N
Iran’s/civilian/nuclear/energy/infrastructure,.NPAANN:s+/N
Iran’s/nuclear/aspiration,.NPAN:s+/N
Iran’s/nuclear/program,.NPAN:s+/N
James/Schlesinger,.NN:s-/N
liquid/accident,.AN:s+/N
liquid/fuels/independence,.ANN:s-/N
liquid/transportation/fuels/crisis,.ANNN:s+/N
Liquified/Natural/Gas/dependence,.AANN:s-/N
Liquified/Natural/Gas/terminal,.AANN:s+/N
LNG/dependence,.NN:s-/N
LNG/supply/chain,.NNN:s+/N
LNG/terminal,.NN:s+/N
longer-term/security,.ANN:s-/N
long-term/energy/policy,.ANNN:s+/N
macroeconomic/cost/of/oil/market/disruptions,.ANPNNN:s+/N
main/oil/nigerian/city,.ANAN:s+/N
military/expenditure,.AN:s+/N
military/implications,.AN:p-/N
mismanagement/and/hydropower/construction,.NCNN:s+/N
monopolistic/cartel,.AN:s+/N
national/security/implication,.ANN:s+/N
national/security/implication,.ANN:s+/N
natural/gas/transit/monopoly,.ANNN:s+/N
Net/Benefits/of/Stockpile/Expansion,.ANPNN:p-/N
Nigeria/violence,.NN:s-/N
no/energy/indipendence,.CNN:s-/N
nonproliferation/norm,.NN:s+/N
Nuclear/Non-Proliferation/Treaty,.ACNN:s-/N
nuclear/proliferation/in/Iran,.ANPN:s-/N
OECD/country,.NN:s+/N
oil/consumer,.NN:s+/N
oil/consumption,.NN:s+/N
oil/cutoff,.NN:s+/N
oil/demand,.NN:s+/N
oil/dependence,.NN:s-/N
oil/dependence/problem,.NNN:s+/N
oil/depletion,.NN:s-/N
oil/disruption,.NN:s+/N
oil/import,.NN:s+/N
oil/independence,.NN:s-/N
oil/market,.NN:s-/N
oil/market/condition,.NNN:s+/N
oil/market/volatility,.NNN:s-/N
oil/peak,.NN:s+/N
oil/price,.NN:s+/N
oil/price/impact,.NNN:s+/N
oil/price/stabilization,.NNN:s-/N
oil/price/volatility,.NNN:s+/N
oil/producer,.NN:s+/N
oil/production,.NN:s+/N
oil/profit,.NN:s+/N
oil/recovery/program,.NNN:s+/N
oil/reserve,.NN:s+/N
oil/security/project,.NNN:s+/N
oil/security/resource,.NNN:s+/N
oil/shock,.NN:s+/N
oil/stockpiling,.NN:s+/N
oil/stockpilings,oil/stockpiling.NN:p+/N
oil/supplier,.NN:s+/N
oil/supply,.NN:s+/N
oil/supply/disruption,.NNN:s+/N
oil/use,.NN:s+/N
oil/use/reduction,.NNN:s+/N
oil-driven/world/economy,.ANN:s+/N
Opec/cartel,.NN:s+/N
OPEC/member,.NN:s+/N
Opec/price/manipulation,.NNN:s+/N
Pakistan/instability,.NN:s-/N
partial/monopoly,.AN:s-/N
peak/oil,.NN:s-/N
peak/oil,.NN:s+/N
petroleum/dependence,.NN:s-/N
petroleum/import,.NN:s+/N
petroleum/security,.NN:s-/N
petroleum/stock,.NN:s+/N
petroleum/supply,.NN:s+/N
petroleum/vulnerability,.NN:s+/N
pipeline/incident,.NN:s+/N
pipeline/operator,.NN:s+/N
pipeline/safety,.NN:s-/N
pipeline/safety/program,.NNN:s+/N
pipeline/system,.NN:s+/N
policy/and/decision-making/process,.NCNVN:s+/N
policy/coordination,.NN:s-/N
policy/measure,.NN:s+/N
political/feasibility,.AN:s-/N
political/promise,.AN:s+/N
potential/cost/of/oil/dependence,.ANPNN:s+/N
potential/energy/crisis,.ANN:s+/N
promoting/alternative,.VN/V
public/transportation,.AN:s+/N
recession/fear,.NN:s+/N
regional/security,.AN:s-/N
reliable/energy,.AN:s+/N
reliable/supply,.AN:s+/N
reliance/on/foreign/sources/of/energy,.NPANPN:s-/N
renewable/energy/source,.ANN:s+/N
reserve/size,.NN:s+/N
resource/depletion,.NN:s-/N
resource/scarcity,.NN:s-/N
rising/energy/price,.ANN:s+/N
rising/terrorism,.AN:S-/N
risk/of/nuclear/proliferation,.NPAN:s+/N
sea/lane,.NN:s+/N
second/oil/imports/availability,.ANNN:s-/N
secure/supply/of/oil/and/gas,.ANPNCN:s+/N
security/cooperation,.NN:s-/N
security/of/supply,.NPN:s-/N
security/of/transit,.NPN:s-/N
security-environment/coalition,.NNN:s+/N
shortage/of/energy/supplies,.NPNN:s+/N
short-term/stability,.ANN:s-/N
Sino-U.S./energy/relation,.ANN:s+/N
small/number/of/supplies,.ANPN:s+/N
social/and/environmental/implication/of/the/pipeline’s/development,.ACANPDNPN:s+/N
sourcing/diversification,.NN:s+/N
speed/of/changes/in/the/climate/system,.NPNPDNN:s-/N
SPR,.N:s+/N
stability/of/nations/that/supply/energy,.NPNPVN:s-/N
stable/supply,.AN:s+/N
stockpile/expansion,.NN:s-/N
stockpile/size,.NN:s+/N
stockpile/use,.NN:s+/N
stockpiling/management,.NN:s-/N
strategic/fuel,.AN:s+/N
strategic/oil/reserve,.ANN:s+/N
strategic/oil/stockpiling,.ANN:s+/N
Strategic/Petroleum/Reserve,.ANN:s+/N
strategic/reserve,.AN:s+/N
strategic/reserves,strategic/reserve.AN:p+/N
supply/disruption,.NN:s+/N
supply/instability,.NN:s+/N
supporting/dictatorship,.AN:s+/N
sustainable/development/goal,.ANN:s+/N
sustainable/energy/future,.ANN:s-/N
technological/improvement,.AN:s+/N
technological/process,.AN:s+/N
the/largest/consumer/of/coal,.DANPN:s+/N
the/world’s/second/largest/emitter/of/greenhouse/gas/emissions,.DNPAANPNNN:s-/N
to diversify/supply,.VN/V
to/develop/cleaner/energy/resources,.PVANN/V
to/increase/use/of/gas,.PVNPN/V
transfer/of/wealth,.NPN:s-/N
United/States/oil/consumption,.ANNN:s-/N
unstable/foreign/oil/supplier,.AANN:s+/N
uranium/enrichment,.NN:s+/N
US/dependence/on/petroleum/inports,.ANNPNN:s-/N
vehicle/technology,.NN:s+N
vulnerable/economy,.AN:s+/N
vulnerable/energy/infrastructure,.ANN:s+/N
vulnerable/to/disruptions,.APN:s+/N
vulnerable/to/shortages,.APN:s+/A
vulnerable/to/shortages,.APN:s+/A
wasteful/use/of/resources,.ANPN:s+/N
water/resources/control,.NNN:s+/N
water/security,.NN:s-/N
water/security/crisis,.NNN:s+/N
water/shortage,.NN:s+/N
water/supply,.NN:s+/N
weak/dollar,.AN:s-/N
West-East/natural/gas/pipeline/project,.AANNN:s+/N
West-East/pipeline/project,.ANN:s+/N
world/oil/market,.NNN:s+/N
world/oil/price,.NNN:s+/N
Bibliography
Alani H., Kim S., Millard D. E., Weal M. J., Hall W., Lewis P. H. and Shadbolt N.
R., Automatic Ontology-Based Knowledge Extraction from Web Documents,
University of Southampton, 2003
Allemang D. Ontologies, Reuse and Domain Analysis, TopQuadrant, Inc., 2006
Arampatzis A., van der Weide T., Koster C. and van Bommel P. An evaluation of
linguistically-motivated indexing schemes. In Proceedings of the BCSIRSG
’2000, 2000
Arampatzis A., van der Weide T., Koster C. and van Bommel P. Linguistically
motivated information retrieval. In Allen Kent, editor, Encyclopedia of
Library and Information Science, volume 69. Marcel Dekker, Inc., New
York, Basel, December 2000. To appear. Currently available on-line from
http://www.cs.kun.nl/ avgerino/encyclopTR.ps.Z
Barker K. and Cornacchia N. Using noun phrase heads to extract document
keyphrases. In Proceedings of the Thirteenth Canadian Conference on
Artificial Intelligence, pages 40–52, 2000
Berners-Lee T.J., Hendler J., Lassila O. The Semantic Web, Scientific American,
May 2001, pp. 28-37
http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html
Berners-Lee T.J., Cailliau R., Groff J.F. The world-wide web, In Congrès Joint
European networking conference No3, Innsbruck, AUTRICHE 1992, vol. 25,
no 4-5 (270 p.) (6 ref.), pp. 454-459, 1992
Bourigault D. Surface grammatical analysis for the extraction of terminological
noun phrases. In Proceedings of COLING 92, 1992
Brambring M.. Mobility and orientation processes of the blind. In D. H. Warren
and E. R. Strelow, editors, Electronic Spatial Sensing for the Blind, pages
493–508, USA, 1984. Dordrecht, Lancaster, Nijhoff
Brent M.R. From grammar to lexicon: Unsupervised learning of lexical syntax.
Computational Linguistics 19:243-262, 1993
Brill E. Some advances in transformation-based part of speech tagging. In National
Conference on Artificial Intelligence, 1994
Caropreso M.F., Matwin S. and Sebastiani F. A learner-independent evaluation of
the usefulness of statistical phrases for automated text categorization. In
Amita G. Chin, editor, Text Databases and Document Management: Theory
and Practice, pages 78–102. Idea Group Publishing, Hershey, US, 2001
Carreras X., Màrques L. and Padrò L. Named entity extraction using adaboost. In
Proceedings of CoNLL-2002, pages 167–170. Taipei, Taiwan, 2002
Celjuska D. Semi-automatic Construction of Ontologies from Text. Master’s Thesis,
Department of Artificial Intelligence and Cybernetics, Technical University
Kosice, 2004
Celjuska D., Vargas M. Ontosophie: A Semi-Automatic System for Ontology
Population from Text, KMi - Knowledge Media Institute, The Open
University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom, 2004
Charniak E. Statistical Language Learning. MIT Press, 1993
Chieko A. and Lewis C. Home page reader: IBM’s talking web browser. In Closing
the Gap Conference Proceedings, 1998
Choueka Y. Looking for needles in a haystack or locating interesting collocational
expressions in large textual database. In Proceedings of the RIAO, pp. 38-43,
1998
Church K. and Hanks P. Word association, norms, mutual information, and
lexicography. Computational Linguistics, 16(1), 1990
Clark P. and Boswell R. Rule induction with CN2: Some recent improvements. In
Proceedings of the 5th European Working Sessions on Learning, pages 151-
163, Porto, Portugal, 1991
Cucerzan S. and Yarowsky D. Language independent named entity recognition
combining morphological and contextual evidence. In Proceedings of 1999
Joint SIGDAT Conference on EMNLP and VLC, 1999
Dagan I. and Itai A. Word sense disambiguation using a second language
monolingual corpus. Computational Linguistics, 20:563–596, 1994
Daille B., Gaussier E. and Lange J.M. Towards automatic extraction of
monolingual and bilingual terminology. In Proceedings of COLING 94, 1994
Damerau F. Generating and evaluating domain oriented multi-word terms from
texts. Information Processing and Management, 29(4), 1993
D’Avanzo E. Using Keyphrases for Text Mining: Applications and Evaluation. PhD
Dissertation Series. Department of Information and Communication Sciences,
University of Trento. December 2005
D’Avanzo E., Elia A., Kuflik T., Vietri S. LAKE system at DUC 2007. In
Proceedings of the Document Understanding Conference, NAACL-HLT
2007, Rochester, NY, USA, April 22-27, 2007
D’Avanzo E., Magnini B. A Keyphrase-Based Approach to Summarization: the
LAKE System at DUC-2005. In Proceedings of Document Understanding
Workshop, HLT/EMNLP 2005, Vancouver, B.C., Canada, October 6-8, 2005
D’Avanzo E., Magnini B., Vallin A. Keyphrase Extraction for Summarization
Purposes: The LAKE System at DUC-2004. In Proceedings of Document
Understanding Workshop HLT/NAACL 2004. Boston, USA, May 6-7, 2004
Evans D.K., Klavans J.L. and Wacholder N. Document processing with linkit. In
Proceedings of the RIAO Conference, 2000
Fano R. Transmission of Information: A statistical theory of communications. MIT
press, MA, 1961
Fayyad U.M. and Irani K.B. Multi-interval discretization of continuous-valued
attributes for classification learning. In IJCAI, 1993
Fellbaum C. WordNet: An Electronic Lexical Database. MIT Press, 1998
Feigenbaum E.A. and Feldman J., editors, Computers and Thought, McGraw-Hill,
New York, 1963
Fillmore C.J. The case for case. In Bach E. and Harms R.T., editors, Universal in
Linguistic Theory, pages 1-88. Holt, Rinheart and Winston, New York, 1968
Florian R., Ittycheriah A., Jing H. and Zhang T. Named entity recognition through
classifier combination. In Proceedings of CoNLL-2003, 2003
Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G.
Domain-specific keyphrase extraction. In IJCAI, pages 668–673, 1999
Geleijnse G., Korst J. Automatic Ontology Population by Googling, Philips
Research Laboratories, 2005
Gruber T.R. A Translation Approach to Portable Ontology Specifications.
Knowledge Acquisition, 5(2), 1993, pp. 199-220
Harper S. and Patel N. Gist summaries for visually impaired surfers. In
Proceedings of the 7th international ACM SIGACCESS conference on Computers
and accessibility, pp. 90-97, 2005
Harris Z.S. Notes du cours de syntaxe, Paris, Larousse, 1976
Harris Z.S. Papers in Structural and Transformational Linguistics, Reidel,
Dordrecht, 1970
Harris Z.S. Discourse Analysis (1952), in Harris 1970, pp. 313-347
Hearst M. A. Automatic Acquisition of Hyponyms from Large Text Corpora. In
Proceedings of the Fourteenth International Conference on Computational
Linguistics, 1992
Hulth A. Improved automatic keyword extraction given more linguistic knowledge.
In Empirical Methods in Natural Language Processing, 2003
Jackson P. and Moulinier I. Natural Language Processing for Online Applications.
John Benjamins Publishing Company, 2002
Jones S., Jones M. and Deo S. Using keyphrases as search result surrogates
on small screen devices. Personal Ubiquitous Comput., 8(1):55–68, 2004
Justeson J.S. and Katz S.M. Technical terminology: some linguistic properties and
an algorithm for identification in text. Natural Language Engineering, 1,
1995
Kantardzic M. Data Mining. IEEE Press, 2003
Kupiec J., Pedersen J. and Chen F. A Trainable Document Summarizer. In
Proceedings of the 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 1995
Lovins J.B. Development of a stemming algorithm. Mechanical Translation and
Computational Linguistics, 11:22–31, 1968
Manning C.D. and Schütze H. Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts; London, England,
1999
Marchionini G. Information Seeking in Electronic Environments. Cambridge
University Press, 1995
Mast M., Kummert F., Ehrlich U., Fink G.A., Kuhn T., Niemann H., Sagerer G. A
speech understanding and dialog system with a homogeneous linguistic
knowledge base. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(2):179-194, 1994
Meng Tan C., Fang Wang Y. and Do Lee C. The use of bigrams to enhance text
categorization. Inf. Process. Manage., 38(4):529–546, 2002
Miller G.A. WordNet: A lexical database for English. Communications of the
ACM 38(11):39-41, 1995
Miller G.A., Beckwith R., Fellbaum C., Gross D., Miller K.J., Tengi R. Five papers
on WordNet. Technical report, Princeton University, 1993.
ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps. Cited 28 October
2005.
Mitchell T. Machine learning, McGraw Hill, 1997
Moens M.F. Automatic Indexing and Abstracting of Document Texts. Kluwer
Academic, 2000
Nugues P.M. An introduction to language processing with Perl and Prolog. Berlin:
Springer, 2006
Palmer D.D. and Day D.S. A statistical profile of the named entity task. In Fifth
ACL Conference for Applied Natural Language Processing (ANLP-97), 1997
Pianta E., Bentivogli L. and Girardi C. Multiwordnet: developing an aligned
multilingual database. In Proceedings of the First International Conference
on Global WordNet, 2002
Poesio M. Domain modelling and NLP: Formal ontologies? Lexica? Or a bit of
both?, in Applied Ontology, 1(1):27–33, 2005
Quinlan J.R. Learning decision tree classifiers. ACM Comput. Surv., 28(1):71–72,
1996
Quinlan J.R. C4.5: programs for machine learning. Morgan Kaufmann Publishers
Inc., 1993
Quinlan J.R. Induction of decision trees. Machine Learning, 1(1):81–106, 1986
Resnik P. Selectional constraints: an information-theoretic model and its
computational realization. Cognition 61:127-159, 1996
Riloff E. Automatically Generating Extraction Patterns from Untagged Text. In
Proceedings of AAAI-96, 1996
Salton G., editor. Automatic text processing. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 1988
Sanderson M. Reuters Test Collection. In BCS IRSG, 1994
Soderland S., Fisher D., Aseltine J. and Lehnert W. CRYSTAL: Inducing a
Conceptual Dictionary, in Proceedings of the International Joint Conference
on Artificial Intelligence, Montreal, Canada. pp. 1314-1319, 1995
Song M., Song I.Y. and Hu X. Kpspotter: a flexible information gain-based
keyphrase extraction system. In Proceedings of the fifth ACM international
workshop on Web information and data management, pages 50–53. ACM
Press, 2003
Turney P.D. Mining the web for lexical knowledge to improve keyphrase
extraction: Learning from labeled and unlabeled data. Technical Report
ERB-1096. (NRC #44947), National Research Council, Institute for
Information Technology, 2002
Turney P.D. Learning algorithms for keyphrase extraction. Information Retrieval, 2
(4):303–336, 2000
Turney P.D. Learning to extract keyphrases from text. Technical Report ERB-1057.
(NRC #41622), National Research Council, Institute for Information
Technology, 1999
Turney P.D. Extraction of keyphrases from text: Evaluation of four algorithms.
Technical Report ERB-1051. (NRC #41550), National Research Council,
Institute for Information Technology, 1997
Witten I.H. and Frank E. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann, 1999
Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. KEA:
Practical automatic keyphrase extraction. In ACM DL, pages 254–255, 1999