Faculty of Humanities and Philosophy
Specialist Degree Course in Business and Public Communication
Degree Thesis in
Informatics for Electronic Commerce
Applying semantically enhanced web mining techniques for building a domain ontology
Supervisors: Prof. Ernesto D'Avanzo, Prof. Annibale Elia, Prof. Tsvi Kuflik
Candidate: Federica Marano (matr. 0320/400100)
Academic year 2007-2008
Acknowledgments
First of all, I thank Ernesto D'Avanzo and Tsvi Kuflik, who followed my thesis work
taking care of even the smallest detail.
I thank, especially, Annibale Elia and Emilio D'Agostino: thanks to them, I spent
my best moments of study during these years.
Thanks to Brenda Shaffer, an energy domain expert, who helped us to develop this
project.
And last but not least, I thank all those who have always supported me, close or
distant, present or absent.
Abstract
This work focused on applying semantically enhanced web mining techniques for
building a domain ontology. We mainly analyzed the ontology population problem,
because an ontology, to be useful, needs to be continuously updated with new
concepts, relations and lexical resources. There are many methods and several
techniques to do it, but we chose linguistically motivated text mining because
semantic and linguistic information is very important to build an ontology.
According to this point of view, we consider an ontology as a lexical taxonomy
created starting from a corpus. In Chapter 2 we selected the most relevant works in
the literature about linguistically motivated text mining and knowledge lexical
acquisition methodologies for ontology population, distinguishing lexical
acquisition techniques for automatic or semi-automatic ontology population from
lexical acquisition for ontology population using Keyphrase Extraction (KE)
methodologies. In Chapter 3 we discussed our case study on the Energy domain, first
focusing on LAKE (Linguistic Analysis based Knowledge Extractor), a keyphrase
extraction system based on a supervised learning approach that makes use of
linguistic processing of the documents; we used this KE algorithm for our
experiments. We then focused on lexical acquisition fundamentals, reporting, at last,
experiments and results on the Energy Domain Ontology. In Chapter 4 we concluded
the work with some remarks, proposing future work such as Local Grammar
Induction, very useful to describe the way we use a language, especially in a
specific domain, based on a combination of lexical resources and Part-of-Speech
patterns.
Ontology population is a serious problem that we addressed using two main
approaches, manual and automatic. First we retrieved 200 documents on the web
using search engines and crawlers, manually building the most important part of the
energy domain ontology. We compared three different crawlers: Infospiders, Best
First and SS Spider, focused on the Energy Security class, using queries made with
"class concept + subclass concepts" (for instance "Energy Security + Reliability of
Supply"); the best one was Infospiders, whose precision and recall results were the
highest. In a successive step we read most of the documents to manually build the
energy domain ontology. Then we considered many automatic or semi-automatic
techniques for populating the ontology, but in our experiment we used Lexical
Acquisition, training LAKE, a keyphrase extraction algorithm, with 25 linguistically
annotated documents. At last we compared manual versus automatic ontology
building and the results were quite encouraging: the averages of the two measures,
precision and recall, were 56% and 40%, respectively.
Obviously a manual building of an ontology is more accurate, because the human is
an expert lexicographer, but it is more time-consuming and expensive, so an
automatic or semi-automatic approach can be very useful in some steps. In a future
work we will enlarge the corpus for the experiment, considering 200 documents and
many more. Subsequently we want to generate a Local Grammar for the energy
domain ontology. It is a bottom-up approach and could help us describe the way we
use language in a specific domain. A statement in a Local Grammar is composed of
lexical resources and part-of-speech patterns; with it we can build not only a
controlled vocabulary, but we can also better describe local syntactic constraints.
Keywords: energy domain ontology, ontology population, linguistically motivated
text mining, lexical acquisition, Keyphrase Extraction, local grammar.
Contents
INTRODUCTION ................................................................ 6
1.1 THE PROBLEM DEFINITION AND THE RESEARCH QUESTION ................ 11
CHAPTER 2 ................................................................ 13
LITERATURE SURVEY ON LINGUISTICALLY MOTIVATED TEXT MINING AND KNOWLEDGE LEXICAL ACQUISITION METHODOLOGIES FOR ONTOLOGY POPULATION ................ 13
2.1 LEXICAL ACQUISITION FOR AUTOMATIC ONTOLOGY POPULATION ................ 13
2.1.1 Automatic Ontology Population from web documents ................ 13
2.1.2 Automatic Ontology Population by Googling ................ 16
2.2 LEXICAL ACQUISITION FOR SEMI-AUTOMATIC ONTOLOGY POPULATION ................ 18
2.3 LEXICAL ACQUISITION FOR ONTOLOGY POPULATION USING KEYPHRASE EXTRACTION METHODOLOGY ................ 22
2.3.1 Keyphrase Extraction Algorithms ................ 23
2.3.2 Relevant Keyphrase Extraction Algorithms ................ 30
2.4 LINGUISTIC AND STATISTIC PHRASES: SOME REMARKS ................ 48
CHAPTER 3 ................................................................ 52
THE ENERGY DOMAIN CASE STUDY ................................ 52
3.1 LEXICAL ACQUISITION FOR ONTOLOGY POPULATION ................ 61
3.1.1 Ontological organization ................ 61
3.1.2 Parsing with Case Grammar ................ 64
3.1.3 Lexical Acquisition Fundamentals ................ 65
3.1.4 Lexical Acquisition Evaluation ................ 72
3.1.5 The role of lexical acquisition in statistical NLP ................ 73
3.2 EXPERIMENTS WITH ENERGY DOMAIN ONTOLOGY ................ 73
3.2.1 Methods and tools ................ 73
3.2.2 Results ................ 79
3.3 DISCUSSION ................ 81
CHAPTER 4 ................................................................ 82
CONCLUSIONS ................ 82
4.1 FUTURE WORKS ................ 82
APPENDIX ................................................................ 84
KEYPHRASES MANUALLY EXTRACTED ................ 84
BIBLIOGRAPHY ................................................................ 96
Chapter 1
Introduction
In this pilot study we worked on building a domain ontology for the energy domain.
But what is an ontology? It is a philosophical term that concerns being or existence
and its basic categories and relationships, aiming to determine what entities and
what types of entities exist.
In this work, we will treat ontologies as they are used in Computer Science.
Referring to Artificial Intelligence, the Semantic Web and Knowledge
Representation, we can find different definitions, for example:
• An ontology is a document or file that formally defines the relations among
terms. The most typical kind of ontology for the Web has a taxonomy and a
set of inference rules (Berners-Lee, 2001).
• Gruber defines an ontology as an explicit specification of a
conceptualization. A conceptualization is an abstract, simplified view of the
world that we wish to represent for some purpose (Gruber, 1993).
The Web Ontology Language (OWL) has the following elements:
• Individuals/Instances
Members of a class: the "ground level" components of an ontology. They include
concrete objects such as people, animals, things, etc., or abstract objects
such as numbers and words.
• Classes
Categories created while building the ontology. They are abstract groups or
collections of concepts and objects, and may contain individuals, other classes
or both. For instance, in our case "energy source" is a class
containing as individuals "oil", "gas", etc.
• Relationships/Properties
Binary relations between individuals or classes. They describe a sort of link
between objects in the ontology. The main relation is subsumption,
expressing an IS-A relation (for instance, "oil" IS AN "energy source"), but
there are several types of relation. They are of three types:
1. Object: links an individual with a class or with another instance.
2. Datatype: creates an external link with an XML file or an RDF
Schema.
3. Annotation: adds generic information.
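As a rough, purely illustrative sketch (ours, not drawn from any OWL tool), the class/individual/IS-A structure above can be modeled with plain Python dictionaries, reusing the "energy source" example:

```python
# Minimal sketch of an ontology fragment: one class, its individuals,
# and an IS-A (subsumption) check, using plain Python dicts.
ontology = {
    "classes": {"energy source": {"subclass_of": None}},
    "individuals": {
        "oil": {"instance_of": "energy source"},
        "gas": {"instance_of": "energy source"},
    },
}

def is_a(name, cls, onto):
    """Check whether an individual IS A member of the given class."""
    entry = onto["individuals"].get(name)
    return entry is not None and entry["instance_of"] == cls

print(is_a("oil", "energy source", ontology))  # True
```

A real OWL ontology would of course also carry datatype and annotation properties and be handled by a reasoner; this sketch only covers the subsumption example from the text.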
This formalized knowledge can be processed automatically by a machine through a
reasoner that implements inferential and deductive processes.
For many applications, an ontology plays a role in a larger system that can be
designed much like any other complex information system, beginning with a
context of use and requirements analysis. However, for many applications, the
ontology plays the role of mediator for reusable knowledge assets. In such a
situation, the value added by the ontology is only as good as its ability to organize
materials that have not yet been encountered (Allemang, 2006).
Many technologies that now go by the name of "Ontologies" promise a new
paradigm of semantic information sharing. The grandest vision of this is the
Semantic Web, in which the enormous body of data available on the web will be
organized in a way that allows it to be indexed by its meaning, not just by its form
(Allemang, 2006).
An ontology is a valid support to the decision-making process if we consider, for
example, the information needs of a web user. Nowadays the WWW is a very
important tool used for working, studying, communicating and for many other
interests. It contains documents, images, and other multimedia resources about
every topic; everything is immediately available online for everyone with an
Internet connection.
Considering that there are about one billion web pages, it is statistically
impossible to retrieve the information we are interested in at the first attempt.
Why can't we retrieve the correct information on the Web?
It is simple: because, nowadays, information on the web is retrieved using search
engines that consider all documents equal, since information is created only for
human consumption and is not machine readable.
Tim Berners-Lee proposed the Semantic Web as a solution (Berners-Lee, 2001): a
sort of machine-readable Web that uses intelligent agents in order to guide the user
to the specific desired information and to help him carry out some operations
automatically.
The Semantic Web uses schemas to describe domain information, so every domain
needs to be described by a specific schema in which metadata map this
information onto classes or concepts belonging to the domain.
In this way we obtain data structures able to link pieces of information to each
other. The Semantic Web is composed of three levels:
1. Data
The information that describes everything.
2. Meta data
Map data representing a concept into a schema.
3. Ontology
Describes relations between concepts using data classes.
When a user searches for information using search engines, he submits queries
based on keywords in order to get relevant information. These keywords may in
fact be concepts and relations in a domain ontology, and the Semantic Web uses
them in order to create a controlled vocabulary that is unambiguous and machine
readable.
Ontologies give an explicit conceptualization in order to describe semantic data
using languages syntactically and semantically richer than a common database,
obtaining not only a domain knowledge representation, but also a specific point of
view on the domain.
Ontologies are created according to some main principles:
• Exportability
The system must be independent of the application and must be exportable
to other domains.
• Interoperability
The need to share a common representation of information within a group,
using a modular structure.
• Semantic Interoperability
In this way, systems that use different knowledge representations can
communicate using the ontology.
• Modeling
To make explicit inferences of knowledge about a specific domain.
The use of ontologies in a Semantic Web view accords with Web Usability
theories, which study the best approach to web pages when users surf the Web.
Surfing the Web implies rapid and free eye movement, pointing to its
importance among designers and users alike. It has also been long established
(Brambring, 1984; Chieko et al., 1998) that this potentially complex and difficult
movement is further complicated, and becomes neither rapid nor free, if the user is
visually impaired.
Harper and Patel (2005) confirmed that people use “Gist” summaries as support to
decision making within their browsing behavior.
Harper and Patel (2005) have investigated four simple summarization algorithms
against each other and a manually created summary, producing empirical evidence
as a formative evaluation. This evaluation concludes that users prefer simple
automatically generated "gist" summaries, thereby reducing cognitive overload and
increasing awareness of the focus of the Web page under investigation.
Users do not read Web pages, they scan them, and so summaries can be important
elements of Web pages to facilitate scanning and browsing. Hence Harper and
Patel (2005) focused on the development of a Firefox-based tool which creates
summaries of Web pages.
Our work, instead, is based on the Keyphrase (also called keyword) Extraction
approach, very similar to summarization techniques and useful to succinctly
summarize and characterize documents, providing semantic metadata as well.
1.1 The problem definition and the research question
The main problem concerning ontology building is ontology population, because
an ontology, to be useful, needs to be continuously updated with new concepts,
relations and lexical resources. To add new information to the ontology, it is
necessary to retrieve that information from new documents taken, for example,
from the web. But it takes a lot of time for a human to read every document in
order to extract new information. This work focused on applying semantically
enhanced web mining techniques for semi-automatically building a domain
ontology. There are many methods and several techniques to do it, but we chose
linguistically motivated text mining because semantic and linguistic information is
very important to build an ontology. According to this point of view, we consider
an ontology as a lexical taxonomy created starting from a corpus. Using linguistic
and semantic information in ontology building allows us to create a more complete
structure, useful in many applications. For instance, we could create an "intelligent"
search engine dedicated to the energy domain, in which users could retrieve
information using phrases in natural language and not only some of the more
relevant words as keywords. In this way we could train this search engine to answer
in natural language and to return results using linguistic information.
In Chapter 2 we selected the most relevant works in the literature about
linguistically motivated text mining and knowledge lexical acquisition
methodologies for ontology population, distinguishing lexical acquisition
techniques for automatic or semi-automatic ontology population from lexical
acquisition for ontology population using Keyphrase Extraction (KE)
methodologies.
In Chapter 3 we discussed our case study on the Energy domain, first focusing on
LAKE (Linguistic Analysis based Knowledge Extractor), a keyphrase extraction
system based on a supervised learning approach that makes use of linguistic
processing of the documents; we used this KE algorithm for our experiments. We
then focused on lexical acquisition fundamentals, reporting, at last, experiments
and results on the Energy Domain Ontology.
In Chapter 4 we concluded the work with some remarks, proposing future work
such as Local Grammar Induction, very useful to describe the way we use a
language, especially in a specific domain, based on a combination of lexical
resources and Part-of-Speech patterns.
Chapter 2
Literature Survey on linguistically motivated text mining
and knowledge lexical acquisition methodologies for
ontology population
To be a useful tool, an ontology needs to be continuously updated with new
concepts, relations and lexical resources. In the following pages we will show
several methods and techniques used to populate a domain ontology. We will
mainly focus on approaches relevant to our experiments on lexical acquisition and
linguistically motivated text mining. For the former, this includes developing
algorithms and statistical techniques for filling the holes in existing machine-
readable dictionaries by looking at the occurrence patterns of words in large text
corpora. For the latter, this includes surveying linguistic approaches to text mining
that use more linguistic and semantic information, beneficial to populating an
ontology.
2.1 Lexical Acquisition for Automatic Ontology Population
2.1.1 Automatic Ontology Population from web documents
Alani et al. (2003) developed Artequakt, a system that automatically extracts
knowledge about artists from the Web, populates a knowledge base, and uses it to
generate personalized biographies. Artequakt connects a knowledge-extraction tool
with an ontology to get knowledge support for the information extracted. The
extraction tool searches for documents on the web and extracts knowledge, then
compares those results with the given classification structure. This knowledge is
converted into a machine-readable format that is automatically stored in a
knowledge base (KB). A lexicon-based term expansion mechanism is used to
increase knowledge extraction and provides extended ontology terminology.
Artequakt’s architecture (figure 1) includes three areas. First, knowledge extraction
tools assemble information items along with sentences and paragraphs from Web
documents selected manually or obtained automatically by a search engine. The
tools deliver the information fragments to the ontology server along with metadata
derived from the vocabulary. To identify knowledge fragments Artequakt uses
WordNet (www.cogsci.princeton.edu/~wn), a general-purpose lexical database,
and GATE (General Architecture for Text Engineering, http://gate.ac.uk), an entity
recognizer. These tools allow Artequakt to obtain knowledge consisting of not just
entities but also the relations between them.
Figure 1. The Artequakt architecture.
(Alani et al., 2003)
The Knowledge Extraction (KE) module automatically populates the ontology with
information extracted from online documents on the basis of the given ontology’s
representations and WordNet lexicons. Information Extraction tools can recognize
entities in documents like “Rembrandt”, a person, or “15 July 1606”, a date.
However, such information isn't very useful without the relation between these
entities, that is, "Rembrandt was born on 15 July 1606". Extracting such relations
automatically allowed them to acquire more complete knowledge to populate the
ontology. Artequakt attempts to identify entity relationships using ontology relation
declarations and lexical information.
Second, the ontology server stores and consolidates the information so that the
biography generation tool can query the KB using an inference engine. Storing
information in a structured KB supports different knowledge services as that
reconstructing the original source material to produce a dynamic presentation
tailored to user needs. After these steps the system has to populate ontologies with
many high-quality instantiations in order to get valuable and consistent ontology
based knowledge services, adding information into the KB following the ontology
domain representation.
Third, the Artequakt server takes user requests to generate narratives through a
simple web interface. The user might request a particular style of biography, such
as a chronology or summary, or a specific focus such as the artist’s style or body of
work. The server uses story templates to render a narrative from the KB.
The ontology server used for this project uses Java sockets and connects to the
Artequakt KB through the Protégé application programming interface. An
inference engine built on this server allows querying the KB to retrieve specific
information. Artequakt’s ontology server sends some extracted knowledge to a
relational database for quick access to frequently used information through SQL
queries when generating biographies.
2.1.2 Automatic Ontology Population by Googling1
Automatic ontology population using the Google.com search engine is a method,
inspired by Hearst (1992), to populate ontologies with the use of googled text
fragments (Gijs Geleijnse and Jan Korst, 2005), consisting of patterns that express
arbitrary relations between instances of classes.
Hearst patterns (Hearst, 1992) are lexico-syntactic patterns that indicate the
existence of a class/subclass relation in an unstructured data source, e.g. web pages.
According to Hearst, given two terms t1 and t2, we are able to record how many
times some of these patterns indicate an "is-a" relation between them, identifying
hypernym-hyponym relations.
Although these patterns occur quite rarely in unstructured data, they provide
reliable and valuable information.
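As an illustration of ours (not the authors' code), one classic Hearst pattern, "NP such as NP, NP", can be approximated with a regular expression over raw text:

```python
import re

# Simplified Hearst pattern "<hypernym> such as <hyponym>(, <hyponym>)*".
# Real systems match over part-of-speech-tagged noun phrases, not raw words.
PATTERN = re.compile(r"(\w+(?: \w+)?) such as ((?:\w+, )*\w+)")

def hearst_pairs(text):
    """Return (hyponym, hypernym) pairs suggested by the pattern."""
    pairs = []
    for match in PATTERN.finditer(text):
        hypernym = match.group(1)
        for hyponym in match.group(2).split(", "):
            pairs.append((hyponym, hypernym))  # hyponym is-a hypernym
    return pairs

text = "The country imports energy sources such as oil, gas and coal."
print(hearst_pairs(text))  # [('oil', 'energy sources'), ('gas', 'energy sources')]
```

Note that the final conjunct ("and coal") is missed by this deliberately simple expression; published implementations handle coordination ("and", "or") explicitly.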
The method is based on hand-crafted patterns which are tailor-made for the classes
and relations considered. These patterns are queried to Google.com, where the
results are scanned for new instances. Instances found can be used within these
patterns as well, so the algorithm can populate an ontology based on a few
instances in a given partial ontology. The algorithm they describe uses instances of
some classes returned by Google to find instances of other classes. For each
pattern, which represents a binary relation between two objects A and B (A-relation-
B), they can formulate two Google queries: A-relation and relation-B.
1 Googling is a neologism that means to search for information on the web using Google.com as a search engine.
Because Google.com retrieves web pages, not the information itself, the authors
developed a method to automatically extract information from the web using a
wrapper algorithm that crawls a large website and makes use of the homogeneous
presentation of the information on its pages. When instances are denoted in exactly
the same place on each similar page within a website, it is easy to extract them and
to update the ontology.
Gijs Geleijnse and Jan Korst (2005) identify instances with Google's support. After
extracting a term, they check whether the extracted term is really an instance of the
class. They search the Google search engine for phrases that express the term-class
relation. Again, these phrases can be constructed semiautomatically; Hearst
patterns are candidates for this purpose as well. A term is accepted as an instance
when the number of hits of the queried phrase is above a given threshold.
To test their system, Gijs Geleijnse and Jan Korst (2005) selected a small partial
ontology of the movie domain. They identified three classes, of which only the
class Director has instances. In doing so they found movies directed by these
directors. The movies found were used to find starring actors, which in turn
represent the basis of the search for other movies in which they played. The
process continues until no new instances can be found.
With a starting set of only two directors, they found a well-populated ontology
with not only directors, but also actors and movie titles.
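The bootstrapping loop just described (directors lead to movies, movies to actors, actors to further movies, until nothing new is found) can be sketched as follows. This is our own simplification: `find_related` stands in for the pattern-based Google queries, and the lookup table's contents are merely illustrative:

```python
# Sketch of the directors -> movies -> actors -> movies bootstrap.
# RELATED simulates the pattern-based web queries with a tiny lookup table.
RELATED = {
    ("Director", "Scorsese"): [("Movie", "Taxi Driver")],
    ("Movie", "Taxi Driver"): [("Actor", "De Niro")],
    ("Actor", "De Niro"): [("Movie", "Taxi Driver"), ("Movie", "Casino")],
    ("Movie", "Casino"): [("Actor", "De Niro")],
}

def find_related(instance):
    return RELATED.get(instance, [])

def populate(seeds):
    known = set(seeds)
    frontier = list(seeds)
    while frontier:  # the process stops when no new instances are found
        new = [rel for inst in frontier for rel in find_related(inst)
               if rel not in known]
        known.update(new)
        frontier = new
    return known

print(sorted(populate([("Director", "Scorsese")])))
```

Starting from a single director, the loop reaches two movies and one actor and then halts, mirroring the fixed-point behavior described in the text.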
2.2 Lexical Acquisition for Semi-Automatic Ontology
Population
Maria Vargas-Vera and David Celjuska (2004) describe Ontosophie, a system for
semi-automatic population of ontologies that extracts instances from unstructured
text. It learns extraction rules from annotated text and then applies them to new
articles to populate the ontology.
The work aims to accomplish three purposes:
1. identifying key entities in text articles that could participate in ontology
population with instances. The system identifies important entities in a
document, puts them as values in slots v1, v2, . . . , vNi, and thus constructs
an instance composed of these features, Ii = (v1, . . . , vNi), in the given
ontology O. For example, in the class "Conferring an Award" some possible
slots are "has a duration", "has a location", "has an awarding body
(an organization)", etc.;
2. identifying the most probable classes for the population, based on newly
introduced confidence values;
3. semi-automatically populating an ontology with those instances.
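As a small sketch of ours, an instance Ii = (v1, . . . , vNi) can be represented as a record of slot-value pairs; the slot names come from the "Conferring an Award" example above, while the values are invented purely for illustration:

```python
# An instance of the class "Conferring an Award", built from slot values.
# The values here are hypothetical placeholders.
instance = {
    "class": "Conferring an Award",
    "slots": {
        "has a duration": "one evening",
        "has a location": "Stockholm",
        "has an awarding body": "Nobel Foundation",
    },
}

def feature_tuple(inst):
    """Return the feature tuple (v1, ..., vN) of an instance."""
    return tuple(inst["slots"].values())

print(feature_tuple(instance))
```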
The system consists of the following 3 steps:
1. Text Annotation
The system, using supervised learning, is trained to learn extraction rules from a
set of examples, consisting of a set of documents annotated with XML tags and
assigned to one of the predefined classes within the ontology O. Each slot within
the ontology is assigned a unique XML tag; the mark-up step is ontology driven.
Once the user identifies a desired class for a displayed document from the
ontology, he is only offered the tags relevant for the annotation.
2. System Learning, divided in turn into 3 steps
• Natural Language Processing (NLP)
Ontosophie uses shallow parsing to recognize syntactic constructs without
generating a complete parse tree for each sentence. Shallow parsing has the
advantages of higher speed and robustness. High speed is necessary to apply
Information Extraction (IE) to a large volume of documents. The robustness
achieved by using shallow parsing is essential to deal with unstructured texts. In
particular, Ontosophie uses Marmot2, an NLP system that accepts ASCII files
and produces an intermediate level of text analysis that is useful for IE
applications. Sentences are separated and segmented into noun phrases, verb
phrases and other high-level constituents.
After each document has been annotated and pre-processed with the NLP tool, the
generation of extraction rules takes place.
• Generating extraction rules
This phase makes use of Crystal3, a conceptual dictionary induction system
(Soderland et al., 1995).
Crystal derives a dictionary of concept nodes and extraction rules from a training
corpus, based on a specific-to-general algorithm. For instance, an extraction rule
might be understood as follows: conferring-an-award: <VB-P "been awarded">
<OBJ1 ANY> <PP "by" has-awarding-body>, Coverage: 5, Error: 1.
2 Marmot was developed at the University of Massachusetts, MA, USA.
3 Crystal was developed at the University of Massachusetts, MA, USA.
The selected rule's purpose is to extract conferring-an-award, which refers to the
name of a class in the ontology. This extraction rule is aimed at extracting
has-awarding-body, the name of a donor.
The rule fires only if all the constraints are satisfied. This means that the entity
conferring-an-award is extracted from a sentence, or part of one, only in the case
where it contains "has been awarded" as a passive verb (VB-P), an object (OBJ1)
that might be anything, and a prepositional phrase (PP) which starts with the
preposition "by". When this rule fires, the prepositional phrase (PP) is extracted as
has-awarding-body. In addition, Crystal provides two values, Coverage and
Error. In this particular example they state that the rule covers five instances (one
incorrectly) in the corpus the rule was generated from.
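A hedged sketch (ours, not Crystal's implementation) of how such a rule might fire over a shallow-parsed sentence, represented as (constituent tag, text) pairs; the sentence itself is invented:

```python
# Fire the example rule on a shallow-parsed sentence. The constraints
# mirror the rule above: a passive verb phrase containing "been awarded",
# any object, and a PP starting with "by", extracted as has-awarding-body.
RULE = {"VB-P": "been awarded", "PP_prefix": "by"}

def fire(rule, constituents):
    tags = dict(constituents)
    if rule["VB-P"] not in tags.get("VB-P", ""):
        return None                       # passive verb constraint failed
    if "OBJ1" not in tags:
        return None                       # an object must be present
    pp = tags.get("PP", "")
    if not pp.startswith(rule["PP_prefix"]):
        return None                       # PP must start with "by"
    return pp                             # extracted as has-awarding-body

sentence = [("SUBJ", "The prize"), ("VB-P", "has been awarded"),
            ("OBJ1", "to the laureate"), ("PP", "by the foundation")]
print(fire(RULE, sentence))  # by the foundation
```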
• Assigning rule confidence values to extracted rules
Another step is assigning rule confidence values to the extracted rules.
Experimentation showed that some extraction rules learnt by Crystal are very weak
and therefore fire too often, while other rules might be overly specific.
In addition, previous experiments (Riloff, 1996) showed that precision improves if
those rules are manually removed. However, the goal here was to control this
automatically and to eliminate rules with low confidence. For this purpose,
Ontosophie attaches a confidence value to each rule. The rule confidence
expresses how sure the system is about the rule itself.
Ontosophie has two ways of computing the rule confidence value. The first method
uses the Coverage and Error values provided by Crystal. The confidence for rule ri
is computed as Cri = cri/nri = (Coverageri − Errorri)/Coverageri, where cri is the
number of times rule ri has fired correctly and nri is the number of times the rule
has fired in total. However, this does not distinguish between, for example, C2 =
(2 − 0)/2 and C10 = (10 − 0)/10, because C2 = C10 = 1.0. But C10 is more accurate
and has higher support, because in that case the rule fired correctly ten times out of
ten, while the other fired correctly only two times out of two. This is why
Ontosophie uses the Laplace Expected Error Estimate (Clark and Boswell, 1991),
defined as 1 − LaplaceAccuracy, where LaplaceAccuracy = (nc + 1)/(ntot + k), nc is
the number of examples in the predicted class covered by the rule, ntot is the total
number of examples covered by the rule and k is the number of classes in the
domain. Applying the Laplace accuracy, the confidence valuation is then
Cri = (cri + 1)/(nri + 2), with k = 2 because each rule deals with two classes: one,
the rule fires, and two, the rule does not fire.
One might note that if Cri = 0.5 the rule fires correctly as often as it does
incorrectly, and so it should be eliminated.
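The two confidence measures can be computed directly (a minimal sketch of ours, not Ontosophie code):

```python
def confidence_naive(coverage, error):
    """C = (Coverage - Error) / Coverage, as provided by Crystal."""
    return (coverage - error) / coverage

def confidence_laplace(correct, total):
    """Laplace-corrected confidence C = (c + 1) / (n + 2), with k = 2."""
    return (correct + 1) / (total + 2)

# The naive measure cannot tell a 2/2 rule from a 10/10 rule ...
assert confidence_naive(2, 0) == confidence_naive(10, 0) == 1.0
# ... while the Laplace correction favors the better-supported rule.
print(confidence_laplace(2, 2))    # 0.75
print(confidence_laplace(10, 10))  # 0.9166...
```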
The second method computes the confidence of each rule by the k-Fold Cross
Validation methodology (Mitchell, 1997) on the training set. At each run a new set
of extraction rules is generated by Crystal. This algorithm (Celjuska, 2004)
computes for each rule ri how many times it fired correctly (cri) and how many
times it fired in total (nri), performs merging of identical rules and assigns to each
rule a value xri that tells how many times the rule was merged. At each run, after
all the rules have been generated by Crystal, Ontosophie enters an evaluation state,
based on the extraction, in order to recognize whether an extracted entity is correct
or not.
3. Extraction and ontology population
The system extracts appropriate entities from an article and feed a newly
constructed instances in order to populate the ontology. The extraction is run class
by class. Firstly, a set of extraction rules for only one specific class from the
ontology is taken and only those rules are used for the extraction. The step is then
repeated for all the classes within the ontology and thus for each class the system
gets a set of entities.
It might happen that the extraction component extracts more than one value for a
given slot name. This collision has to be resolved.
If more than one rule extracts the same entity, the value confidence Cvalue is
computed as the maximum over the confidences of the rules that fired it. The same
applies to the slot confidence Cslot.
If more than one value was extracted for a slot, only the value with the highest
confidence is considered and pre-selected.
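A minimal sketch of this collision-resolution policy (identical extractions take the maximum rule confidence, and for each slot only the best value is pre-selected; the slot and value names below are invented for illustration):

```python
from collections import defaultdict

def resolve_collisions(extractions):
    """`extractions` holds (slot, value, confidence) triples from fired rules.
    Identical values keep the maximum confidence (Cvalue); for each slot only
    the value with the highest confidence is pre-selected."""
    best_value = defaultdict(float)
    for slot, value, conf in extractions:
        best_value[(slot, value)] = max(best_value[(slot, value)], conf)
    winners = {}
    for (slot, value), conf in best_value.items():
        if slot not in winners or conf > winners[slot][1]:
            winners[slot] = (value, conf)
    return winners

fired = [("location", "Haifa", 0.6),
         ("location", "Haifa", 0.8),    # same entity from two rules -> max
         ("location", "Salerno", 0.7)]
resolve_collisions(fired)  # {'location': ('Haifa', 0.8)}
```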
2.3 Lexical Acquisition for Ontology Population using
Keyphrase Extraction methodology
An important approach to lexical acquisition and ontology population is
Keyphrase Extraction (KE) (D’Avanzo, 2005). A keyphrase is a “textual unit
usually larger than a word but smaller than a full sentence” (Caropreso, 2002).
From an operative perspective, keyphrases are a useful way to succinctly
summarize and characterize documents, providing semantic metadata as well. The
term “syntactic phrase”, instead, denotes any phrase that is well formed according
to the grammar of the language under consideration, while a “statistical phrase is
any sequence of words that occurs contiguously in a text” (Caropreso, 2002). In
the following, different kinds of keyphrases are analyzed; depending on the
approach proposed, they range from statistically based keyphrases, such as those
used by Turney (2002) and Witten et al. (1999), to more linguistically based
keyphrases such as those proposed by Hulth (2003). Keyphrases do not only work
as brief summaries of a document’s contents, as Turney pointed out (Turney,
1999); they can also be used in information retrieval systems “as descriptions of
the document returned by a query, as the basis for search indexes, as a way of
browsing a collection, and as a document clustering technique” (Turney, 1997).
Even though keyphrases are widespread in the context of journal articles, many
other types of documents could benefit from their use, including Web pages, email
messages, news reports, magazine articles, and business papers. Although the
potential benefit of keyphrases is large, the vast majority of documents are
currently not furnished with keyphrases, because assigning them manually is
impractical.
2.3.1 Keyphrase Extraction Algorithms
A common approach shared by the systems described below concerns the pre-processing
stage (Turney, 1997, Jones et al. 2002). This step is directly related to
the choice of potential keyphrases and consists of input cleaning, phrase
identification, and stemming (Salton, 1988).
Pre-Processing Stage
It consists of the following steps:
• Input cleaning
• Phrase identification
• Case-folding and stemming
Input cleaning
The input stream (usually an ASCII file) is split into tokens (sequences of letters,
digits and internal periods), and then several modifications are made:
• punctuation marks, brackets, and numbers are replaced by phrase
boundaries;
• apostrophes are removed;
• hyphenated words are split in two;
• remaining non-token characters are deleted, as are any tokens that do not
contain letters.
The result is a set of lines, each a sequence of tokens containing at least one letter.
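The cleaning steps above might be approximated as follows (a rough sketch; the original tokenizer's exact handling of internal periods and non-token characters is not reproduced):

```python
import re

def clean_input(text: str):
    """Split a raw stream into lines of tokens: apostrophes are removed,
    hyphenated words are split, punctuation/brackets/numbers become phrase
    boundaries, and tokens without any letter are dropped."""
    text = text.replace("'", "")                 # drop apostrophes
    text = text.replace("-", " ")                # split hyphenated words
    # punctuation, brackets and standalone numbers -> phrase boundaries
    text = re.sub(r"[.,;:!?()\[\]{}]|\b\d+\b", "\n", text)
    lines = []
    for chunk in text.split("\n"):
        tokens = [t for t in chunk.split() if re.search("[a-zA-Z]", t)]
        if tokens:                               # keep lines with a letter token
            lines.append(tokens)
    return lines

clean_input("The energy-policy debate, in 2008, wasn't over.")
# [['The', 'energy', 'policy', 'debate'], ['in'], ['wasnt', 'over']]
```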
Part I – Phrase identification
In this stage a set of heuristically motivated rules identifies the phrases. The rules
used by the algorithms below are:
• Candidate phrases are limited to a certain maximum length (usually three
words).
• Candidate phrases cannot be proper names (i.e. single words that only ever
appear with an initial capital).
• Candidate phrases cannot begin or end with a stopword.
The stopword list contains 425 words in nine syntactic classes (conjunctions,
articles, particles, prepositions, pronouns, anomalous verbs, adjectives, and
adverbs). For most of these classes, all the words listed in an on-line dictionary
were added to the list. All contiguous sequences of words in each input line are
tested using the three rules above, yielding a set of candidate phrases.
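Candidate generation under these rules can be sketched as follows (the stoplist is a tiny illustrative subset, not the 425-word list; the proper-name rule, which needs capitalization statistics over the whole text, is omitted):

```python
STOPWORDS = {"the", "of", "a", "an", "in", "on", "and", "for", "to", "is"}

def candidate_phrases(tokens, max_len=3):
    """All contiguous sequences up to max_len words that neither begin nor
    end with a stopword."""
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
            phrase = tokens[i:j]
            if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
                continue
            candidates.append(" ".join(phrase))
    return candidates

candidate_phrases(["demand", "for", "energy"])
# ['demand', 'demand for energy', 'energy']
```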
Case-folding and stemming
An important step in determining candidate phrases is to case-fold all words and
stem them using the iterated Lovins method. This involves using the classic Lovins
stemmer (Lovins, 1968) to discard any suffix, and repeating the process on the
stem that remains until there is no further change. So, for example, the phrase cut
elimination becomes cut elim.
Stemming and case-folding allow us to treat different variations on a phrase as the
same thing. For example, proof net and proof nets are essentially the same, but
without stemming they would have to be treated as different phrases.
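The iterated stripping itself can be sketched with a toy suffix list (illustration only; the real Lovins stemmer defines a far larger, rule-conditioned suffix inventory):

```python
# Toy suffix list for illustration only; Lovins (1968) uses ~294 endings.
SUFFIXES = ["ination", "ations", "ation", "ings", "ing", "ions", "ion", "s"]

def strip_once(word: str) -> str:
    """Remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def iterated_stem(word: str) -> str:
    """Repeat suffix stripping until the stem no longer changes."""
    prev = None
    while word != prev:
        prev, word = word, strip_once(word.lower())
    return word

iterated_stem("elimination")  # 'elim', as in the cut elimination example
```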
Part II – Candidate Selection
Machine Learning (ML) is an area of research, spawning a number of different
problems and algorithms for their solutions. Algorithms vary in their goals, in the
available training data sets, and in the learning strategies and representation of data.
All of these algorithms, however, learn by searching through an n-dimensional
space of a given data set to find an acceptable generalization (Mitchell, 1997,
Quinlan, 1986).
The most fundamental machine-learning task is inductive machine learning, where
a generalization is obtained from a set of samples, and it is formalized using
different techniques and models. Inductive learning can be defined as “the process
of estimating an unknown input-output dependency or structure of a system, using
limited number of observations or measurements of inputs and outputs of the
system” (Kantardzic, 2003).
An inductive-learning machine tries to form generalization from particular, true
facts, known as the training data set. These generalizations are formalized as a set
of functions that approximate the system’s behavior, requiring a priori knowledge
in addition to data. All inductive-learning methods use a priori knowledge in the
form of the selected class of approximating functions of a learning machine
(Kantardzic, 2003, Mitchell, 1997).
In general, the learning machine is able to implement a set of functions f (X, w),
with w ∈ W, where X is an input, w is a parameter of the function, and W is a set
of abstract parameters used only to index the set of functions. In this formulation,
the set of functions implemented by the learning machine can be any set of
functions. Ideally, the choice of a set of approximating functions reflects a priori
knowledge about the system and its unknown dependencies. In practice, because
of the complex and often informal nature of a priori knowledge, specifying such
approximating functions may in many cases be difficult or impossible (D’Avanzo,
2005).
There are two common types of inductive-learning methods:
• supervised learning (or learning with a teacher)
• unsupervised learning (or learning without a teacher).
Supervised learning is used to estimate an unknown dependency from known
input-output samples. Supervised learning assumes the existence of a teacher,
fitness function or some other external method of estimating the proposed model.
The term “supervised” denotes that the output values for training samples are
known (i.e., provided by a “teacher”).
Conceptually speaking, the teacher has knowledge of the environment, with that
knowledge being represented by a set of input-output examples. The environment
with its characteristics and model is, however, unknown to the learning system.
KE algorithms discussed by D’Avanzo (2005) make use of a supervised learning
algorithm.
In the context of supervised learning, the KE task is treated as a classification
problem. In ML terms, this means that there exists a learning function that
classifies a data item into one of several predefined classes. The training stage
uses a set of training documents for which the authors’ keyphrases are known. For
each training document, candidate phrases are identified and their feature values
are calculated. Each phrase is then marked as a keyphrase or a non-keyphrase
using the actual keyphrases for that document. This binary feature is the class
feature used by the machine learning scheme.
In this approach D’Avanzo (2005) used Naïve Bayes as the learning device. It
learns two sets of numeric weights (TF × IDF and first occurrence) from the
discretized feature values, one set applying to positive examples (is a keyphrase)
and the other to negative instances (is not a keyphrase). In this way every new
sample, even without a known output (the class to which it belongs), may be
classified correctly.
The Bayesian method provides a principled way to incorporate external
information into the data-analysis process. This process starts with an already given
probability distribution for the analyzed data set. As this distribution is given
before any data is considered, it is called a prior distribution. The new data set
updates this prior distribution into a posterior distribution. The basic tool for this
updating is the Bayes Theorem. The Bayes Theorem is a theoretical model for a
statistical approach to inductive-inferencing classification problems.
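The prior-to-posterior update for one candidate phrase with discretized features can be sketched as follows (all probabilities below are invented for illustration, not taken from any trained model):

```python
from math import prod

def posterior(prior, likelihood, features):
    """Bayes update: P(class | features) is proportional to
    P(class) * product of P(feature | class), under the Naive Bayes
    conditional-independence assumption; results are normalized to sum to 1."""
    scores = {c: prior[c] * prod(likelihood[c][f] for f in features)
              for c in prior}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical discretized feature values for one candidate phrase
prior = {"keyphrase": 0.05, "not_keyphrase": 0.95}
likelihood = {
    "keyphrase":     {"tfidf=high": 0.6, "first_occ=early": 0.7},
    "not_keyphrase": {"tfidf=high": 0.1, "first_occ=early": 0.3},
}
posterior(prior, likelihood, ["tfidf=high", "first_occ=early"])
```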
TF × IDF
This feature compares the frequency of a phrase’s use in a particular document
with the frequency of that phrase in general use. General usage is represented by
document frequency, the number of documents containing the phrase in some
large corpus. A phrase’s document frequency indicates how common it is (and
rarer phrases are more likely to be keyphrases). KEA, for example, is a KE
algorithm that builds a document frequency file for this purpose using a corpus of
about 100 documents. Stemmed candidate phrases are generated from all
documents in this corpus.
The document frequency file stores each phrase and a count of the number of
documents in which it appears. With this file in hand, the TF × IDF for phrase P in
document D is:

TF × IDF (P, D) = freq(P, D) / size(D) × ( −log2 ( df(P) / N ) )

where:
• freq (P,D) is the number of times P occurs in D
• size (D) is the number of words in D
• df (P) is the number of documents containing P in the global corpus
• N is the size of the global corpus
The second term in the equation is the log of the probability that this phrase
appears in any document of the corpus (negated because the probability is less than
one). If the document is not part of the global corpus, df (P) and N are both
incremented by one before the term is evaluated, to simulate its appearance in the
corpus.
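Following the definition above (including the out-of-corpus adjustment to df(P) and N), the feature can be computed as:

```python
from math import log2

def tf_idf(freq_pd, size_d, df_p, n_docs, in_corpus=True):
    """TF x IDF for phrase P in document D, following the definition above."""
    if not in_corpus:          # simulate D's appearance in the global corpus
        df_p += 1
        n_docs += 1
    return (freq_pd / size_d) * -log2(df_p / n_docs)

# P occurs 3 times in a 300-word document and in 2 of 100 corpus documents:
tf_idf(3, 300, 2, 100)
```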
First occurrence
First occurrence is calculated as the number of words that precede the phrase’s
first appearance, divided by the number of words in the document.
The result is a number between 0 and 1 that represents how much of the document
precedes the phrase’s first appearance.
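This feature amounts to a short scan over the token sequence (a sketch; the word lists below are invented examples):

```python
def first_occurrence(words, phrase_words):
    """Fraction of the document preceding the phrase's first appearance."""
    n = len(phrase_words)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == phrase_words:
            return i / len(words)
    return None  # phrase does not occur in the document

first_occurrence(["energy", "policy", "drives", "energy", "demand"],
                 ["energy", "demand"])  # 3 / 5 = 0.6
```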
Discretization
Both features are real numbers and must be converted to nominal data for the
machine-learning scheme. During the training process, a discretization table for
each feature is derived from the training data. This table gives a set of numeric
ranges for each feature, and values are replaced by the range into which the value
falls. Discretization is accomplished using the supervised discretization method
described in Fayyad et al. (1993).
2.3.2 Relevant Keyphrase Extraction Algorithms
GenEx
Turney (1999) has been the pioneer in using the methodology based on the
supervised learning, with GenEx, an algorithm for KE. GenEx has two
components, the Genitor genetic algorithm and the Extractor keyphrase extraction
algorithm. Genitor’s main function is the tuning of the features of the C4.5
decision tree algorithm (Quinlan, 1986, 1983).
Extractor takes a document as input and produces a list of keyphrases as output.
Extractor has twelve parameters that determine how it processes the input text. In
GenEx, the parameters of Extractor are tuned by the Genitor genetic algorithm to
maximize performance (fitness) on training data. Genitor is used to tune Extractor,
but Genitor is no longer needed once the training process is over.
When the parameter values are known, Genitor is discarded. Thus the learning
system is called GenEx (Genitor plus Extractor) and the trained system is called
Extractor (GenEx minus Genitor).
The performance is measured by the number of matches between the machine-
generated phrases and the human-generated phrases. In particular a precision
measure is employed (the number of matches divided by the number of machine-
generated keyphrases), using a variety of cut-offs for the number of machine-
generated keyphrases. A human-generated keyphrase matches a machine-generated
keyphrase when they correspond to the same sequence of stems. A stem is what
remains when we remove the suffix from a word. By this definition, neural
networks matches neural network, but it does not match networks. The order in the
sequence is important, so helicopter skiing does not match skiing helicopter.
The experiments are based on five different document collections. For each
document, there is a target set of keyphrases, generated by hand. The average
precision obtained by Extractor on the five document collections is 29%. Turney
(1999) supported the usefulness of a human evaluation: “It is not obvious whether a
precision of, say, 29% for five phrases is good or bad. We believe that it is useful
to know that one algorithm has a precision of 29% (for a given corpus and a given
desired number of phrases) while another algorithm has a precision of 15% (for the
same corpus and the same number of phrases), but a precision of 29% has no
significance by itself. What we would really like to know is, what percentage of the
keyphrases generated by GenEx are acceptable to a human reader?”.
To this end, an on-line demonstration of GenEx has been created on the web. The
demonstration allows the user to enter any URL for processing. The software then
downloads the HTML at the given URL and sends it to the Extractor. The
keyphrases are shown to the user, who may then rate each keyphrase as good or
bad. Some or all keyphrases may be left unrated (in Turney, 1999 these are called
no opinion). The number of keyphrases is fixed at seven, to keep the interface
simple. Turney interprets the data as supporting the hypothesis that about 80% of
the keyphrases are acceptable (acceptable meaning not bad).
KEA
Witten et al. (1999) implemented their methodology in KEA, an algorithm for
automatically extracting keyphrases from text. KEA identifies candidate
keyphrases using the pre-processing as described above, calculates feature value
for each candidate, and uses a machine learning algorithm to predict which
candidates are good keyphrases.
The Naïve Bayes machine learning scheme first builds a prediction model using
training document with known keyphrases, and then uses the model to find
keyphrases in new documents. Two features are calculated for each candidate
phrase and used in training and extraction. They are: TF × IDF, a measure of
phrase’s frequency in a document compared to its rarity in general use; and first
occurrence, which is the distance into the document of the phrase’s first
appearance. KEA’s effectiveness has been assessed by counting the keyphrases that
were also chosen by the document’s author, when a fixed number of keyphrases are
extracted (the same measure used by Turney, 1999). The average precision is about
20%.
KPSpotter
KPSpotter is an algorithm implementing the methodology proposed by Song et al.
(2003). The algorithm employs Information Gain, a data mining measure
introduced in the ID3 algorithm (Quinlan, 1993), after classical pre-processing has
been applied.
In this sense KPSpotter presents some resemblances with Extractor; both
algorithms, in fact, use a learner belonging to the same family, that is, the decision
trees (Quinlan, 1996, 1986, 1993).
The outcomes of both the learning and extraction stages performed by KPSpotter
are stored in XML form. The same two features were selected for training and
extracting keyphrases as in KEA: TF × IDF and first occurrence. Moreover,
KPSpotter is able to process various types of input data, such as XML, HTML and
unstructured text, and to generate XML as output. For efficiency and performance
reasons, KPSpotter stores candidate keyphrases and their related information,
such as frequency and stemmed form, in an embedded database management
system. The performance of KPSpotter, like that of Extractor and KEA, was
measured by comparing the extracted keyphrases with those the author assigned.
To this end, the same training and test data employed in KEA’s assessment have
been used. Designers of KPSpotter argue that, according to their
preliminary experiments the quality of keyphrases extracted by their algorithm is
equivalent to KEA’s.
Hulth’s Approach
The algorithms analyzed so far share both the learning approach and the
pre-processing (with negligible differences). All systems use only a “shallow”
linguistic analysis (e.g. tokenization, stemming) (Salton, 1988). In the following,
D’Avanzo (2005) discusses the approach proposed by Hulth (2003), a keyphrase
extraction algorithm that exploits a supervised learning algorithm improved by
embedding more linguistic knowledge.
In her work Hulth (2003) tested three methodologies to find the candidate
phrases:
1. n-gram approach: in a manner similar to Turney (2000) and Frank et al.
(1999), all unigrams, bigrams, and trigrams were extracted. Thereafter a
stoplist was used, and all terms beginning or ending with a stopword were
removed. Finally all remaining tokens were stemmed using Porter’s
stemmer.
2. Chunking approach: a partial parser4 was used to select all NP-chunks from
the documents. This choice was motivated by inspecting manually assigned
keywords and observing that the vast majority turn out to be nouns or noun
phrases with adjectives. Hulth (2003) argued that this setting seems to
better capture the idea of keywords having a certain linguistic property.
3. Pattern approach: a set of Part of Speech tag patterns, 56 in total, was
defined, and all (part-of-speech tagged) words or sequences of words that
matched any of these were extracted. The patterns were the tag sequences
of the manually assigned keywords present in the training data that
occurred ten or more times.
Four features have been used in the experiments:
• Within-document frequency.
• Collection frequency.
• Relative position of the first occurrence (the proportion of the document
preceding the first occurrence).
• POS tag or tags assigned to the term by the same partial parser used for
finding the chunks and the tag patterns. When a term consists of several
tokens, the tags are treated like a sequence (e.g. an extracted phrase like
random JJ excitations NNS gets the atomic feature value JJ_NNS).
4 LT CHUNK, available at http://www.ltg.ed.c.uk/software/pos/index.html.
The learner used for the experiments is a rule induction5 algorithm, i.e. the model
constructed from the given examples consists of a set of rules. The measure used
to evaluate the results on the validation set was the F-score, which combines
precision (P) and recall (R):

F = ((β² + 1) × P × R) / (β² × P + R)

In this study, the main concern is the precision and the recall for the examples to
which the positive class has been assigned, i.e. how many of the suggested
keyphrases are correct (precision), and how many of the manually assigned
keyphrases are found (recall). As the proportion of correctly suggested
keyphrases is considered equally important as the proportion of terms assigned by
a professional indexer that were detected, β was assigned the value 1, thus giving
precision and recall equal weights.
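The F-score combining precision and recall, with β = 1 weighting them equally, can be sketched as:

```python
def f_score(precision, recall, beta=1.0):
    """F-score; beta = 1 weights precision and recall equally (F1)."""
    if precision == 0 and recall == 0:
        return 0.0               # avoid division by zero for empty results
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

f_score(0.25, 0.5)  # 2 * 0.25 * 0.5 / 0.75 = 1/3
```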
Considering first the term selection approaches, extracting NP-chunks gives better
precision, while extracting all words or sequences of words matching any of a set
of POS tag patterns gives higher recall compared to extracting n-grams
(D’Avanzo, Elia et al. 2007). Table 1 shows the results. The highest F-score is
obtained by one of the n-gram runs. The largest amount of assigned terms present
in the abstracts is assigned by the pattern approach without the tag feature. When
syntactic information is included as a feature (in the form of the POS tag(s)
assigned to the term), it is evident from the results discussed above that linguistic
information is beneficial, in this particular evaluation framework, for assigning an
acceptable number of terms per document, independent of what term selection
strategy is chosen (D’Avanzo, 2005).
5 http://www.compumine.com.
Table 1: Results obtained using different approaches. Adapted from Hulth (2003)
LinkIT
Finding potential terms by means of a PoS tagger is not a new approach. Evans et al.
(2000) used a linguistically-motivated technique for the recognition and grouping
of simplex noun phrases (SNPs) called LinkIT. The system has two key features:
1. it gathers minimal NPs, i.e. SNPs, as precisely and linguistically defined
and motivated;
2. it applies a refined set of postprocessing rules to these SNPs to link them
within a document.
The identification of SNPs is performed using a finite state machine compiled from
a regular expression grammar, and the process of ranking the candidate significant
topics uses frequency information that is gathered in a single pass through the
document. In evaluating the NP identification component of LinkIT it has been
found that it outperformed other NP chunkers in precision and recall. The system is
currently used in several applications, such as web page characterization and multi-
document summarization.
Barker’s Approach
Barker (2000) describes a system for choosing noun phrases from a document as
keyphrases. A noun phrase is chosen based on its length, its frequency and the
frequency of its head noun. Noun phrases are extracted from a text using a base
noun phrase skimmer and an off-the-shelf online dictionary. Experiments involving
human judges revealed the following results: the simple noun phrase-based system
performs roughly as well as a state-of-the-art, corpus-trained keyphrase Extractor;
ratings for individual keyphrases do not necessarily correlate with ratings for sets
of keyphrases for a document; agreement among unbiased judges on the keyphrase
rating task is poor.
LAKE
Linguistic Analysis based Knowledge Extractor (LAKE) (D’Avanzo, 2005) is a
keyphrase extraction system. LAKE is based on a supervised learning approach
that makes use of the linguistic processing of the documents. The system works as
follows: first, a set of linguistically motivated candidate phrases is identified. Then,
a learning device chooses the best phrases. Finally, keyphrases at the top of the
ranking are merged to form a summary, or, in general, an index, depending on the
application. Treating the automatic keyphrase extraction as a supervised machine
learning task means that a classifier is trained by using documents annotated with
known keyphrases. The trained model is subsequently applied to documents for
which no keyphrases are assigned: each defined term from these documents is
classified either as a keyphrase or as a non-keyphrase. Both processes choose a set
of candidate keyphrases (i.e. potential terms) from their input document, and then
calculate the values of the features for each candidate. Two important issues are
how to define the potential terms, and which features of these terms are considered
discriminative. Like KEA, it uses Naïve Bayes as the learning algorithm, and TF
× IDF term weighting scheme and the first occurrence of a phrase as features.
Unlike KEA and Extractor, LAKE chooses the candidate phrases using linguistic
knowledge (D’Avanzo, 2004, 2005).
The candidate phrases generated by LAKE are sequences of Part of Speech tags
containing Multiword expressions and Named Entities. These sequences are
defined as “patterns” and stored in a pattern database6; from there on, the main
work is done by the learning device. The linguistic database makes LAKE unique
in its category.
LAKE has three main components: Linguistic Pre-Processor, Candidate Phrase
Extractor and Candidate Phrase Scorer. LAKE accepts a document as input7.
The document is processed first by the Linguistic Pre-Processor, which tags the
whole document, identifying Named Entities and Multiwords as well. Then
candidate phrases are identified based on the pattern database (Candidate Phrase
Extractor module in figure 2). Table 2 contains examples of candidate phrases
identified with this procedure in the energy domain.
Table 2: Examples of types of phrases and their patterns
6 Patterns consist of sequences of Part of Speech tags containing Named Entities and Multiwords.
7 The system has been designed with different pre-processing modules allowing it to process
different formats: txt, html, xml.
Type of Phrase | Pattern | Example | Head unit
Bi-Gram | AN | affordable energy | N (energy)
Bi-Gram | NN | energy conservation | N (conservation)
Tri-Gram | NPN | demand for energy | N (demand)
Tri-Gram | ANN | fossil fuel dependency | N (dependency)
Tri-Gram | VPN | to invest in alternatives | V (to invest)
Tri-Gram | APN | vulnerable to shortages | A (vulnerable)
Tri-Gram | NCN | disruptions & vulnerability | N (distr. – vuln.)
Four-Gram | ANPN | wasteful use of resources | N (use)
Four-Gram | NPAN | dependence on foreign oil | N (dependence)
Four-Gram | ANNN | liquid transportation fuels crisis | N (crisis)
Four-Gram | AANN | unstable foreign oil supplier | N (supplier)
Four-Gram | NPNN | cost of oil dependence | N (cost)
Fifth-Gram | ANPNN | economic cost of oil dependence | N (cost)
Sixth-Gram | ANPNNN | macroeconomic cost of oil market disruptions | N (cost)
Seventh-Gram | NPNPDNN | speed of changes in the climate system | N (speed)
Tenth-Gram | DNPAANPNNN | The world’s second largest emitter of greenhouse gas emissions | N (emitter)
Up to this point the process is the same for the training and extraction stages. In
the training stage, however, the system is furnished with annotated documents8.
The Candidate Phrase Scorer module is equipped with a procedure which looks,
for each author-supplied keyphrase, for a candidate phrase that can be matched,
identifying positive and negative examples. The model that comes out of this step
is then used in the extraction stage. LAKE has been extended for multi-document
summarization purposes.
To this end the KE ability of the system has been exploited again, adding a
sentence extraction module able to extract a 250-word summary from a cluster of
documents. Once keyphrases have been extracted for each document, the module
uses a scoring mechanism to select the most representative keyphrases for the
whole cluster. Once this list is identified, the module selects the sentences which
contain these keyphrases.
8 For which keyphrases are supplied by the author of the document.
LAKE is based on three main components: the Linguistic Pre-Processor, the
Candidate Phrase Extractor and the Candidate Phrase Scorer. In the following
sections there is a detailed description of the system.
Linguistic Pre-Processor
Every document is analyzed by the Linguistic Pre-Processor in the following three
consecutive steps: Part of speech analysis, Multiword recognition and Named
Entity Recognition.
1. Part of Speech Tagger
Linguists usually group the words of a language into classes which show similar
syntactic behavior. These classes represent the parts of speech (POS), also known
as syntactic or grammatical categories. Three important parts of speech are noun
(N), verb (V), and adjective (A). Nouns typically refer to people, animals, concepts
and things. The verb is used to express the action in a sentence. Adjectives describe
properties of nouns. In the following there are two possible tagged sentences
associated with the ambiguous sentence about visiting aunts.
Visiting/ADJ aunts/N-Pl can/AUX be/V-inf a/DET-Indef nuisance/N-Sg
Visiting/V-Prog aunts/N-Pl can/AUX be/V-inf a/DET-Indef nuisance/N-Sg
In the first sentence, “visiting” is an adjective that modifies the subject “aunts”. In
the second sentence, it is a gerund that takes “aunts” as an object.
The example shows that words may be assigned multiple POS tags, and the role of
the tagger is to choose the correct one. In the “aunts” example there is not enough
information in the sentence to decide between the two tags.
There are two main approaches to POS tagging: rule-based and stochastic.
A rule-based tagger tries to apply some linguistic knowledge to rule out sequences
of tags that are syntactically incorrect.
A rule, for instance, may be like this:
If an unknown term is preceded by a determiner and followed by a noun, then label
it an adjective.
While some rule-based taggers are entirely hand-coded, others leverage training
procedures on tagged corpora (Brill, 1994).
Stochastic taggers, on the other hand, also rely on training data, but use frequency
information or probability to disambiguate tag assignments. In its simplest version
a stochastic tagger, for instance, disambiguates words based on the probability
that a word occurs with a particular tag. This probability is typically computed
from a training set, in which words and tags have already been matched by hand.
Taggers based on Hidden Markov Models or Maximum Entropy represent more
advanced stochastic versions (Charniak, 1993).
Figure 2. The system architecture
The POS tagger adopted by LAKE is the TreeTagger, developed at the University
of Stuttgart (Schmid, 1994). The TreeTagger uses a decision tree to obtain reliable
estimates of transition probabilities. It determines the appropriate size of the
context (number of words) used to estimate the transition probabilities.
For example, to find the probability of a noun appearing after a determiner
followed by an adjective, we first check whether the previous tag is ADJ; if yes,
we follow the “yes” branch and check whether the tag before that was a
determiner; if “yes” again, we reach a probability for this occurrence.
2. Multiwords Recognition
The task of Multiwords Recognition (MR) is strictly related to Automatic Term
Recognition (ATR). MR based on MultiWordNet is a linguistic approach;
however, statistical approaches coming from the areas of collocation extraction
and IR also exist. Researchers on MR and ATR seem to agree that multiword
terms are mainly noun phrases, but their opinions differ on the type of noun
phrases they actually extract. Most systems rely on syntactic criteria and do not
use any morphological processes. An exception is Damerau’s work (1993). Justeson et al.
(1995) work on noun phrases, mostly noun compounds, including compound
adjectives and verbs albeit in very small proportions. They use the following
regular expression for the extraction of noun phrases
((Adj|Noun)+|((Adj|Noun)*(Noun-Prep)?)(Adj|Noun)*)Noun
They incorporate the preposition of, showing however, that when of is included in
the regular expression, there is a significant drop on precision (this drop is too high
to justify the possible gains in recall). Their system does not allow any term
modification. Daille et al. (1994) also concentrate on noun phrases. Term
formation patterns for base Multi-Word Units (base-MWUs) consist mainly of
two elements (nouns, adjectives, verbs or adverbs).
The patterns for English are:
• Adj Noun
• Noun Noun
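Matching such term-formation patterns over a tagged token sequence can be sketched as follows (a simplified illustration assuming one-letter tags; the words and tags in the example are ours):

```python
import re

PATTERNS = ["AN", "NN"]  # Daille et al.'s base-MWU patterns for English

def match_patterns(tagged, patterns=PATTERNS):
    """Extract word sequences whose tag string matches a term-formation
    pattern; `tagged` is a list of (word, tag) pairs with one-letter tags."""
    tags = "".join(tag for _, tag in tagged)
    found = []
    for pat in patterns:
        # zero-width lookahead so that overlapping matches are found too
        for m in re.finditer(f"(?={re.escape(pat)})", tags):
            i = m.start()
            found.append(" ".join(w for w, _ in tagged[i:i + len(pat)]))
    return found

tagged = [("affordable", "A"), ("energy", "N"), ("conservation", "N")]
match_patterns(tagged)  # ['affordable energy', 'energy conservation']
```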
Bourigault (1992) also deals with noun phrases mainly consisting of adjectives and
nouns that can contain prepositions, and hardly any conjugated verbs. He argues
that terminological units obey specific rules of syntactic formation. His system
does not extract only terms. Dagan et al. (1994) claim that the extracted noun
phrases consist of one or more nouns that do not belong to a stoplist.
The frequency of occurrence of a potential multiword term is the most commonly
used statistic.
In LAKE, sequences of words that are considered single lexical units are detected
in the input document according to their presence in WordNet (Figenbaum et al.,
1963, Pianta et al., 2002). For instance, in the energy domain, the sequence fossil
fuels is transformed into the single token fossil fuel and the PoS tag found in
WordNet is assigned to it.
3. Named Entities Recognition
The task of Named Entity Recognition (NER) requires a program to process a text
and identify expressions that refer to people, places, companies, organizations,
products, and so forth. Thus the program should not merely identify the boundaries
of a naming expression, but also classify the expression, e.g., so that one knows
that Rome refers to a city and not a person.
NER is a subtask of Information Extraction. Different NER systems were
evaluated as part of the Sixth Message Understanding Conference in 19959. The
target language was English. The participating systems performed well. However,
many of them used language-specific resources for performing the task and it is
unknown how they would have performed on a language other than English
(Palmer et al., 1997). Since 1995 NER systems have been developed for some
European languages and a few Asian languages. Palmer et al. (1997) and
Cucerzan et al. (1999) are two seminal works that applied one NER system to
different languages.
9 http://www.cs.nyu.edu/cs/faculty/grishman/muc6.html
CoNLL-2002 [10] hosted a language-independent named entity recognition shared
task, where attention was concentrated on four types of named entities:
persons, locations, organizations, and names of miscellaneous entities that do not
belong to the previous three groups.

The participants, who were offered training and test data for at least two languages,
were asked to develop named-entity recognition systems including a machine
learning component. Among the twelve systems that took part in the
competition, many used a variety of machine learning techniques.
The system of Carreras et al. (2002), which used AdaBoost applied to
decision trees, outperformed all the other systems by a significant margin on both
the Spanish and the Dutch test data. Sixteen systems participated,
instead, in the CoNLL-2003 shared task. For English, the combined classifier of
Florian et al. (2003) achieved the highest overall F1 score.

An important feature of the best system, which other participants did not use, was
the inclusion of the output of two externally trained named entity recognizers in the
combination process. Florian et al. also obtained the highest F1 score on the
German data.

[10] CoNLL is the acronym of Conference on Natural Language Learning. See
http://www.cnts.ua.ac.be/conll2002/ for details.
For named entity recognition, D'Avanzo (2005) used LingPipe [11], a suite of Java
tools designed to perform linguistic analysis on natural language data. The tool
includes a statistical named-entity detector, a heuristic sentence boundary detector,
and a heuristic within-document co-reference resolution engine. Named entity
extraction models are included for English news and can be trained for other
languages and genres.

LingPipe was chosen, first of all, because its implementation allows it to be
easily integrated into the LAKE system. Besides, training named entity detection is
very fast: for example, training on CoNLL 2002 Spanish, consisting of 370K tokens
in a line-based format, took 23 seconds. The named entity recognizer, using the
English news model and given an array of tokens, produces the array of tags at
100K tokens/second.
Head Extraction
The linguistic principle of headedness (Arampatzis et al., 2000) claims that any
phrase has a single word as its head: the main verb in verb phrases, and the noun
in noun phrases. The other components of a phrase are modifiers. Following the
approach of Arampatzis et al. (2000), every phrase can be represented by a phrase
frame

PF = [h,m]

where the head h gives the central concept of the phrase and the modifiers m serve
to make it more precise. Following this principle, the system extracted the head of
each candidate phrase using the POS information.
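A head extractor in the PF = [h,m] spirit can be sketched as below; since the text only states that the head is found using POS information, the exact rule (last noun for noun phrases, first verb for verb phrases) is an illustrative simplification:

```python
def phrase_frame(tagged_phrase):
    """Split a POS-tagged phrase into PF = [h, m]: head and modifiers.
    Simplifying assumption: the head is the first verb (verb phrases)
    or, failing that, the last noun (noun phrases)."""
    nouns = [w for w, t in tagged_phrase if t.startswith("N")]
    verbs = [w for w, t in tagged_phrase if t.startswith("V")]
    if verbs:
        head = verbs[0]
    elif nouns:
        head = nouns[-1]
    else:
        head = tagged_phrase[-1][0]
    # Everything that is not the head is treated as a modifier.
    modifiers = [w for w, _ in tagged_phrase if w != head]
    return head, modifiers
```

For "fossil fuel dependency" this yields the head "dependency" with modifiers "fossil" and "fuel".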
[11] LingPipe is free, available at http://www.alias-i.com/lingpipe/index.html
Then, for each extracted head, the two features TF × IDF and first occurrence are
calculated. Afterward, the next step consists of either training or extraction,
depending on the task.
Candidate Phrase Extractor
Syntactic patterns that describe either a precise, well-defined entity or a concise
event/situation were selected as candidate phrases. Once all uni-grams, bi-grams,
tri-grams, and four-grams have been extracted by the linguistic pre-processor,
they are filtered with the patterns defined above.
Candidate Phrase Scorer
In this phase a score is assigned to each candidate phrase in order to rank it and to
allow the selection of the most appropriate phrases as representative of the original
text.
The score is based on a combination of TF × IDF and first occurrence. However,
since candidate phrases do not appear frequently enough in the collection, the
frequency of a candidate phrase in the whole collection is not, by itself,
significant. As learning algorithm, the system uses the Naïve Bayes classifier
provided by the WEKA package (Witten and Frank, 1999).
The model obtained is reused in the subsequent steps. When a new document or
corpus is ready, we use the pre-processor module to prepare the candidate phrases;
the model obtained during training is then used to score them. The pre-processing
part is the same in this case: using the trained model, we extract nouns and verbs
from the documents, and then we keep the candidate phrases containing them.
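The two features can be sketched as follows; the functions are illustrative and do not reproduce LAKE's exact formulas, which feed these features to a Naïve Bayes classifier rather than combining them by hand:

```python
import math

def tf_idf(phrase, document, collection):
    """TF x IDF of a candidate phrase: its frequency in the document,
    discounted by how many documents of the collection contain it."""
    tf = document.count(phrase) / max(len(document), 1)
    df = sum(1 for doc in collection if phrase in doc)
    return tf * math.log(len(collection) / df) if df else 0.0

def first_occurrence(phrase, document):
    """Relative position of the first occurrence (0 = document start)."""
    return document.index(phrase) / max(len(document), 1)
```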
2.4. Linguistic and Statistical Phrases: some remarks
In the past, many attempts have been made to use semantically richer features in
Information Retrieval tasks. In particular, a number of authors have investigated
the use of phrases, instead of or in addition to individual words, as features.
Caropreso et al. (2001) identified a number of advantages of using
statistical phrases with respect to syntactic ones:
• they may be recognized by means of more robust and less computationally
demanding algorithms;
• the effect of irrelevant syntactic variants can be factored out;
• uninteresting phrases (e.g. tall professor) tend to be filtered out from
interesting ones (e.g. associate professor);
• the use of lexical atoms, such as “hot dog”, to replace single words for
indexing would increase both precision and recall;
• the use of syntactic phrases, such as “junior college” to supplement single
words would increase precision without hurting recall;
• with phrases as index terms, a document that contains a phrase would
be ranked higher than a document that contains just its constituent words in
unrelated contexts.
Arampatzis et al. (2000) have performed an evaluation of a linguistically motivated
indexing model. The approach taken by the authors is based on part-of-speech
(POS) tagging and syntactic pattern matching. Different experiments have been
performed with a representation based on combinations of different POS
categories. These representations combine elements belonging to the category of
nouns with those of adjectives, verbs, and adverbs.
Table 3: Results of the experiments performed with part-of-speech tagging by Arampatzis et al.
(2000). In the last column the percentage of feature reduction with respect to the baseline is reported.
The different representational choices are compared to the baseline (i.e. using all
single words as index terms).
To evaluate the different indexing schemes the authors have measured the
performance in a text categorization task. The experimental system is based on the
vector space model, where terms are weighted in a TF × IDF fashion, and
classifiers are constructed automatically using Rocchio’s relevance feedback
method (Salton, 1988). The experiments have been performed on the Reuters-
21578 text categorization collection (Sanderson, 1994) using 90 out of 135
categories and all stemmed words as baseline.
Table 3 summarizes the results discussed below. The experiments with unstemmed,
stemmed and lemmatized words as index terms showed improvements in average
precision of less than 5%. The experiments based on indexing sets derived from
combinations of part-of-speech categories also presented improvements over
the baseline; moreover, a 20.8% reduction of the feature space was obtained.
The authors concluded that the union of nouns and adjectives performs best,
the addition of verbs worsens the performance, while adverbs do not make any
difference.
N-grams are another kind of index term. In their work, Meng et al. (2002) report
an attempt to improve categorization performance by automatically extracting and
using phrases, especially two-word phrases (hereafter bigrams). The extracted
bigrams have been used in addition to (and not in place of) single words.

The experiments have been performed on two test corpora: a collection of web
pages pointed to by the Yahoo! Science hierarchy and the Reuters-21578 corpus. In
the following we will focus only on the results regarding the Reuters-21578 corpus,
which is more similar to our experimental corpus.
To measure the performance, the authors used recall and precision. In
particular, they used the break-even point (BEP), the point where recall equals
precision, which is often used as a single summarizing measure for
comparing results. Moreover, the F1 measure was used for evaluating the
performance.
BEP increases in all categories under examination (12 out of 135), with the highest
value at 21.4%. However, the performance as measured by F1 was mixed: while
the largest improvement reached 27.1%, 5 out of 12 categories showed a drop.
Table 4 shows the recall and precision rates before and after adding bi-grams. The
number of bi-grams the algorithm finds is no more than 2% of the number of single
words, thus avoiding the problem of high dimensionality. Note that when bigrams
alone were used, precision decreased drastically, while recall increased
substantially.
Table 4: Recall and precision reported by Meng et al. (2002)
When both uni-grams and bi-grams were used, recall improved without a significant
decrease in precision. This means (D'Avanzo, 2005) that bigrams are very good at
identifying correct positives but, at the same time, introduce a large number of
false positives, suggesting that bigrams should be used as complements to the
unigrams.
Chapter 3
The Energy domain case study
The energy domain is a large and rather unstructured domain. Many geopolitical
and economic problems are caused by energy, and many authorities and
organizations treat energy as a problem, but the domain knowledge is very
fragmentary. It can be very difficult to model, and for this purpose we built an
ontology to support better knowledge management. We started by retrieving web
documents about "Energy", first using search engines and then using crawlers.
These documents allowed us to extract keyphrases used as concepts in order to
build the ontology, both manually and semi-automatically.
There are two main approaches to domain modeling (Poesio, 2005):
• Top down (Formalist)
One starts from a pre-existing class hierarchy in order to deduce the concepts
and relations belonging to the ontology, down to the lowest level of subclasses,
represented by instances, i.e. the simplest, non-complex and non-articulated
concepts.
• Bottom up (Empiricist, or corpus-based)
One starts from the bottom, analyzing a corpus (a collection of documents) and
extracting the concepts, relations and lexical resources useful to populate the
ontology, in order subsequently to induce the most generic superclasses.
Energy Domain Modeling is a "middle up-down" one because, on the one hand, we
were supported by an energy domain expert, Dr. Brenda Shaffer [12], who suggested
the most relevant superclasses on which to focus our subsequent research, and, on
the other hand, we extended the ontology taxonomy by collecting a large corpus of
texts and extracting concepts from them in order to populate the ontology.

[12] University of Haifa.
The statistics about the Energy Ontology are the following:
• Corpus: 200 documents
• Classes: 53 concepts
• Instances: 121 concepts
• Properties: 30 relations (inverse functional ones included)
The Energy Ontology is more developed in width than in depth, as shown in the
following figure.
Figure 3. Energy Ontology Classes
For instance (figure 4), in our ontology we can describe "Saudi Arabia" as an
instance of the "Country" class and "Oil" as an instance of the "Fossil Fuel" class.
We can then link these two instances using the relation "to be the major producer
of" in order to infer that "Saudi Arabia is one of the major producers of Oil".

In a preliminary phase we built the ontology manually; then, in a successive step,
we enriched it with new concepts, relations and lexical resources using automatic
systems. Below we describe the manual stage.
We used Protégé 3.3.1, an ontology editor developed at Stanford University, to
build the Energy Ontology. "Energy_Domain" is our superclass, a sort of root node
that contains all the other subclasses, but on the same hierarchy level there are
also two classes:
• Risks
• Solutions
They have been included at the same level because they are transversal across the
ontology. In this way we could create horizontal links between classes and
instances in the hierarchy tree: these links are properties, binary relations
expressed in natural language, like the one described in the previous example,
"to be one of the major producers of".
Figure 4. Example of relation between two instances.
CLASS: COUNTRY (instances: Saudi Arabia, Russia, etc.) --[to be one of the major producer of]--> CLASS: FOSSIL FUELS (instances: Oil, Natural Gas, etc.)
There are different concepts in Risks within the ontology, such as economic
problems, e.g. "Market Instabilities" or "Economic Cost of Oil"; geopolitical and
technical problems, e.g. "Dispute between Russia and Belarus about Natural Gas"
or "Dispute between Venezuela and U.S. about Oil"; or related topics like
"Terrorism", "Sabotage" and "Natural Disasters".
Regarding Solutions, instead, there are two subclasses containing "Long Term
Actions" and "Short Term Actions". In the latter, for example, we inserted
"Infrastructure Improvement", "Use of Renewable Sources", etc.
Every class or instance is linked to the others using properties (including inverse
functional ones, figure 5) in order to obtain a very dense network (figure 6),
useful to model the knowledge domain of energy and to allow us to state, for
example, that "Infrastructure Improvement involves Russia".
Figure 5. Example of inverse functional properties
INSTANCE: Infrastructure Improvement (CLASS: SOLUTIONS > LONG TERM ACTIONS) --[involves]--> INSTANCE: Russia (CLASS: COUNTRY); inverse property: is involved in
Figure 6. Energy Ontology Network
The Energy Domain Class consists of six classes (as illustrated by figure 7):
Figure 7.
1. Country
In this class we created two subclasses, OPEC Countries and NON-OPEC
Countries, for example:
• OPEC: Algeria, Angola, Indonesia, Iran, Iraq, Kuwait, Libya,
Nigeria, Qatar, Saudi Arabia, United Arab Emirates, Venezuela.
• NON-OPEC: Belarus, Brazil, Canada, China, Georgia, India,
Malaysia, Mexico, Norway, Russia, Turkmenistan, Ukraine, United
Kingdom, United States, Uzbekistan.
2. Energy Security
It has four main subclasses:
• Friendliness To The Environment;
• Reliability of Supply, which has two instances ("Disruption" and
"Vulnerability") and other subclasses related to "Cut-off in supply";
• Resources Affordability, which has two subclasses:
  - "Emergency Stocks", with "Strategic Petroleum Reserve" (SPR) as instance;
  - "Scarce Resource Affordability";
• Scarce Resources Dependency, which has a subclass (Fossil Fuel
Dependency) with two instances:
  - Natural Gas Dependency;
  - Oil Dependency.
Energy Security is the class on which we focused our experiments because it treats
important economic and geopolitical problems.
3. Energy Sources
We classified all energy types in this class, distinguishing three types of
sources: Primary and Secondary, subdivided in turn into Renewable/Non-
Renewable, and Nuclear. We chose this distinction because it is the most
widely used.
A primary source of energy is one that already exists in nature and can be
used directly, or converted or re-directed into a form of energy that satisfies
our needs. Inside Non-Renewable, for example, there are all the Fossil Fuels,
such as Anthracite, Bituminous Coal, Coal, Graphite, Lignite, Liquefied
Petroleum Gas, Natural Gas, Oil, Peat, Propane.
Inside Renewable, instead: Biomass, Corn, Geothermal, Photovoltaic,
Solar, Solid Waste, Sugar Beet, Sugar Cane, Thermal, Waste, Water, Wind,
Wood.
Secondary energy sources, such as electric power or refined fuels, do not
exist in nature but can be produced from the primary energy sources.
Secondary sources are important because they are frequently easier to use
than the primary sources from which they are derived. In this class there are
instances such as:
• Non-Renewable: Coke, Diesel Fuel, Gasoline, Hydrogen by Coal,
Hydrogen by Fossil Fuel, Hydrogen by Natural Gas;
• Renewable: Alcohol Fuel, Biofuel, Ethanol, Hydrogen by Biomass,
Hydrogen by Solar, Hydrogen by Water.
Within the Nuclear class we identified two subclasses:
• Nuclear Energy Proliferation;
• Nuclear Weapons Proliferation.
Electricity, for example, is the flow of electrical power or charge. It is a
secondary energy source, which means that we get it from the conversion of
other sources of energy, like Coal, Natural Gas, Oil, Nuclear Power and
other natural sources, which are called primary sources. The energy sources
we use to make electricity can be renewable or non-renewable, but
electricity itself is neither renewable nor non-renewable; it depends on the
transformation process (EIA, Energy Information Administration) [13].
A similar case of difficult classification concerns Hydrogen: it can be
classified as Non-Renewable when it is extracted from hydrocarbons and as
Renewable when it is obtained from renewable elements (water, solar…).
This is the largest class; in fact it has many properties useful to describe
concepts and their relations, the most relevant being, for example:
Coal (Sources class) causes (property) Changes in Climate (Environment
class) AND is used for (property) Cooking OR Heating (Residential class).
We can use the Boolean operators "AND" and "OR", for example, in order
to add more information or to create more horizontal links.
4. Energy Use
It contains five subclasses corresponding to the sectors into which the
domain literature divides society: Commercial, Electric, Industrial,
Residential, Transportation. For instance, in "Residential" we can have
"Cooking" and "Heating" as instances.

[13] http://www.eia.doe.gov/kids/energyfacts/sources/electricity.html
5. Environmental Consequences
We identified one subclass, "Changes in Climate", which includes many
aspects of environmental impact, each linked to the Energy Sources class
through properties such as "is caused by".
In general we can say that "Changes in Climate" "includes" (property)
many individuals: Flooding, Increased Rain, Global Warming, etc.; in this
manner we have linked a class with its instances. But an important
inference concerns the horizontal links between this class and that of
Energy Sources:

Changes in Climate is caused by: Natural Gas, Biomass, Coal, …

There are many other environmental aspects, for example:
Bird Flight Patterns, CO2 Emissions, Damage to Views, Deforestation,
Desertification, Droughts, Flooding, Global Warming, Greenhouse Gas
Emissions, Higher Global Temperatures, Increased Rains, Large Land Use,
Pollution, Radioactive Off-Scouring.
6. Infrastructure
It includes ten instances that represent different means of energy transport
or supply structures: Barge, Drilling Equipment, Heliostats, Methane
Pipelines, Oil Tanker, Pipelines, Refinery, Ship, Turbine, Train.
3.1 Lexical Acquisition for Ontology population
3.1.1 Ontological organization
There are several methods to organize words within a lexicon. In a dictionary, for
example, words are organized alphabetically to provide easy access, but the
alphabetical order is not very helpful for processing semantic properties; a more
intuitive way is to organize words according to their meaning. A classification
certainly better represents the structure of our knowledge in a specific domain and
is more adequate for semantic processing. Grouping and classifying words in an
ontology means identifying hyponyms and hypernyms (inverse relationships), i.e.,
respectively, more specific and more general terms. For example, in the energy
domain "oil" is a hyponym of "energy source" (its hypernym); there is an "IS A"
relation: "oil IS A type of energy source". Another important relation is called
meronymy; its opposite is holonymy. For each word (concept) we can link parts to
the whole using the relation "HAS A". In our case "economic cost of oil" means
"oil HAS AN economic cost". But we can create a double relation using the same
keyphrase, depending on the purpose: "cost of oil IS AN economic cost". The
relations explained above help us build a hierarchy tree, but we can also enrich a
domain ontology using relations that link concepts across different levels of the
hierarchy. Figure 8 shows hierarchy links (full arrows) and relation links (dashed
arrows), which together represent the semantic network.
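The IS-A and HAS-A organization can be sketched with two small relation tables; the entries are illustrative, drawn from the examples above:

```python
# Minimal taxonomy store: IS-A links child -> parent, HAS-A links whole -> parts.
ISA = {"oil": "fossil fuel", "fossil fuel": "energy source"}
HASA = {"oil": ["economic cost"]}

def hypernyms(concept):
    """Walk the IS-A links upward and return all hypernyms, nearest first."""
    chain = []
    while concept in ISA:
        concept = ISA[concept]
        chain.append(concept)
    return chain

def is_a(concept, ancestor):
    """True if `ancestor` is reachable from `concept` through IS-A links."""
    return ancestor in hypernyms(concept)
```

With these tables, "oil IS A type of energy source" follows transitively from "oil IS A fossil fuel" and "fossil fuel IS AN energy source".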
Figure 8. Semantic Network: Hierarchy (full arrows) and Relations (dashed arrows).
Ex. 1
“Russia IS A Country of Energy Domain and EXPORTS Gas that IS A type of
Energy Source”
Ex. 2
“Saudi Arabia IS A Country of Energy Domain and EXPORTS Oil that IS A type
of Energy Source”
The main units in these sentences are called head units, consisting of the verbs
"IS" and "EXPORTS", because they represent fixed elements, as shown below:
“Saudi Arabia/Russia EXPORTS Oil/Gas”
(Figure 8 depicts the tree Energy Domain > Countries (Russia, China, Saudi Arabia) and Energy Domain > Energy Sources (Oil, Gas), with "is a" hierarchy links and "exports" relation links from countries to sources.)
"EXPORTS" selects "Saudi Arabia" or "Russia" (as subject) on the left and "Oil" or
"Gas" (as object) on the right. Subject and object are grammatical categories, but
they can also be called semantic roles, respectively "Exporter" and
"Thing Exported". In our examples the role of "Exporter" is always filled by
a "Country", while that of "Thing Exported" by an "Energy Source". In this
way we obtain fixed linguistic distributions, with reference to the linguistic theory
of Operators and Arguments (Harris, 1970, 1976) and in accordance with the
selectional restrictions principle, which helps us reduce syntactic and semantic
ambiguity during parsing.
An example of exhaustive lexical organization: WordNet
An example of exhaustive lexical organization is WordNet (Miller, 1995;
Fellbaum, 1998), a lexical database of English. WordNet is built around a lexical
matrix (figure 9), with the word forms in the columns and the word meanings in
the rows: the forms that share a meaning define a synset, a set of synonyms, while
a form that appears under several meanings illustrates the polysemy problem.
                    Word forms
                F1      F2      …      Fn
Word meanings
   M1          E1,1    E1,2
   M2                  E2,2
   …
   Mm                                 Em,n
Figure 9. The lexical matrix (Miller et al. 1993)
F1 and F2 are synonyms (both have meaning M1) and F2 is polysemous (it has
meaning M1 and M2).
3.1.2 Parsing with Case Grammar
Some languages, like Latin, Russian and German, indicate grammatical functions in
a sentence by a set of inflections, the cases, also related to semantic or thematic
roles (Fillmore, 1968). Parsing with the case grammar formalism transforms a
sentence into a kind of logical form: the frame (Nugues, 2006). The predicate is the
main verb, and its arguments are the roles. The parsing process identifies nouns,
tenses, adverbs, etc. and then, according to the verb, distinguishes obligatory cases
from optional cases. Each case is assigned at most one value and is constrained by
an ontological type. Figure 10 shows an example using the following sentence:
“Saudi Arabia exports oil in Italy with pipelines”
Role                          Type          Value
Agent (exporter)              Obligatory    Saudi Arabia
Objective (thing exported)    Obligatory    oil
Where                         Optional      in Italy
How                           Optional      with pipelines

Figure 10.
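The frame of figure 10 can be derived from the sentence with a toy filler; the sketch below hard-codes the verb "exports" and recognizes the optional cases by their introducing prepositions, so the function, the role names and the preposition list are illustrative assumptions, not an actual case-grammar parser:

```python
def parse_export(tokens):
    """Fill the case frame of 'exports' from a flat token list:
    agent (obligatory, subject), objective (obligatory, object),
    where / how (optional, introduced by 'in' / 'with')."""
    v = tokens.index("exports")
    frame = {"agent": " ".join(tokens[:v])}
    rest = tokens[v + 1:]
    # Cut the remainder at the prepositions that introduce optional cases.
    cuts = [i for i, t in enumerate(rest) if t in ("in", "with")] + [len(rest)]
    frame["objective"] = " ".join(rest[:cuts[0]])
    for i in cuts[:-1]:
        role = "where" if rest[i] == "in" else "how"
        frame[role] = " ".join(rest[i + 1:min(c for c in cuts if c > i)])
    return frame
```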
An example of Case Grammar application: EVAR
The EVAR application (Mast et al., 1994) is a spoken dialogue system that answers
inquiries about train connections in Germany using a dedicated ontology. It uses a
partial parser to detect syntactic groups such as noun groups, preposition groups,
infinitive groups, verb groups, and time and date expressions.
EVAR Hello, this is the automatic Intercity information system
USER I want to go to Hamburg
EVAR You want to go to Hamburg
Where do you want to start?
USER From Erlangen
EVAR You want to start from Erlangen
Which day?
USER Tomorrow morning
EVAR You want to travel on Thursday 26th of June between 6 and 12 a.m.
USER Yes
EVAR To Hamburg you can start from Erlangen at 11.01, arrival in
Hamburg…
Shall I repeat?
USER No
EVAR Would you like more information?
USER No
EVAR Good bye
3.1.3 Lexical Acquisition Fundamentals
Collocations
A collocation is an expression consisting of two or more words that corresponds to
some conventional way of saying things (Manning and Schütze, 1999); it has the
characteristics of a syntactic and semantic unit, and its exact and unambiguous
meaning or connotation cannot be derived from the meaning or connotation of its
components (Choueka, 1988). Collocations include noun phrases ("fossil fuel"),
phrasal verbs ("to cut off (supply)") and other stock phrases ("disruption &
vulnerability").
Collocations are characterized by:
• Limited compositionality
An expression is compositional if its meaning can be predicted from the meaning
of its parts. For example, "acid rain" is a type of rain, but we must consider the
expression as a single word.
• Limited substitutability
We cannot substitute other words for the components of a collocation, even if
they are synonyms: "acid rain" does not mean the same as "sour rain"; the latter
is very unusual.
• Limited modifiability
Many collocations cannot be freely modified with additional lexical material or
grammatical transformations. "Strategic Petroleum Reserve" (also called by the
acronym SPR) is a strict collocation because we cannot modify, for example, the
adjective "strategic" into a relative clause ("Petroleum Reserve that is
Strategic"): the result is unusual even if its meaning does not change in the
context, and the existence of the acronym is a confirmation of this.
The rules shown above hold all the more strongly in a technical domain: during
the terminology extraction process, technical terms are more specific than in a
general language context, so every compound term is a collocation.
To identify collocations in a text or in a corpus, one needs to calculate the
frequency, counting how many times two (or more) words occur together.
In this way we obtain a list of Part of Speech (POS) patterns, ordered by
frequency (Table 5).
Table 5. Part of Speech patterns and frequencies

Type of phrase   Pattern       Example                                                          Frequency
Bi-gram          AN            affordable energy                                                10
                 NN            energy conservation                                              5
Tri-gram         NPN           demand for energy                                                7
                 ANN           fossil fuel dependency                                           13
                 VPN           to invest in alternatives                                        4
                 APN           vulnerable to shortages                                          3
                 NCN           disruptions & vulnerability                                      17
Four-gram        ANPN          wasteful use of resources                                        4
                 NPAN          dependence on foreign oil                                        8
                 NPNN          cost of oil dependence                                           5
Five-gram        ANPNN         economic cost of oil dependence                                  3
Six-gram         ANPNNN        macroeconomic cost of oil market disruptions                     3
Seven-gram       NPNPDNN       speed of changes in the climate system                           2
Ten-gram         DNPAANPNNN    the world's second largest emitter of greenhouse gas emissions   1
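The frequency count behind Table 5 can be sketched for bigrams as follows, assuming tokens tagged with one-letter POS codes (A = adjective, N = noun); the function name and default patterns are illustrative:

```python
from collections import Counter

def pattern_bigrams(tagged_tokens, patterns=("AN", "NN")):
    """Count adjacent word pairs whose POS-tag pattern is in `patterns`."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 + t2 in patterns:
            counts[f"{w1} {w2}"] += 1
    return counts
```

Run over a tagged corpus, this yields exactly the kind of frequency-ordered pattern list shown in the table; longer patterns (tri-grams and beyond) follow the same scheme with a wider window.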
The general goal of lexical acquisition is to develop algorithms and statistical
techniques for filling the holes in existing machine-readable dictionaries by looking
at the occurrence patterns of words in large text corpora.
First we have to define what a lexicon is: the part of the grammar of a language
which includes the lexical entries for all the words in the language and which may
also include other information (Manning and Schütze, 1999).
A lexicon is a kind of expanded dictionary formatted in a machine-readable format.
Traditional dictionaries are written for human usage, so quantitative information is
completely absent. An important task of lexical acquisition for Statistical Natural
Language Processing (NLP) is to augment dictionaries with quantitative
information (Manning and Schütze, 1999).
In the following pages we will illustrate several lexical acquisition problems besides
collocations: selectional preferences, subcategorization, and semantic
categorization.
1. Selectional preferences
Most verbs prefer specific arguments; for example, "export" takes an "exporter" as
subject and a "thing exported" as object (in our specific energy domain, exporter =
Country, thing exported = energy source). These semantic constraints are called
selectional preferences or restrictions. The acquisition of selectional preferences is
important in Statistical NLP for several reasons:
• If a new word is missing from our machine-readable dictionary, we can infer
part of its meaning from selectional restrictions.
For example:
"Saudi Arabia exports oil"
"Country A exports lignite"
if "lignite" is a new and unknown word but it occurs with the verb "export" in the
same context as "oil", then we can infer that "lignite" is an energy source just
like "oil".
• Selectional restrictions are very useful to rank the possible parses of a
sentence: we can give higher scores to parses where the verb has its usual
arguments than to atypical ones. Semantic regularities captured in
selectional preferences are often quite strong and can be acquired more
easily from corpora than other types of semantic information (such as word
meaning).
The selectional preference model proposed by Resnik (1996) uses two notions:
1. selectional preference strength,
which measures how strongly the verb constrains its argument;
2. selectional association,
which measures the association between a verb and a class of
correlated arguments. For example:
"Saudi Arabia exports oil"
"Country A exports competences"
"exports oil" is a different use with respect to "exports competences": in a
disambiguation process we associate "competence" with the class containing similar
nouns, in order to verify that "competence" does not belong to the same class as "oil"
(the energy source class).
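The association measure can be approximated as below; a simplified sketch that replaces Resnik's information-theoretic score with a plain conditional-probability ratio, over invented (verb, object-class) counts:

```python
from collections import Counter

# Observed (verb, object-class) pairs; counts and classes are invented.
OBSERVATIONS = ([("export", "energy_source")] * 8
                + [("export", "skill")] * 1
                + [("teach", "skill")] * 6)

def selectional_association(verb, cls, observations):
    """P(class | verb) / P(class): values above 1 mean the verb prefers
    the class. A crude stand-in for Resnik's KL-based association."""
    verb_counts = Counter(c for v, c in observations if v == verb)
    class_counts = Counter(c for _, c in observations)
    p_cls_given_verb = verb_counts[cls] / sum(verb_counts.values())
    p_cls = class_counts[cls] / len(observations)
    return p_cls_given_verb / p_cls
```

On these toy counts, "export" prefers the energy-source class over the skill class, mirroring the "exports oil" vs. "exports competences" contrast above.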
2. Verb subcategorization
The verb "export" has two arguments, "exporter" (subject) and "thing exported"
(object), which define the phrase structure Noun Phrase (NP) + Verb + Noun Phrase,
as in "Saudi Arabia exports oil".
A verb expresses its semantic arguments with different syntactic means. The set of
syntactic categories that a verb can appear with is called its subcategorization frame,
and it defines a phrase structure. For example:

Example                           Frame    Functions               Verb
"Saudi Arabia exports oil"        NP NP    subject, object         export
"China invests in alternatives"   NP NP    subject, indirect obj.  invest
…

Most dictionaries do not contain information on subcategorization frames, and the
information on most verbs is incomplete. A simple and effective algorithm for
learning some subcategorization frames was proposed by Brent (1993) and
implemented in a system called Lerner.
The system decides whether a verb V takes a frame F in a corpus in the
following two steps:
a. Cues
Define a regular pattern of words and syntactic categories which indicates the
presence of the frame with high certainty. For example, we can select the frame
"NP NP" (as shown above).
b. Hypothesis testing
Initially we assume the null hypothesis that the frame is not appropriate for the
verb; we reject it if the cue defined previously is identified in the corpus. The
system therefore analyzes the corpus, counting how many times the cue for
the frame occurs with the verb.
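The two Lerner steps can be sketched as follows, with the cue reduced to a tag sequence right after the verb and a simple count threshold standing in for Brent's statistical hypothesis test; this is an illustrative simplification, not Brent's actual procedure:

```python
def learns_frame(verb, cue, tagged_sentences, min_cues=2):
    """Null hypothesis: the verb does not take the frame. Reject it once
    the cue (a tuple of POS tags right after the verb) has been observed
    at least `min_cues` times in the corpus."""
    hits = 0
    for sent in tagged_sentences:
        for i, (word, _) in enumerate(sent):
            if word == verb:
                tags = tuple(t for _, t in sent[i + 1:i + 1 + len(cue)])
                hits += tags == cue
    return hits >= min_cues
```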
3. Semantic Categorization
Attachment Ambiguity
Often there are keyphrases that can be attached to two or more different elements in
the structure. A keyphrase, a sequence of two or more words, can have the left-
branching structure [(N N) N], as in [(transportation fuel) crisis], i.e. a "crisis"
concerning "transportation fuel", or the right-branching structure [N (N N)], as in
[woman (aid worker)], i.e. an "aid worker" who is a "woman".
Semantic Similarity
The greatest achievement of lexical acquisition would be to acquire meaning
automatically. In practice, however, much current work focuses on semantic
similarity, measuring how similar a new word is to a known word.
It is mostly used for generalization, under the assumption that semantically similar
words behave similarly (as shown above for the selectional preferences “Saudi Arabia
exports oil” and “Country A exports lignite”).
Another use of semantic similarity is class-based generalization, which considers the
whole class of elements of which the word of interest is most likely a member.
K nearest neighbors (KNN) classification, instead, uses semantic similarity in
order to classify a new element into a defined category. We first need a training set
of elements, each assigned to a category; the system then assigns a new element to
the category that is most prevalent among its k nearest neighbors.
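A minimal sketch of KNN classification driven by a semantic similarity measure; cosine similarity over context-count vectors is one common choice, and the training data below are invented for illustration.

```python
from collections import Counter
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(u[w] * v.get(w, 0) for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(new_vec, training, k=3):
    """Assign new_vec the category most prevalent among its k nearest
    neighbors, where nearness is semantic similarity (here: cosine)."""
    ranked = sorted(training, key=lambda ex: cosine(new_vec, ex[0]), reverse=True)
    votes = Counter(cat for _, cat in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy context vectors (word -> co-occurrence count); the data are invented:
training = [
    ({"export": 3, "barrel": 2}, "fossil fuel"),
    ({"export": 2, "pipeline": 1}, "fossil fuel"),
    ({"solar": 4, "turbine": 1}, "renewable"),
    ({"wind": 2, "turbine": 3}, "renewable"),
]
print(knn_classify({"export": 1, "barrel": 1}, training, k=3))  # fossil fuel
```

A new word whose contexts resemble those of “export”/“barrel” is assigned to the category of its most similar known neighbors.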
3.1.4 Lexical Acquisition Evaluation
An important recent development in NLP has been the use of rigorous standards for
the evaluation of NLP systems.
An evaluation typically uses two measures, precision and recall, computed against a
set of targets (documents for which the correct keyphrases are known).
The system then accepts or rejects the keyphrases it extracts from a selected set of
documents, matching the results obtained against the training set.
Precision measures the number of automatically extracted keyphrases that match
manually extracted ones, over the total number of automatically extracted
keyphrases: here 3/7. Recall measures the number of automatically extracted
keyphrases that match manually extracted ones, over the total number of manually
extracted keyphrases: here 3/10.
Figure 11 shows the process:
Figure 11.

Manually extracted (training set): affordable energy, demand for energy, energy
conservation, energy needs, environmental issues, fossil fuel dependency, fossil
fuel depletion, to invest in alternatives, oil depletion, peak oil

Automatically extracted: cut off in supply, demand for energy, economic cost of
oil, environmental issues, fossil fuel depletion, oil consumption, renewable energy
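Using the two keyphrase lists of Figure 11, the 3/7 and 3/10 figures can be checked directly:

```python
manual = {"affordable energy", "demand for energy", "energy conservation",
          "energy needs", "environmental issues", "fossil fuel dependency",
          "fossil fuel depletion", "to invest in alternatives",
          "oil depletion", "peak oil"}
automatic = {"cut off in supply", "demand for energy", "economic cost of oil",
             "environmental issues", "fossil fuel depletion",
             "oil consumption", "renewable energy"}

matches = manual & automatic                # keyphrases found by both
precision = len(matches) / len(automatic)   # 3/7
recall = len(matches) / len(manual)         # 3/10
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.43, recall = 0.30
```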
3.1.5 The role of lexical acquisition in statistical NLP
Lexical acquisition is very important in Statistical NLP for the following reasons:
1. cost of manually building lexical resources
Manual lexical analyses are more accurate, but they are expensive, and humans are
often bad at collecting quantitative information.
2. productivity of language
Natural languages are in a constant state of flux, adapting to a changing world
by creating names and words to refer to new things; lexical resources have to be
updated with these changes.
3. lexical coverage
Many dictionaries are incomplete because some categories of lexical entries, such
as proper nouns, foreign nouns, codes, abbreviations, etc., have poor coverage in
a common dictionary, so automatic acquisition is useful for updating dictionaries.
3.2 Experiments with Energy domain ontology
3.2.1 Methods and tools
The experiment described in this work consists of the following steps. We started
from a corpus of 200 documents about the Energy domain in order to manually build
a domain ontology. To be useful, an ontology needs to be continuously updated with
new concepts, relations and lexical resources, which are represented by keywords in
a document. Manual ontology building is more accurate than automatic building,
but it is also more expensive and time consuming, because for a large corpus of
texts it takes too much time to read all the documents and select keywords for
each of them.
We focused on one of the classes of the Energy Ontology, Energy Security,
selecting a subset of 50 documents; we then chose the 25 documents most relevant
to this topic, on which we based our experiment (figure 12).
Figure 12
The ontology population process consisted of three main steps:
1. Documents Retrieval
Generally, documents are recovered manually from the web by submitting queries to
a search engine, but in our case we used a crawler, an automatic agent that can be
configured with special queries and that searches for documents on the web
automatically. We compared three different crawlers, Infospiders, Best First and
SS Spider, on the Energy Security class, using queries of the form “class concept
+ subclass concepts” (for instance “Energy Security + Reliability of Supply”);
Infospiders was the best one, with the highest precision and recall.
2. Keywords Extraction from documents.
We read every document and manually extracted the significant keyphrases for
each of them. We selected only the compound words that are most representative
in a text and that allowed us to describe a whole document using only a
small collection of terms.
Figure 13 shows an example of a text about Energy Security from which we
extracted a keyphrases list.
Figure 13. Example of Keyphrases Extraction.
Reliance on foreign sources of energy and geopolitics There has certainly been a
recognition in recent months and years that energy security is a concern. Even US president
George Bush admitted during his 2006 State of the Union speech that “Keeping America
competitive requires affordable energy. And here we have a serious problem: America is
addicted to oil, which is often imported from unstable parts of the world. The best way to
break this addiction is through technology.”
…
The higher prices at petrol pumps in recent months may be a blessing in disguise if it makes
consumers also think more about energy conservation and alternatives, for the market may
respond to that.
…
Oil and other fossil fuel depletion Reliance on foreign sources of energy and geopolitics.
Energy needs and demands of growing countries such as China and India Economic
efficiency versus population growth Need to invest in alternatives to fossil fuels…
…
Oil and other fossil fuel depletion. Many fear that the world is quickly using up the vast but
finite amount of fossil fuels. Some fear we may have already peaked in fossil fuel extraction
and production. So much of the world relies on oil, for example, that if there has been a
peak, or if a peak is imminent, or even if a peak is some way off, it is surely environmentally,
geopolitically and economically sensible to be efficient in use and invest in alternatives.
Keyphrases: affordable energy, energy conservation, energy needs, to invest in
alternatives, fossil fuel depletion
We then tagged the obtained keyphrases, annotating them with semantic and
linguistic information such as parts of speech. This manual process is supervised
by a human, who applies his lexicographic expertise in a specific domain. He can
obviously extract a more accurate list of keywords than an automatic system, but
he cannot be as fast.
Manual extraction therefore presents some drawbacks:
• time consumption
• expensiveness
Figure 14 shows an example of PoS annotation using syntactic labels.
Figure 14.
In this stage we identified 322 keyphrases (see Appendix), which represent our
gold standard in the experiment because they were manually extracted; we then
trained LAKE with them in order to compare the manual results with the automatic
results.
3. Classification of Keyphrases as concepts in the ontology.
In the last stage we used the automatically extracted keywords to populate
the ontology. In this way we could add new concepts, to which we assigned
representative keywords, using two approaches with two different tools:
• Manual
Putting the extracted keyphrases into an ontology editor, Protégé in our
case, and creating classes, instances and properties in order to infer
hyponymy and hyperonymy relations and subsumption relations, and
linking these elements to form a semantic network.
• Semi-automatic
Using a tool like ONTOGEN, based on a machine learning approach, which
automatically extracts keywords from a large corpus of documents and is
trained to classify these keywords as concepts of the ontology. Ontogen
allowed us to create subsumption relations between the extracted concepts.
Lexical acquisition is a methodology that allows us to obtain lexical units in order
to describe a concept using lexical knowledge. In our experiment we considered
Keyphrase Extraction (KE) as a lexical acquisition technique. It is an automatic
method for extracting relevant keyphrases from a text, and we used it because it
allowed us to populate the ontology with new concepts, relations and lexical
resources.
This work focused on the use of a linguistically motivated methodology to identify
candidate keyphrases. After the identification of the linguistically motivated
candidate keyphrases, making use of named entities, multiwords and sequences of
PoS tags (referred to as patterns), the process continues by selecting the best
candidates by means of a learning device.
The methodology has been implemented in LAKE, a keyphrase extractor based on
a supervised learning approach (see the previous sections for more details).
In the former case, the focus was on bi-grams (for instance Named Entity + noun,
sequences of adjective+noun, etc.), while in the latter longer sequences of parts
of speech are considered, often containing verbal forms (for instance
noun+verb+adjective+noun). Sequences such as noun+adjective, which are not
allowed in English, were not taken into consideration, and patterns containing
punctuation were eliminated. A restricted number of PoS sequences were manually
selected as potentially significant for describing the setting, the protagonists
and the main events of a document. To this end, particular emphasis was given to
named entities and to proper and common names.
In a keyphrase the head unit is the most important element, because it represents
the concept on which the other components depend. For example, in “affordable
energy”, “energy” is the head unit because “affordable energy” is a type of
“energy”; likewise “energy conservation” denotes a type of “conservation”, and so
on. The head unit helps us to identify the possible co-occurrences and
transformations (Harris 1970) that can be used in a sentence starting from that
specific keyphrase.
Type of Phrase   Pattern   Example               Head unit
Bi-Gram          AN        affordable energy     N (energy)
                 NN        energy conservation   N (conservation)
It was decided to estimate the TF*IDF values using the head of the candidate
phrase, instead of the phrase itself, according to the principle of headedness
(Arampatzis, 2000): any phrase has a single word as its head. The head is the
main verb in the case of verb phrases, and a noun (the last noun before any
postmodifiers) in noun phrases. As the learning algorithm, we used the Naïve
Bayes classifier provided by the WEKA package (Witten and Frank, 1999).
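The headedness rule just stated (main verb for verb phrases, last noun before any postmodifiers for noun phrases) can be sketched as follows; the word/tag encoding is a simplified assumption for illustration, not LAKE's internal representation.

```python
def head_unit(words, pattern):
    """Return the head of a keyphrase from its PoS pattern string.

    Heuristic from the text: the head is the main verb in a verb phrase,
    otherwise the last noun before any postmodifier (here approximated as
    the part of the pattern before the first preposition P or conjunction C).
    """
    if "V" in pattern:                       # verbal keyphrase
        return words[pattern.index("V")]
    # cut the pattern at the first postmodifier marker, if any
    cut = min((pattern.index(c) for c in "PC" if c in pattern),
              default=len(pattern))
    nouns = [w for w, tag in zip(words[:cut], pattern[:cut]) if tag == "N"]
    return nouns[-1]

print(head_unit(["affordable", "energy"], "AN"))              # energy
print(head_unit(["demand", "for", "energy"], "NPN"))          # demand
print(head_unit(["to invest", "in", "alternatives"], "VPN"))  # to invest
```

The three examples reproduce the head units listed in the pattern tables (“energy”, “demand”, “to invest”).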
The classifier was trained on a corpus with the available keyphrases. Each of them
was marked as a positive example of a relevant keyphrase for a certain document if
it was present in the assessor’s judgment of that document; otherwise it was
marked as a negative example.
Then the two features (i.e. TF*IDF and first occurrence) were calculated for each
word. The classifier was trained on this material and returned a ranked word list
(e.g. energy, oil, exporter, etc.). The system automatically looks among the
candidate phrases for those containing these words, in our case affordable
energy, Arabian oil, main exporter, etc. The top candidate phrases matching the
word output of the classifier are kept.
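The final matching step can be sketched as below; the TF*IDF helper illustrates one of the two features, and all names and data here are illustrative assumptions, not LAKE's actual code.

```python
from math import log

def tf_idf(term, doc, corpus):
    """TF*IDF of a head word: one of the two features fed to the classifier.
    doc is a list of tokens; corpus is a list of such documents."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # document frequency
    return tf * log(len(corpus) / df)

def select_candidates(candidates, ranked_words, heads, top_n=3):
    """Keep the candidate phrases whose head appears among the top words
    of the ranked list returned by the classifier."""
    best = set(ranked_words[:top_n])
    return [c for c in candidates if heads[c] in best]

# Hypothetical classifier output and candidate phrases:
ranked = ["energy", "oil", "exporter", "market"]
candidates = ["affordable energy", "Arabian oil", "main exporter", "weak dollar"]
heads = {"affordable energy": "energy", "Arabian oil": "oil",
         "main exporter": "exporter", "weak dollar": "dollar"}
print(select_candidates(candidates, ranked, heads))
# ['affordable energy', 'Arabian oil', 'main exporter']
```

“weak dollar” is discarded because its head, “dollar”, is not among the top-ranked words.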
3.2.2 Results
In the experiment we used 25 documents for which the manually extracted
keyphrases were known, and we calculated the average precision and recall.
First, we trained the LAKE system on 20 documents, and then we used the
remaining 5 documents to test the system. The results were quite encouraging:
we obtained about 56% precision and 40% recall (figure 15).
Figure 15. Average Precision and Average Recall
In the experiment on the Energy Security class we obtained patterns ranging from
bi-grams to four-grams, as shown below:

Type of Phrase   Pattern   Example                              Head unit
Bi-Gram          AN        affordable energy                    N (energy)
                 NN        energy conservation                  N (conservation)
Tri-Gram         NPN       demand for energy                    N (demand)
                 ANN       fossil fuel dependency               N (dependency)
                 VPN       to invest in alternatives            V (to invest)
                 APN       vulnerable to shortages              A (vulnerable)
                 NCN       disruptions & vulnerability          N (distr. – vuln.)
Four-Gram        ANPN      wasteful use of resources            N (use)
                 NPAN      dependence on foreign oil            N (dependence)
                 ANNN      liquid transportation fuels crisis   N (crisis)
                 AANN      unstable foreign oil supplier        N (supplier)
                 NPNN      cost of oil dependence               N (cost)
Syntactic patterns have a twofold objective:
• focusing on bi-grams (for instance Named Entity + noun, sequences of
adjective+noun, etc.) to describe a precise and well-defined entity;
• considering longer sequences of PoS, often containing verbal forms (for
instance noun+verb+adjective+noun), to describe concise events/situations.
Once all the bi-grams, tri-grams and four-grams are extracted by the linguistic
pre-processor, they are filtered with the patterns defined above. The result of
this process is a set of keyphrases that may represent the current document.
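The filtering step can be sketched as follows, using a subset of the patterns from the tables above:

```python
# Allowed PoS patterns from the experiment's tables (a subset):
PATTERNS = {"AN", "NN",                      # bi-grams
            "NPN", "ANN", "VPN", "APN",      # tri-grams
            "ANPN", "NPAN", "ANNN", "AANN"}  # four-grams

def filter_ngrams(tagged_ngrams):
    """Keep only the n-grams whose PoS sequence matches a selected pattern."""
    kept = []
    for words, tags in tagged_ngrams:
        if "".join(tags) in PATTERNS:
            kept.append(" ".join(words))
    return kept

ngrams = [(("affordable", "energy"), ("A", "N")),
          (("demand", "for", "energy"), ("N", "P", "N")),
          (("energy", "affordable"), ("N", "A")),   # noun+adjective: rejected
          (("wasteful", "use", "of", "resources"), ("A", "N", "P", "N"))]
print(filter_ngrams(ngrams))
# ['affordable energy', 'demand for energy', 'wasteful use of resources']
```

The noun+adjective sequence is discarded because its pattern is not in the allowed set, mirroring the constraint described above.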
3.3 Discussion
In this study we presented a linguistically motivated text mining approach using,
mainly, lexical acquisition and keyphrase extraction methodologies in order to
populate the energy ontology, first manually and then automatically. In the manual
stage, we extracted 322 keyphrases from 25 documents; these are well defined
because supervised by the lexicographer's expertise. In the automatic stage we
used LAKE to extract keyphrases from the same documents. Finally, we compared the
two approaches, matching their results and calculating the average precision and
recall. The results are quite encouraging: precision measures 56% and recall 40%.
In more than half of the cases LAKE extracted relevant keywords, many of which
correspond to those chosen by the human. Furthermore, the most interesting results
concern some isolated keywords that were automatically extracted by the system and
deemed relevant, but were not covered by the lexicographer during the first manual
extraction step; they mainly concern the Energy domain in general rather than
Energy Security specifically. Since we manually extracted keyphrases from those 25
documents only for Energy Security, we can suppose that higher results could be
obtained by comparing the automatically extracted keyphrases with manually
extracted ones covering the general Energy domain.
Chapter 4
Conclusions
Ontology population is a serious problem, which we addressed using two main
approaches, manual and automatic. First we retrieved 200 documents from the web
using search engines and crawlers, and manually built the most important part of
the energy domain ontology. We compared three different crawlers, Infospiders,
Best First and SS Spider, on the Energy Security class, using queries of the form
“class concept + subclass concepts” (for instance “Energy Security + Reliability
of Supply”); Infospiders was the best one, with the highest precision and recall.
In a subsequent step we read most of the documents in order to manually build the
energy domain ontology. We then considered several automatic or semi-automatic
techniques for populating the ontology; in our experiment we used lexical
acquisition, training LAKE, a keyphrase extraction algorithm, with 25
linguistically annotated documents. Finally we compared manual versus automatic
ontology building, and the results were quite encouraging: the averages of the two
measures, precision and recall, were 56% and 40% respectively. Manual ontology
building is obviously more accurate, because the human is an expert lexicographer,
but it is also more time consuming and expensive, so an automatic or
semi-automatic approach can be very useful in some steps.
4.1 Future Works
In future work we will enlarge the corpus for the experiment to 200 or more
documents.
We then want to generate a Local Grammar for the energy domain ontology.
This is a bottom-up approach that could help us describe how language is used in a
specific domain. A statement in a Local Grammar is composed of lexical resources
and part-of-speech patterns (figure 16); for example, for Energy Security we have
keyphrases and their types of phrase (affordable energy, AN =
adjective+noun; reliability of supply, NPN = noun+preposition+noun; etc.), so we
can build not only a controlled vocabulary but also a better description of these
local syntactic constraints.
Figure 16. Example of Local Grammar
Appendix
Keyphrases manually extracted
affordable/energy,.AN:s+/N
air/pollution/stemming,.NNN:s-/N
Alaska's/Arctic/National/Wildlife/Refuge,.NPANAN:s-/N
alternative/energy/source,.ANN:s+/N
alternative/fuel,.AN:s+/N
alternative/supply,.AN:s+/N
American-made/car,.AAN:s+/N
atmospheric/GHG/concentration,.ANN:s+/N
atmospheric/greenhouse/gas/concentration,.ANNN:s+/N
availability/of/reliable/and/affordable/energy/supplies,.NPACANN:s+/N
availability/of/water,.APN:s-/N
cartel’s/market/share,.NPNN:s+/N
CECP/label,.NN:s+/N
Certification/Center/for/Energy/Conservation/Products,.NNPNNN:s-/N
China’s/crude/oil/import,.NPANN:s+/N
China’s/energy/consumption,.NPNN:s+/N
China’s/energy/demand,.NPNN:s+/N
China’s/expanding/search/for/oil/resources,.NPNNPNN:s-/N
China’s/growing/energy/security,.NPNNN:s-/N
China’s/growing/LNG/demand,.NPNNN:s-/N
China’s/impact/on/the/global/oil/markets,.NPNPDANN:s+/N
China’s/oil/demand/and/supply,.NPNNCN:s+/N
China’s/thirst/for/foreign/oil,.NPNPAN:s-/N
Chinese/economy/expansion,.ANN:s-/N
Chinese/energy/investment,.ANN:s+/N
clean/coal,.AN:s+/N
clean/energy/and/water/protection/in/China,.ANCNNPN:s+/N
clean/technology,.AN:s+/N
climate/change,.NN:s+/N
climate/change/strategy,.NNN:s+/N
climate/polluter,.NN:s+/N
coal/burning/for/energy,.NNPN:s-/N
commercial/stocks/decline,.ANN:s+/N
competition/for/fresh/water,.NPAN:s-/N
competition/for/resources,.NPN:s+/N
construction/of/a/heavywater/production/plant,.NPDNNN:s+/N
conventional/use/of/coal,.ANPN:s+/N
cooperation/or/competition/for/energy,.NCNPN:s-/N
cooperative/security,.AN:s-/N
cost/of/oil/dependence,.NPNN:s+/N
cut/in/oil/supply,.NPNN:s+/N
degree/of/import/concentration,.APNN:s+/N
demand/for/energy,.NPN:s-/N
demand/for/water,.NPN:s+/N
demand/growth,.NN:s-/N
dependence/on/foreign/oil,.NPAN:s-/N
dependence/on/gas,.NPN:s-/N
dependence/on/imports,.NPN:s-/N
dependence/on/oil,.NPN:s-/N
dependency/on/oil/imports,.NPNN:s-/N
dependent/on/the/Middle/East/for/oil,.NPDANPN:s+/A
destruction/of/water/ecosystems,.NPNN:s+/N
development/of/nuclear/technology,.NPAN:s+/N
disruption/in/gas/supplies,.NPNN:s+/N
disruption/in/oil/supplies,.NPNN:s+/N
disruption/of/energy/supplies,.NPNN:s+/N
disruption/of/Venezeluan/oil/supplies,.NPNNN:s+/N
disruptions/&/vulnerability,.NCN:s-/N
disruptions/and/vulnerability,.NCN:s-/N
distribution/&/transmission/annual/mileage,.NCNAN:s+/N
distribution/and/transmission/annual/mileage,.NCNAN:s+/N
diverse/supply/of/reliable,.ANPN:s+/N
diversification/of/energy,.NPN:s+/N
diversification/of/energy/sources,.NPNN:s-/N
diversification/project,.NN:s+/N
diversification/source,.AN:s+/N
domestic/alter/supply,.AAN:s+/N
domestic/alternative/to/LNG,.ANPN:s+/N
domestic/oil/production,.ANN:s+/N
draw/rate/capability,.VNN:s+/N
drawdown/coordination,.NN:s-/N
echologic/vehicle,.AN:s+/N
ecological/problem,.AN:s+/N
economic/cost/of/oil/dependence,.ANPNN:s+/N
economic/dependence,.AN:s-/N
economy/stimulation,.NN:s+/N
efficient/fuel,.AN:s+/N
Efficient/Lighting/Initiative,.ANN:s-/N
electricity/shortage,.NN:s+/N
electricity/supply,.NN:s+/N
emergency/preparedness,.NN:s-/N
emergency/stock,.NN:s+/N
emergency/system,.NN:s+/N
energy/conservation,.NN:s-/N
energy/consumption,.NN:s+/N
energy/cooperation,.NN:s-/N
energy/crisis,.NN:s+/N
energy/diversification/and/development,.NNCN:s+/N
energy/efficiency,.NN:s-/N
energy/efficiency/labeling,.NNN:s-/N
energy/efficiency/product,.NNN:s+/N
energy/independence,.NN:s-/N
energy/independence,.NN:s-/N
energy/independence,.NN:s-/N
energy/indipendence,.NN:s-/N
energy/insecurity,.NN:s-/N
energy/interdependence,.NN:s-/N
energy/interest,.NN:s+/N
energy/policy,.NN:s+/N
energy/price,.NN:s+/N
energy/price/fluctuation,.NNN:s+/N
energy/security,.NN:s-/N
energy/security/measure,.NNN:s+/N
energy/security/solution,.NNN:s+/N
energy/shortage,.NN:s+/N
energy/strategy,.NN:s+/N
energy/supplier,.NN:s+/N
energy/supply,.NN:s+/N
energy/technology,.NN:s+/N
energy-dependent/state,.NAN:s+/N
energy-hungry/economy,.NAN:s+/N
energy-transit/monopoly,.NNN:s+/N
environment/preservation,.NN:s-/N
environmental/and/social/impact,.ACAN:s+/N
environmental/degradation,.AN:s-/N
environmental/degradation,.AN:s-/N
environmental/impact/of/electricity/generation,.ANPNN:s+/N
environmental/issue,.AN:s+/N
environmental/regulation,.AN:s+/N
environmental/repercussion,.AN:s+/N
environmentally/energy,.AN:s+/N
foreign/policy,.AN:s+/N
fossil/fuel/dependency,.ANN:s-/N
fossil/fuel/depletion,.ANN:s-/N
fuel/efficiency,.NN:s+/N
fuel-efficient/motor/oil,.NANN:s+/N
fuels/crisis,.NN:s+/N
fuels/independence,.NN:s-/N
gas/distribution,.NN:s-/N
gas/gathering,.NN:s-/N
gas/transmission/offshore,.NNA:s-/N
gas/transmission/onshore,.NNA:s-/N
geopolitical/change,.AN:s+/N
geopolitical/crisis,.AN:s+/N
geopolitical/leverage,.AN:s+/N
geopolitical/risk,.AN:s+/N
geopolitical/security,.AN:s-/N
global/environment,.AN:s+/N
Global/Environment/Facility,.ANN:s-/N
global/insecurity,.AN:s-/N
global/oil/consumption,.ANN:s-/N
global/oil/demand/growth,.ANNN:s-/N
global/oil/trade,.ANN:s+/N
global/supply/of/oil,.ANPN:s-/N
global/temperature/rise,.ANN:s+/N
global/warming,.AN:s-/N
global/warming,.AN:s-/N
global/warming/emission,.ANN:s+/N
green/labeling,.AN:s-/N
growing/demand/for/energy,.ANPN:s+/N
hazardous/liquid/offshore,.ANA:s+/N
hazardous/liquid/onshore,.ANA:s+/N
hight/energy/price,.ANN:s+/N
hight/energy/price,.NNN:s+/N
hydrogen/fuel/cell,.NNN:s+/N
importation/of/nuclear/material,.NPAN:s+/N
improving/fuel/economy,.ANN:s+/N
invest/in/alternatives,.VPN/V
investment/strategy,.NN:s+/N
Iran’s/civilian/nuclear/energy/infrastructure,.NPAANN:s+/N
Iran’s/nuclear/aspiration,.NPAN:s+/N
Iran’s/nuclear/program,.NPAN:s+/N
James/Schlesinger,.NN:s-/N
liquid/accident,.AN:s+/N
liquid/fuels/independence,.ANN:s-/N
liquid/transportation/fuels/crisis,.ANNN:s+/N
Liquified/Natural/Gas/dependence,.AANN:s-/N
Liquified/Natural/Gas/terminal,.AANN:s+/N
LNG/dependence,.NN:s-/N
LNG/supply/chain,.NNN:s+/N
LNG/terminal,.NN:s+/N
longer-term/security,.ANN:s-/N
long-term/energy/policy,.ANNN:s+/N
macroeconomic/cost/of/oil/market/disruptions,.ANPNNN:s+/N
main/oil/nigerian/city,.ANAN:s+/N
military/expenditure,.AN:s+/N
military/implications,.AN:p-/N
mismanagement/and/hydropower/construction,.NCNN:s+/N
monopolistic/cartel,.AN:s+/N
national/security/implication,.ANN:s+/N
national/security/implication,.ANN:s+/N
natural/gas/transit/monopoly,.ANNN:s+/N
Net/Benefits/of/Stockpile/Expansion,.ANPNN:p-/N
Nigeria/violence,.NN:s-/N
no/energy/indipendence,.CNN:s-/N
nonproliferation/norm,.NN:s+/N
Nuclear/Non-Proliferation/Treaty,.ACNN:s-/N
nuclear/proliferation/in/Iran,.ANPN:s-/N
OECD/country,.NN:s+/N
oil/consumer,.NN:s+/N
oil/consumption,.NN:s+/N
oil/cutoff,.NN:s+/N
oil/demand,.NN:s+/N
oil/dependence,.NN:s-/N
oil/dependence/problem,.NNN:s+/N
oil/depletion,.NN:s-/N
oil/disruption,.NN:s+/N
oil/import,.NN:s+/N
oil/independence,.NN:s-/N
oil/market,.NN:s-/N
oil/market/condition,.NNN:s+/N
oil/market/volatility,.NNN:s-/N
oil/peak,.NN:s+/N
oil/price,.NN:s+/N
oil/price/impact,.NNN:s+/N
oil/price/stabilization,.NNN:s-/N
oil/price/volatility,.NNN:s+/N
oil/producer,.NN:s+/N
oil/production,.NN:s+/N
oil/profit,.NN:s+/N
oil/recovery/program,.NNN:s+/N
oil/reserve,.NN:s+/N
oil/security/project,.NNN:s+/N
oil/security/resource,.NNN:s+/N
oil/shock,.NN:s+/N
oil/stockpiling,.NN:s+/N
oil/stockpilings,oil/stockpiling.NN:p+/N
oil/supplier,.NN:s+/N
oil/supply,.NN:s+/N
oil/supply/disruption,.NNN:s+/N
oil/use,.NN:s+/N
oil/use/reduction,.NNN:s+/N
oil-driven/world/economy,.ANN:s+/N
Opec/cartel,.NN:s+/N
OPEC/member,.NN:s+/N
Opec/price/manipulation,.NNN:s+/N
Pakistan/instability,.NN:s-/N
partial/monopoly,.AN:s-/N
peak/oil,.NN:s-/N
peak/oil,.NN:s+/N
petroleum/dependence,.NN:s-/N
petroleum/import,.NN:s+/N
petroleum/security,.NN:s-/N
petroleum/stock,.NN:s+/N
petroleum/supply,.NN:s+/N
petroleum/vulnerability,.NN:s+/N
pipeline/incident,.NN:s+/N
pipeline/operator,.NN:s+/N
pipeline/safety,.NN:s-/N
pipeline/safety/program,.NNN:s+/N
pipeline/system,.NN:s+/N
policy/and/decision-making/process,.NCNVN:s+/N
policy/coordination,.NN:s-/N
policy/measure,.NN:s+/N
political/feasibility,.AN:s-/N
political/promise,.AN:s+/N
potential/cost/of/oil/dependence,.ANPNN:s+/N
potential/energy/crisis,.ANN:s+/N
promoting/alternative,.VN/V
public/transportation,.AN:s+/N
recession/fear,.NN:s+/N
regional/security,.AN:s-/N
reliable/energy,.AN:s+/N
reliable/supply,.AN:s+/N
reliance/on/foreign/sources/of/energy,.NPANPN:s-/N
renewable/energy/source,.ANN:s+/N
reserve/size,.NN:s+/N
resource/depletion,.NN:s-/N
resource/scarcity,.NN:s-/N
rising/energy/price,.ANN:s+/N
rising/terrorism,.AN:S-/N
risk/of/nuclear/proliferation,.NPAN:s+/N
sea/lane,.NN:s+/N
second/oil/imports/availability,.ANNN:s-/N
secure/supply/of/oil/and/gas,.ANPNCN:s+/N
security/cooperation,.NN:s-/N
security/of/supply,.NPN:s-/N
security/of/transit,.NPN:s-/N
security-environment/coalition,.NNN:s+/N
shortage/of/energy/supplies,.NPNN:s+/N
short-term/stability,.ANN:s-/N
Sino-U.S./energy/relation,.ANN:s+/N
small/number/of/supplies,.ANPN:s+/N
social/and/environmental/implication/of/the/pipeline’s/development,.ACANPDNPN:s+/N
sourcing/diversification,.NN:s+/N
speed/of/changes/in/the/climate/system,.NPNPDNN:s-/N
SPR,.N:s+/N
stability/of/nations/that/supply/energy,.NPNPVN:s-/N
stable/supply,.AN:s+/N
stockpile/expansion,.NN:s-/N
stockpile/size,.NN:s+/N
stockpile/use,.NN:s+/N
stockpiling/management,.NN:s-/N
strategic/fuel,.AN:s+/N
strategic/oil/reserve,.ANN:s+/N
strategic/oil/stockpiling,.ANN:s+/N
Strategic/Petroleum/Reserve,.ANN:s+/N
strategic/reserve,.AN:s+/N
strategic/reserves,strategic/reserve.AN:p+/N
supply/disruption,.NN:s+/N
supply/instability,.NN:s+/N
supporting/dictatorship,.AN:s+/N
sustainable/development/goal,.ANN:s+/N
sustainable/energy/future,.ANN:s-/N
technological/improvement,.AN:s+/N
technological/process,.AN:s+/N
the/largest/consumer/of/coal,.DANPN:s+/N
the/world’s/second/largest/emitter/of/greenhouse/gas/emissions,.DNPAANPNNN:s-/N
to diversify/supply,.VN/V
to/develop/cleaner/energy/resources,.PVANN/V
to/increase/use/of/gas,.PVNPN/V
transfer/of/wealth,.NPN:s-/N
United/States/oil/consumption,.ANNN:s-/N
unstable/foreign/oil/supplier,.AANN:s+/N
uranium/enrichment,.NN:s+/N
US/dependence/on/petroleum/inports,.ANNPNN:s-/N
vehicle/technology,.NN:s+N
vulnerable/economy,.AN:s+/N
vulnerable/energy/infrastructure,.ANN:s+/N
vulnerable/to/disruptions,.APN:s+/N
vulnerable/to/shortages,.APN:s+/A
vulnerable/to/shortages,.APN:s+/A
wasteful/use/of/resources,.ANPN:s+/N
water/resources/control,.NNN:s+/N
water/security,.NN:s-/N
water/security/crisis,.NNN:s+/N
water/shortage,.NN:s+/N
water/supply,.NN:s+/N
weak/dollar,.AN:s-/N
West-East/natural/gas/pipeline/project,.AANNN:s+/N
West-East/pipeline/project,.ANN:s+/N
world/oil/market,.NNN:s+/N
world/oil/price,.NNN:s+/N
Bibliography
Alani H., Kim S., Millard D. E., Weal M. J., Hall W., Lewis P. H. and Shadbolt N.
R., Automatic Ontology-Based Knowledge Extraction from Web Documents,
University of Southampton, 2003
Allemang D. Ontologies, Reuse and Domain Analysis, TopQuadrant, Inc., 2006
Arampatzis A., van der Weide T., Koster C. and van Bommel P. An evaluation of
linguistically-motivated indexing schemes. In Proceedings of the BCSIRSG
’2000, 2000
Arampatzis A., van der Weide T., Koster C. and van Bommel P. Linguistically
motivated information retrieval. In Allen Kent, editor, Encyclopedia of
Library and Information Science, volume 69. Marcel Dekker, Inc., New
York, Basel, December 2000. To appear. Currently available on-line from
http://www.cs.kun.nl/ avgerino/encyclopTR.ps.Z
Barker K. and Cornacchia N. Using noun phrase heads to extract document
keyphrases. In Proceedings of the Thirteenth Canadian Conference on
Artificial Intelligence, pages 40–52, 2000
Berners-Lee T.J., Hendler J., Lassila O. The Semantic Web, Scientific American,
May 2001, pp. 28-37
http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html
Berners-Lee T.J., Cailliau R., Groff J.F. The world-wide web, In Congrès Joint
European networking conference No3, Innsbruck, AUTRICHE 1992, vol. 25,
no 4-5 (270 p.) (6 ref.), pp. 454-459, 1992
Bourigault D. Surface grammatical analysis for the extraction of terminological
noun phrases. In Proceedings of COLING 92, 1992
Brambring M.. Mobility and orientation processes of the blind. In D. H. Warren
and E. R. Strelow, editors, Electronic Spatial Sensing for the Blind, pages
493–508, USA, 1984. Dordrecht, Lancaster, Nijhoff
Brent M.R. From grammar to lexicon: Unsupervised learning of lexical syntax.
Computational Linguistics 19:243-262, 1993
Brill E. Some advances in transformation-based part of speech tagging. In National
Conference on Artificial Intelligence, 1994
Caropreso M.F., Matwin S. and Sebastiani F. A learner-independent evaluation of
the usefulness of statistical phrases for automated text categorization. In
Amita G. Chin, editor, Text Databases and Document Management: Theory
and Practice, pages 78–102. Idea Group Publishing, Hershey, US, 2001
Carreras X., Màrques L. and Padrò L. Named entity extraction using adaboost. In
Proceedings of CoNLL-2002, pages 167–170. Taipei, Taiwan, 2002
Celjuska D. Semi-automatic Construction of Ontologies from Text. Master’s Thesis,
Department of Artificial Intelligence and Cybernetics, Technical University
Kosice, 2004
Celjuska D., Vargas M. Ontosophie: A Semi-Automatic System for Ontology
Population from Text, KMi - Knowledge Media Institute, The Open
University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom, 2004
Charniak E. Statistical Language Learning. MIT Press, 1993
Chieko A. and Lewis C. Home page reader: IBM’s talking web browser. In Closing
the Gap Conference Proceedings, 1998
Choueka Y. Looking for needles in a haystack or locating interesting collocational
expressions in large textual database. In Proceedings of the RIAO, pp. 38-43,
1998
Church K. and Hanks P. Word association, norms, mutual information, and
lexicography. Computational Linguistics, 16(1), 1990
Clark P. and Boswell R. Rule induction with CN2: Some recent improvements. In
Proceedings of the 5th European Working Sessions on Learning, pages 151-
163, Porto, Portugal, 1991
Cucerzan S. and Yarowsky D. Language independent named entity recognition
combining morphological and contextual evidence. In Proceedings of 1999
Joint SIGDAT Conference on EMNLP and VLC, 1999
Dagan I. and Itai A. Word sense disambiguation using a second language
monolingual corpus. Computational Linguistics, 20:563–596, 1994
Daille B., Gaussier E. and Lange J.M. Towards automatic extraction of
monolingual and bilingual terminology. In Proceedings of COLING 94, 1994
Damerau F. Generating and evaluating domain oriented multi-word terms from
texts. Information Processing and Management, 29(4), 1993
D’Avanzo E. Using Keyphrases for Text Mining: Applications and Evaluation. PhD
Dissertation Series. Department of Information and Communication Sciences,
University of Trento. December 2005
D’Avanzo E., Elia A., Kuflik T., Vietri S. LAKE system at DUC 2007. In
Proceedings of the Document Understanding Conference, NAACL-HLT
2007, Rochester, NY, USA, April 22-27, 2007
D’Avanzo E., Magnini B. A Keyphrase-Based Approach to Summarization: the
LAKE System at DUC-2005. In Proceedings of Document Understanding
Workshop, HLT/EMNLP 2005, Vancouver, B.C., Canada, October 6-8, 2005
D’Avanzo E., Magnini B., Vallin A. Keyphrase Extraction for Summarization
Purposes: The LAKE System at DUC-2004. In Proceedings of Document
Understanding Workshop HLT/NAACL 2004. Boston, USA, May 6-7, 2004
Evans D.K., Klavans J.L. and Wacholder N. Document processing with linkit. In
Proceedings of the RIAO Conference, 2000
Fano R. Transmission of Information: A statistical theory of communications. MIT
press, MA, 1961
Fayyad U.M. and Irani K.B. Multi-interval discretization of continuous-valued
attributes for classification learning. In IJCAI, 1993
Fellbaum C. WordNet: An Electronic Lexical Database. MIT Press, 1998
Feigenbaum E.A. and Feldman J., editors, Computers and Thought, McGraw-Hill,
New York, 1963
Fillmore C.J. The case for case. In Bach E. and Harms R.T., editors, Universal in
Linguistic Theory, pages 1-88. Holt, Rinheart and Winston, New York, 1968
Florian R., Ittycheriah A., Jing H. and Zhang T. Named entity recognition through
classifier combination. In Proceedings of CoNLL-2003, 2003
Frank E., Paynter G.W., Witten I.H., Gutwin C. and Nevill-Manning C.G.
Domain-specific keyphrase extraction. In IJCAI, pages 668–673, 1999
Geleijnse G., Korst J. Automatic Ontology Population by Googling, Philips
Research Laboratories, 2005
Gruber T.R. A Translation Approach to Portable Ontology Specifications.
Knowledge Acquisition, 5(2), 1993, pp. 199-220
Harper S. and Patel N. Gist summaries for visually impaired surfers. In
Proceedings of the 7th international ACM SIGACCESS conference on Computers
and accessibility, pp. 90-97, 2005
Harris Z.S. Notes du cours de syntaxe, Paris, Larousse, 1976
Harris Z.S. Papers in Structural and Transformational Linguistics, Reidel,
Dordrecht, 1970
Harris Z.S. Discourse Analysis (1952), in Harris 1970, pp. 313-347
Hearst M. A. Automatic Acquisition of Hyponyms from Large Text Corpora. In
Proceedings of the Fourteenth International Conference on Computational
Linguistics, 1992
Hulth A. Improved automatic keyword extraction given more linguistic knowledge.
In Empirical Methods in Natural Language Processing, 2003
Jackson P. and Moulinier I. Natural Language Processing for Online Applications.
John Benjamins Publishing Company, 2002
Jones S., Jones M. and Deo S. Using keyphrases as search result surrogates
on small screen devices. Personal Ubiquitous Comput., 8(1):55–68, 2004
Justeson J.S. and Katz S.M. Technical terminology: some linguistic properties and
an algorithm for identification in text. Natural Language Engineering, 1,
1995
Kantardzic M. Data Mining. IEEE Press, 2003
Kupiec J., Pedersen J. and Chen F. A Trainable Document Summarizer. In
Proceedings of the 18th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, 1995
Lovins J.B. Development of a stemming algorithm. Mechanical Translation and
Computational Linguistics, 11:22–31, 1968
Manning C.D. and Schütze H. Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts; London, England,
1999
Marchionini G. Information Seeking in Electronic Environments. Cambridge
University Press, 1995
Mast M., Kummert F., Ehrlich U., Fink G.A., Kuhn T., Niemann H., Sagerer G. A
speech understanding and dialog system with a homogeneous linguistic
knowledge base. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 16(2):179-194, 1994
Meng Tan C., Fang Wang Y. and Do Lee C. The use of bigrams to enhance text
categorization. Inf. Process. Manage., 38(4):529–546, 2002
Miller G.A. WordNet: A lexical database for English. Communications of the
ACM 38(11):39-41, 1995
Miller G.A., Beckwith R., Fellbaum C., Gross D., Miller K.J., Tengi R. Five papers
on WordNet. Technical report, Princeton University, 1993.
ftp://ftp.cogsci.princeton.edu/pub/wordnet/5papers.ps. Cited 28 October
2005.
Mitchell T. Machine learning, McGraw Hill, 1997
Moens M.F. Automatic Indexing and Abstracting of Document Texts. Kluwer
Academic, 2000
Nugues P.M. An introduction to language processing with Perl and Prolog. Berlin:
Springer, 2006
Palmer D.D. and Day D.S. A statistical profile of the named entity task. In Fifth
ACL Conference for Applied Natural Language Processing (ANLP-97), 1997
Pianta E., Bentivogli L. and Girardi C. Multiwordnet: developing an aligned
multilingual database. In Proceedings of the First International Conference
on Global WordNet, 2002
Poesio M. Domain modelling and NLP: Formal ontologies? Lexica? Or a bit of
both?, in Applied Ontology, 1(1):27–33, 2005
Quinlan J.R. Learning decision tree classifiers. ACM Comput. Surv., 28(1):71–72,
1996
Quinlan J.R. C4.5: programs for machine learning. Morgan Kaufmann Publishers
Inc., 1993
Quinlan J.R. Induction of decision trees. Machine Learning, 1(1):81–106, 1986
Resnik P. Selectional constraints: an information-theoretic model and its
computational realization. Cognition 61:127-159, 1996
Riloff E. Automatically Generating Extraction Patterns from Untagged Text. In
Proceedings of AAAI-96, 1996
Salton G., editor. Automatic text processing. Addison-Wesley Longman Publishing
Co., Inc., Boston, MA, USA, 1988
Sanderson M. Reuters Test Collection. In BCS IRSG, 1994
Soderland S., Fisher D., Aseltine J. and Lehnert W. CRYSTAL: Inducing a
Conceptual Dictionary, in Proceedings of the International Joint Conference
on Artificial Intelligence, Montreal, Canada. pp. 1314-1319, 1995
Song M., Song I.Y. and Hu X. Kpspotter: a flexible information gain-based
keyphrase extraction system. In Proceedings of the fifth ACM international
workshop on Web information and data management, pages 50–53. ACM
Press, 2003
Turney P.D. Mining the web for lexical knowledge to improve keyphrase
extraction: Learning from labeled and unlabeled data. Technical Report
ERB-1096. (NRC #44947), National Research Council, Institute for
Information Technology, 2002
Turney P.D. Learning algorithms for keyphrase extraction. Information Retrieval, 2
(4):303–336, 2000
Turney P.D. Learning to extract keyphrases from text. Technical Report ERB-1057.
(NRC #41622), National Research Council, Institute for Information
Technology, 1999
Turney P.D. Extraction of keyphrases from text: Evaluation of four algorithms.
Technical Report ERB-1051. (NRC #41550), National Research Council,
Institute for Information Technology, 1997
Witten I.H. and Frank E. Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. Morgan Kaufmann, 1999
Witten I.H., Paynter G.W., Frank E., Gutwin C. and Nevill-Manning C.G. KEA:
Practical automatic keyphrase extraction. In ACM DL, pages 254–255, 1999