The Role of Linguistic Information for Shallow Language Processing
Constantin Orasan
Research Group in Computational Linguistics
University of Wolverhampton
http://www.wlv.ac.uk/~in6093/
KEPT2007 - 6th June 2007
We need to be able to process language automatically:
To have better access to information
To interact better with computers
To have texts translated from one language to another
… so why not replicate the way humans process language?
Process language in a similar manner to humans
… “natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language” (Waltz, 1982)
Also referred to as deep processing
Deep vs. shallow linguistic processing
Deep processing: tries to build an elaborate representation of the document in order to “understand” it and make inferences
Shallow processing: extracts bits of information which could be useful for the task (e.g. shallow surface meaning), but no attempt is made to understand the document
Purpose of this talk
To show that deep processing has limited applicability
To show that it is possible to improve the performance of shallow methods by adding linguistic information
Text summarisation is taken as example
Structure
1. Introduction
2. FRUMP
3. Shallow processing for automatic summarisation
4. Evaluation
5. Conclusions
Automatic summarisation
Attempts to produce summaries using automatic means
Produces extracts: extract and rearrange
Uses units from the source verbatim
Produces abstracts: understand and generate
Rewords the information in the source
Automatic abstraction
Many methods try to replicate the way humans produce summaries
Very popular in the 1980s because it fitted the overall AI trend
The abstracts are quite good in terms of coherence and cohesion
Tend to keep the information in some intermediate format
FRUMP
The most famous automatic abstracting system
Attempts to understand parts of the document
Uses 50 sketchy scripts
Discards information which is not relevant to the script
Words from the source are used to select the relevant script
Example of script
The ARREST script:
1. The police go to where the suspect is
2. There is optional fighting between the suspect and the police
3. The suspect is apprehended
4. The suspect is taken to a police station
5. The suspect is charged
6. The suspect is incarcerated or released on bond
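As an illustration, a sketchy script of this kind can be rendered as an ordered list of events, some of them optional. This is a hypothetical data structure chosen for clarity, not FRUMP's actual internal representation:

```python
# An ARREST-style sketchy script as an ordered event list.
# "optional" marks events that may be absent from the story.
ARREST = [
    {"event": "police go to the suspect's location",          "optional": False},
    {"event": "fighting between suspect and police",          "optional": True},
    {"event": "suspect is apprehended",                       "optional": False},
    {"event": "suspect is taken to a police station",         "optional": False},
    {"event": "suspect is charged",                           "optional": False},
    {"event": "suspect is incarcerated or released on bond",  "optional": False},
]

def required_events(script):
    """Return only the events that must be matched in the text."""
    return [e["event"] for e in script if not e["optional"]]
```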
System organisation
Relies on:
a PREDICTOR which takes the current context and predicts the next events
a SUBSTANTIATOR which verifies and fleshes out the predictions
If the PREDICTOR is wrong, it backtracks
The SUBSTANTIATOR relies on textual information and inferences
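A minimal sketch of the predict-and-substantiate loop. The PREDICTOR is simplified here to walking through the script's events in order, and a failed prediction is simply dropped rather than triggering full backtracking; all names are illustrative:

```python
def process(text, script, substantiate):
    """Predict each event of the script in turn (a stand-in for the
    PREDICTOR) and try to verify it against the text with
    `substantiate` (the SUBSTANTIATOR); `substantiate` returns the
    fleshed-out event or None when the prediction fails."""
    understood = []
    for predicted in script:
        event = substantiate(predicted, text)
        if event is not None:
            understood.append(event)
        # a failed prediction is discarded (simplified backtracking)
    return understood
```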
The output
Example of summary: A bomb explosion in a Philippines Airlines jet has killed the person who planted the bomb and injured 3 people.
The output can be in several languages
It is very coherent and brief
Limitations
It works very well when it can understand the text, but …
Language is ambiguous, so it is common to misunderstand a text (e.g. “Carter and Sadat embraced under a cherry tree in the White House garden, a symbolic gesture belying the differences between the two governments” → MEETING script)
Limitations (II)
It can handle only scripts which are predefined
It can deal only with information which is encoded in the scripts
It can make inferences only about concepts it knows
… it is domain dependent and cannot be easily adapted to other domains
Limitations (III)
Sometimes it can misunderstand some scripts, with funny results:
Input: “Vatican City. The death of the Pope shakes the world. He passed away …”
Summary: “Earthquake in the Vatican. One dead.”
… “natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language” (Waltz, 1982)
“…there seems to be no prospect for anything other than narrow-domain natural-language systems for the foreseeable future” (Waltz, 1982)
Automatic extraction
Uses various shallow methods to determine which sentences are important
It is fairly domain independent
Extracts units (e.g. sentences, paragraphs) and usually presents them in the order they appear
The extracts are not very coherent, but they can give the gist of the text
Purpose of this research
Show how different types of linguistic information can be used to improve the quality of automatic summaries
Build automatic summarisers which rely on an increasing number of modules
Combine this information
Assess each of the summarisers
Setting of this research
A corpus of 65 scientific articles from JAIR was used
Over 600,000 words in total
The articles were in electronic format
They contain author-produced summaries
2%, 3%, 5%, 6% and 10% summaries are produced
Evaluation metric
Cosine similarity between the automatic extract and the human-produced abstract
It would be very interesting to repeat the experiments using alternative evaluation metrics, e.g. ROUGE
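The metric can be sketched as follows, assuming simple bag-of-words term-frequency vectors over whitespace-tokenised, lower-cased text:

```python
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    """Cosine similarity between two texts, each represented as a
    raw term-frequency vector."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0
```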
Extracts vs. abstracts
Human abstract:
The main operations in Inductive Logic Programming (ILP) are generalization and specialization, which only make sense in a generality order.
Extract:
S16 Inductive Logic Programming (ILP) is a subfield of Logic Programming and Machine Learning that tries to induce clausal theories from given sets of positive and negative examples.
S24 The two main operations in ILP for modification of a theory are generalization and specialization.
S26 These operations only make sense within a generality order.
Extracts vs. abstracts (II)
It is not possible to obtain a 100% match between extracts and abstracts
There is therefore an upper limit for extracts
This upper limit is represented by the set of sentences which maximises the similarity with the human abstract
Determining the upper limit
Try to find out the set of sentences which maximises the similarity with the human abstract
Two approaches:
A greedy algorithm
A genetic algorithm
More details in Orasan (2005)
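A minimal sketch of the greedy approach, using a simple word-overlap similarity as a stand-in for the cosine measure used in the actual experiments (function names are illustrative):

```python
def word_overlap(a, b):
    """Number of distinct words shared by two texts (toy similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def greedy_extract(sentences, abstract, size, similarity=word_overlap):
    """At each step add the sentence that most increases the
    similarity between the growing extract and the abstract."""
    chosen, remaining = [], list(sentences)
    while remaining and len(chosen) < size:
        best = max(remaining,
                   key=lambda s: similarity(" ".join(chosen + [s]), abstract))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```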
The upper limit
[Chart: upper-limit similarity scores (0.2–0.9) for 2%, 3%, 5%, 6% and 10% summaries]
Baseline
A very simple method which does not employ much knowledge
The first and last sentence of each paragraph were used
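A sketch of this baseline, assuming each paragraph is given as a list of sentences:

```python
def baseline_extract(paragraphs):
    """First and last sentence of each paragraph; a single-sentence
    paragraph contributes that sentence once."""
    extract = []
    for sentences in paragraphs:
        if not sentences:
            continue
        extract.append(sentences[0])
        if len(sentences) > 1:
            extract.append(sentences[-1])
    return extract
```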
The upper and lower limit
[Chart: upper limit vs. baseline similarity scores (0.2–0.9) for 2%, 3%, 5%, 6% and 10% summaries]
Term-based summarisation
One of the most popular summarisation methods
It is rarely used on its own
Assumes that the importance of a sentence can be determined on the basis of the importance of the words it contains
Various methods can be used to determine the importance of words
Term-frequency
The importance of a word is determined by how frequent it is
Not very good for very frequent words such as articles and prepositions
A stop list can be used to filter out such words
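A minimal term-frequency scorer along these lines; the stop list below is a tiny illustrative sample, not the one used in the experiments:

```python
from collections import Counter

# Illustrative stop list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to"}

def tf_scores(sentences):
    """Score each sentence by the summed corpus frequency of its
    content (non-stop) words."""
    words = [w for s in sentences for w in s.lower().split()
             if w not in STOP_WORDS]
    freq = Counter(words)
    return [sum(freq[w] for w in s.lower().split() if w not in STOP_WORDS)
            for s in sentences]
```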
TF*IDF
A very popular method in IR and AS (automatic summarisation)
IDF = inverse document frequency
A word which is frequent across a collection of documents is unlikely to be important for an individual document, even if it is quite frequent in that document
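A sketch of the TF*IDF score, assuming documents are given as lists of tokens (this is the standard formulation; the exact variant used in the experiments is not specified here):

```python
from collections import Counter
from math import log

def tfidf(word, document, collection):
    """TF*IDF: high for words frequent in this document but rare
    across the collection; zero for words absent from the collection."""
    tf = Counter(document)[word]
    df = sum(1 for doc in collection if word in doc)
    idf = log(len(collection) / df) if df else 0.0
    return tf * idf
```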
[Chart: upper limit, baseline, TF and TF*IDF similarity scores (0.2–0.9) for 2%, 3%, 5%, 6% and 10% summaries]
[Chart (detail): upper limit, baseline, TF and TF*IDF similarity scores (0.40–0.50) for 5%, 6% and 10% summaries]
Indicating phrases
Indicating phrases are groups of words which can indicate the importance or “un-importance” of a sentence
They are usually meta-discourse markers
They are genre dependent
E.g. in this paper, we present, we conclude that, for example, we believe
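An illustrative scorer for such phrases; the phrase lists and weights below are hypothetical, since the actual lists and weights are genre dependent and not given here:

```python
# Hypothetical weights: positive for phrases indicating importance,
# negative for phrases indicating "un-importance".
BONUS = {"in this paper": 2.0, "we conclude that": 2.0, "we present": 1.5}
PENALTY = {"for example": -1.0, "we believe": -1.0}

def ip_score(sentence):
    """Sum the weights of every indicating phrase the sentence contains."""
    s = sentence.lower()
    return sum(w for phrase, w in {**BONUS, **PENALTY}.items() if phrase in s)
```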
[Chart: baseline, TF, TF*IDF and IP similarity scores (0.2–0.6) for 2%, 3%, 5%, 6% and 10% summaries]
More accurate word frequencies
Words can be referred to by pronouns; this means that the concepts represented by these words do not get accurate frequency scores
A pronoun resolution algorithm was employed to determine the antecedents of pronouns and obtain more accurate frequency scores for words
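The frequency correction can be sketched as follows, assuming pronoun resolution has already produced a mapping from pronoun token positions to antecedent words (a simplified interface chosen for illustration):

```python
from collections import Counter

def adjusted_frequencies(tokens, antecedents):
    """Count each resolved pronoun as an extra occurrence of its
    antecedent word; `antecedents` maps token index -> antecedent."""
    counts = Counter()
    for i, token in enumerate(tokens):
        counts[antecedents.get(i, token)] += 1
    return counts
```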
Mitkov’s Anaphora Resolution System (MARS)
Relies on a set of boosting and impeding indicators to determine the antecedent from a set of candidates:
Prefer: subjects, terms, closer candidates
Penalise: indefinite NPs, far-away candidates
A third of the pronouns in the corpus were annotated with anaphoric information
MARS: 51% success rate
More in Mitkov, Evans and Orasan (2002)
[Chart: TF, TF*IDF, IP, TF+MARS and TF*IDF+MARS similarity scores (0.35–0.55) for 2%, 3%, 5%, 6% and 10% summaries]
Combination of modules
Used a linear combination of the previous modules:
Term-based summariser enhanced with anaphora resolution
Indicating phrases
Positional clues
The scores assigned by each module are normalised and each module is given a weight of 1
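A sketch of the combination: each module's sentence scores are normalised to [0, 1] and summed with equal weight 1 (the normalisation scheme shown, min-max scaling, is an assumption):

```python
def normalise(scores):
    """Min-max scale scores into [0, 1]; a constant list maps to 0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def combine(*module_scores):
    """Equal-weight linear combination of normalised module scores,
    one score list per module, aligned by sentence."""
    normalised = [normalise(m) for m in module_scores]
    return [sum(column) for column in zip(*normalised)]
```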
[Chart: TF, TF*IDF, IP, TF+MARS, TF*IDF+MARS and Combination similarity scores (0.35–0.55) for 2%, 3%, 5%, 6% and 10% summaries]
Discourse information
Use a genetic algorithm to produce extracts which:
Have a high score from the “Combined” summariser
Feature the same entities in consecutive sentences
This loosely implements Centering Theory
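The fitness such a genetic algorithm might optimise can be sketched as the combined score plus a reward for entity continuity between consecutive sentences. Entities are approximated below by capitalised words, a crude stand-in chosen purely for illustration:

```python
def entity_continuity(extract_sentences):
    """Count adjacent sentence pairs that share at least one 'entity'
    (approximated here as a capitalised word)."""
    def entities(sentence):
        return {w for w in sentence.split() if w[0].isupper()}
    pairs = zip(extract_sentences, extract_sentences[1:])
    return sum(1 for a, b in pairs if entities(a) & entities(b))

def fitness(extract_sentences, combined_score, weight=1.0):
    """Balance the 'Combined' summariser score against local entity
    continuity (a loose rendering of Centering Theory)."""
    return combined_score + weight * entity_continuity(extract_sentences)
```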
[Chart: upper limit, baseline, TF, TF*IDF, IP, TF+MARS, TF*IDF+MARS, Combination and Discourse similarity scores (0.25–0.75) for 2%, 3%, 5%, 6% and 10% summaries]
Conclusions
It is possible to improve the accuracy of shallow automatic summarisers by using additional linguistic information
The linguistic information is relatively simple and easy to obtain
… but things are not always the way we expect (see Orasan 2006)
The methods are domain independent
Thank you