The Role of Linguistic Information for Shallow Language Processing
Constantin Orasan
Research Group in Computational Linguistics
University of Wolverhampton
http://www.wlv.ac.uk/~in6093/
KEPT2007 - 6th June 2007
We need to be able to process language automatically:
To have better access to information
To interact better with computers
To have texts translated from one language to another
… so why not replicate the way humans process language?
Process language in a similar manner to humans
… “natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language” (Waltz, 1982)
Also referred to as deep processing
Deep vs. shallow linguistic processing
Deep processing: tries to build an elaborate representation of the document in order to “understand” it and make inferences
Shallow processing: extracts bits of information which could be useful for the task (e.g. shallow surface meaning), but no attempt is made to understand the document
Purpose of this talk
To show that deep processing has limited applicability
To show that it is possible to improve the performance of shallow methods by adding linguistic information
Text summarisation is taken as example
Structure
1. Introduction
2. FRUMP
3. Shallow processing for automatic summarisation
4. Evaluation
5. Conclusions
Automatic summarisation
Attempts to produce summaries using automatic means
Produces extracts: extract and rearrange
Uses units from the source verbatim
Produces abstracts: understand and generate
Rewords the information in the source
Automatic abstraction
Many methods try to replicate the way humans produce summaries
Very popular in the 1980s because it fitted the overall AI trend
The abstracts are quite good in terms of coherence and cohesion
Tend to keep the information in some intermediate format
FRUMP
The most famous automatic abstracting system
Attempts to understand parts of the document
Uses 50 sketchy scripts
Discards information which is not relevant to the script
Words from the source are used to select the relevant script
Example of script
The ARREST script:
1. The police go to where the suspect is
2. There is optional fighting between the suspect and the police
3. The suspect is apprehended
4. The suspect is taken to a police station
5. The suspect is charged
6. The suspect is incarcerated or released on bond
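As an illustration, a sketchy script of this kind can be rendered as an ordered list of events, some of them optional. This is a hypothetical data structure chosen for clarity, not FRUMP's actual internal representation:

```python
# An ARREST-style sketchy script as an ordered event list.
# "optional" marks events that may be absent from the story.
ARREST = [
    {"event": "police go to the suspect's location",          "optional": False},
    {"event": "fighting between suspect and police",          "optional": True},
    {"event": "suspect is apprehended",                       "optional": False},
    {"event": "suspect is taken to a police station",         "optional": False},
    {"event": "suspect is charged",                           "optional": False},
    {"event": "suspect is incarcerated or released on bond",  "optional": False},
]

def required_events(script):
    """Return only the events that must be matched in the text."""
    return [e["event"] for e in script if not e["optional"]]
```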
System organisation
Relies on:
a PREDICTOR which takes the current context and predicts the next events
a SUBSTANTIATOR which verifies and fleshes out the predictions
If the PREDICTOR is wrong, it backtracks
The SUBSTANTIATOR relies on textual information and inferences
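A minimal sketch of the predict-and-substantiate loop. The PREDICTOR is simplified here to walking through the script's events in order, and a failed prediction is simply dropped rather than triggering full backtracking; all names are illustrative:

```python
def process(text, script, substantiate):
    """Predict each event of the script in turn (a stand-in for the
    PREDICTOR) and try to verify it against the text with
    `substantiate` (the SUBSTANTIATOR); `substantiate` returns the
    fleshed-out event or None when the prediction fails."""
    understood = []
    for predicted in script:
        event = substantiate(predicted, text)
        if event is not None:
            understood.append(event)
        # a failed prediction is discarded (simplified backtracking)
    return understood
```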
The output
Example of summary: A bomb explosion in a Philippines Airlines jet has killed the person who planted the bomb and injured 3 people.
The output can be in several languages
It is very coherent and brief
Limitations
It works very well when it can understand the text, but …
Language is ambiguous, so it is common to misunderstand a text (e.g. “Carter and Sadat embraced under a cherry tree in the White House garden, a symbolic gesture belying the differences between the two governments” → MEETING script)
Limitations (II)
It can handle only scripts which are predefined
It can deal only with information which is encoded in the scripts
It can make inferences only about concepts it knows
… it is domain dependent and cannot be easily adapted to other domains
Limitations (III)
Sometimes it can misunderstand some scripts, with funny results:
Input: “Vatican City. The death of the Pope shakes the world. He passed away …”
Summary: “Earthquake in the Vatican. One dead.”
… “natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language” (Waltz, 1982)
“…there seems to be no prospect for anything other than narrow-domain natural-language systems for the foreseeable future” (Waltz, 1982)
Automatic extraction
Uses various shallow methods to determine which sentences are important
It is fairly domain independent
Extracts units (e.g. sentences, paragraphs) and usually presents them in the order they appear
The extracts are not very coherent, but they can give the gist of the text
Purpose of this research
Show how different types of linguistic information can be used to improve the quality of automatic summaries
Build automatic summarisers which rely on an increasing number of modules
Combine this information
Assess each of the summarisers
Setting of this research
A corpus of 65 scientific articles from JAIR was used
Over 600,000 words in total
The articles were in electronic format
They contain author-produced summaries
2%, 3%, 5%, 6% and 10% summaries are produced
Evaluation metric
Cosine similarity between the automatic extract and the human-produced abstract
It would be very interesting to repeat the experiments using alternative evaluation metrics, e.g. ROUGE
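The metric can be sketched as follows, assuming simple bag-of-words term-frequency vectors over whitespace-tokenised, lower-cased text:

```python
from collections import Counter
from math import sqrt

def cosine(text_a, text_b):
    """Cosine similarity between two texts, each represented as a
    raw term-frequency vector."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0
```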
Extracts vs. abstracts
Human abstract:
The main operations in Inductive Logic Programming (ILP) are generalization and specialization, which only make sense in a generality order.
Extract:
S16 Inductive Logic Programming (ILP) is a subfield of Logic Programming and Machine Learning that tries to induce clausal theories from given sets of positive and negative examples.
S24 The two main operations in ILP for modification of a theory are generalization and specialization.
S26 These operations only make sense within a generality order.
Extracts vs. abstracts (II)
It is not possible to obtain a 100% match between extracts and abstracts
There is therefore an upper limit for extracts
This upper limit is represented by the set of sentences which maximises the similarity with the human abstract
Determining the upper limit
Try to find out the set of sentences which maximises the similarity with the human abstract
Two approaches:
A greedy algorithm
A genetic algorithm
More details in Orasan (2005)
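A minimal sketch of the greedy approach, using a simple word-overlap similarity as a stand-in for the cosine measure used in the actual experiments (function names are illustrative):

```python
def word_overlap(a, b):
    """Number of distinct words shared by two texts (toy similarity)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def greedy_extract(sentences, abstract, size, similarity=word_overlap):
    """At each step add the sentence that most increases the
    similarity between the growing extract and the abstract."""
    chosen, remaining = [], list(sentences)
    while remaining and len(chosen) < size:
        best = max(remaining,
                   key=lambda s: similarity(" ".join(chosen + [s]), abstract))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```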
The upper limit
[Chart: upper-limit similarity scores (0.2–0.9) for 2%, 3%, 5%, 6% and 10% summaries]
Baseline
A very simple method which does not employ much knowledge
The first and last sentence of each paragraph were used
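A sketch of this baseline, assuming each paragraph is given as a list of sentences:

```python
def baseline_extract(paragraphs):
    """First and last sentence of each paragraph; a single-sentence
    paragraph contributes that sentence once."""
    extract = []
    for sentences in paragraphs:
        if not sentences:
            continue
        extract.append(sentences[0])
        if len(sentences) > 1:
            extract.append(sentences[-1])
    return extract
```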
The upper and lower limit
[Chart: upper limit vs. baseline similarity scores (0.2–0.9) for 2%, 3%, 5%, 6% and 10% summaries]
Term-based summarisation
One of the most popular summarisation methods
It is rarely used on its own
Assumes that the importance of a sentence can be determined on the basis of the importance of the words it contains
Various methods can be used to determine the importance of words
Term-frequency
The importance of a word is determined by how frequent it is
Not very good for very frequent words such as articles and prepositions
A stop list can be used to filter out such words
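A minimal term-frequency scorer along these lines; the stop list below is a tiny illustrative sample, not the one used in the experiments:

```python
from collections import Counter

# Illustrative stop list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "are", "to"}

def tf_scores(sentences):
    """Score each sentence by the summed corpus frequency of its
    content (non-stop) words."""
    words = [w for s in sentences for w in s.lower().split()
             if w not in STOP_WORDS]
    freq = Counter(words)
    return [sum(freq[w] for w in s.lower().split() if w not in STOP_WORDS)
            for s in sentences]
```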
TF*IDF
A very popular method in IR and AS (automatic summarisation)
IDF = inverse document frequency
A word which is frequent across a collection of documents is unlikely to be important for an individual document, even if it is quite frequent in that document
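A sketch of the TF*IDF score, assuming documents are given as lists of tokens (this is the standard formulation; the exact variant used in the experiments is not specified here):

```python
from collections import Counter
from math import log

def tfidf(word, document, collection):
    """TF*IDF: high for words frequent in this document but rare
    across the collection; zero for words absent from the collection."""
    tf = Counter(document)[word]
    df = sum(1 for doc in collection if word in doc)
    idf = log(len(collection) / df) if df else 0.0
    return tf * idf
```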
[Chart: upper limit, baseline, TF and TF*IDF similarity scores (0.2–0.9) for 2%, 3%, 5%, 6% and 10% summaries]
[Chart (detail): upper limit, baseline, TF and TF*IDF similarity scores (0.40–0.50) for 5%, 6% and 10% summaries]
Indicating phrases
Indicating phrases are groups of words which can indicate the importance or “un-importance” of a sentence
They are usually meta-discourse markers
They are genre dependent
E.g. in this paper, we present, we conclude that, for example, we believe
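An illustrative scorer for such phrases; the phrase lists and weights below are hypothetical, since the actual lists and weights are genre dependent and not given here:

```python
# Hypothetical weights: positive for phrases indicating importance,
# negative for phrases indicating "un-importance".
BONUS = {"in this paper": 2.0, "we conclude that": 2.0, "we present": 1.5}
PENALTY = {"for example": -1.0, "we believe": -1.0}

def ip_score(sentence):
    """Sum the weights of every indicating phrase the sentence contains."""
    s = sentence.lower()
    return sum(w for phrase, w in {**BONUS, **PENALTY}.items() if phrase in s)
```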
[Chart: baseline, TF, TF*IDF and IP similarity scores (0.2–0.6) for 2%, 3%, 5%, 6% and 10% summaries]
More accurate word frequencies
Words can be referred to by pronouns; this means that the concepts represented by these words do not get accurate frequency scores
A pronoun resolution algorithm was employed to determine the antecedents of pronouns and obtain more accurate frequency scores for words
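The frequency correction can be sketched as follows, assuming pronoun resolution has already produced a mapping from pronoun token positions to antecedent words (a simplified interface chosen for illustration):

```python
from collections import Counter

def adjusted_frequencies(tokens, antecedents):
    """Count each resolved pronoun as an extra occurrence of its
    antecedent word; `antecedents` maps token index -> antecedent."""
    counts = Counter()
    for i, token in enumerate(tokens):
        counts[antecedents.get(i, token)] += 1
    return counts
```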
Mitkov’s Anaphora Resolution System (MARS)
Relies on a set of boosting and impeding indicators to determine the antecedent from a set of candidates:
Prefer: subjects, terms, closer candidates
Penalise: indefinite NPs, far-away candidates
A third of the pronouns in the corpus were annotated with anaphoric information
MARS: 51% success rate
More in Mitkov, Evans and Orasan (2002)
[Chart: TF, TF*IDF, IP, TF+MARS and TF*IDF+MARS similarity scores (0.35–0.55) for 2%, 3%, 5%, 6% and 10% summaries]
Combination of modules
Used a linear combination of the previous modules:
Term-based summariser enhanced with anaphora resolution
Indicating phrases
Positional clues
The scores assigned by each module are normalised and each module is given a weight of 1
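A sketch of the combination: each module's sentence scores are normalised to [0, 1] and summed with equal weight 1 (the normalisation scheme shown, min-max scaling, is an assumption):

```python
def normalise(scores):
    """Min-max scale scores into [0, 1]; a constant list maps to 0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def combine(*module_scores):
    """Equal-weight linear combination of normalised module scores,
    one score list per module, aligned by sentence."""
    normalised = [normalise(m) for m in module_scores]
    return [sum(column) for column in zip(*normalised)]
```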
[Chart: TF, TF*IDF, IP, TF+MARS, TF*IDF+MARS and Combination similarity scores (0.35–0.55) for 2%, 3%, 5%, 6% and 10% summaries]
Discourse information
Use a genetic algorithm to produce extracts which:
Have a high score from the “Combined” summariser
Feature the same entities in consecutive sentences
This loosely implements Centering Theory
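The fitness such a genetic algorithm might optimise can be sketched as the combined score plus a reward for entity continuity between consecutive sentences. Entities are approximated below by capitalised words, a crude stand-in chosen purely for illustration:

```python
def entity_continuity(extract_sentences):
    """Count adjacent sentence pairs that share at least one 'entity'
    (approximated here as a capitalised word)."""
    def entities(sentence):
        return {w for w in sentence.split() if w[0].isupper()}
    pairs = zip(extract_sentences, extract_sentences[1:])
    return sum(1 for a, b in pairs if entities(a) & entities(b))

def fitness(extract_sentences, combined_score, weight=1.0):
    """Balance the 'Combined' summariser score against local entity
    continuity (a loose rendering of Centering Theory)."""
    return combined_score + weight * entity_continuity(extract_sentences)
```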
[Chart: upper limit, baseline, TF, TF*IDF, IP, TF+MARS, TF*IDF+MARS, Combination and Discourse similarity scores (0.25–0.75) for 2%, 3%, 5%, 6% and 10% summaries]
Conclusions
It is possible to improve the accuracy of shallow automatic summarisers by using additional linguistic information
The linguistic information is relatively simple and easy to obtain
… but things are not always the way we expect (see Orasan 2006)
The methods are domain independent
Thank you