Invited talk at Knowledge Engineering: Principles and Techniques (KEPT2007), Cluj-Napoca, Romania, June 2007
1. The Role of Linguistic Information for Shallow Language Processing
Constantin Orasan
Research Group in Computational Linguistics, University of Wolverhampton
http://www.wlv.ac.uk/~in6093/
2.
- We need to be able to process language automatically:
  - To have better access to information
  - To interact better with computers
  - To have texts translated from one language to another
- So why not replicate the way humans process language?
3. Process language in a similar manner to humans
- "natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language" (Waltz, 1982)
- Also referred to as deep processing
4. Deep vs. shallow linguistic processing
- Deep processing: tries to build an elaborate representation of the document in order to understand it and make inferences
- Shallow processing: extracts bits of information which could be useful for the task (e.g. the shallow surface meaning), but no attempt is made to understand the document
5. Purpose of this talk
- To show that deep processing has limited applicability
- To show that it is possible to improve the performance of
shallow methods by adding linguistic information
- Text summarisation is taken as an example
6. Structure
- Shallow processing for automatic summarisation
7. Automatic summarisation
- Attempts to produce summaries using automatic means:
  - Uses units from the source as such (extraction)
  - Rewords the information in the source (abstraction)
8. Automatic abstraction
- Many methods try to replicate the way humans produce
summaries
- Very popular in the 1980s because it fitted the overall AI trend
- The abstracts are quite good in terms of coherence and
cohesion
- Such systems tend to keep the information in some intermediate format
9. FRUMP
- The most famous automatic abstracting system
- Attempts to understand parts of the document
- Discards information which is not relevant to the script
- Words from the source are used to select the relevant
script
10. Example of script
- The police go where the suspect is
- There is optional fighting between the suspect and the police
- The suspect is apprehended
- The suspect is taken to a police station
- The suspect is incarcerated or released on bond
11. System organisation
- a PREDICTOR which takes the current context and predicts the next events
- a SUBSTANTIATOR which verifies and fleshes out the predictions
- If the PREDICTOR is wrong, it backtracks
- The SUBSTANTIATOR relies on textual information and inferences
12. The output
- "A bomb explosion in a Philippines Airlines jet has killed the person who planted the bomb and injured 3 people."
- The output can be in several languages
- It is very coherent and brief
13. Limitations
- It works very well when it can understand the text, but
- Language is ambiguous, so it is common to misunderstand a text (e.g. "Carter and Sadat embraced under a cherry tree in the White House garden, a symbolic gesture belying the differences between the two governments" → MEETING script)
14. Limitations (II)
- It can handle only scripts which are predefined
- It can deal only with information which is encoded in the scripts
- It can make inferences only about concepts it knows
- It is domain dependent and cannot be easily adapted to other domains
15. Limitations (III)
- Sometimes it can misunderstand some scripts, with funny results:
- Input: "Vatican City. The death of the Pope shakes the world. He passed away"
- Output: "Earthquake in the Vatican. One dead."
16.
- "natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language" (Waltz, 1982)
- "there seems to be no prospect for anything other than narrow-domain natural-language systems for the foreseeable future" (Waltz, 1982)
17. Automatic extraction
- Uses various shallow methods to determine which sentences are important
- It is fairly domain independent
- Extracts units (e.g. sentences, paragraphs) and usually
presents them in the order they appear
- The extracts are not very coherent, but they can give the gist
of the text
18. Purpose of this research
- Show how different types of linguistic information can be used
to improve the quality of automatic summaries
- Build automatic summarisers which rely on an increasing number of modules
- Assess each of the summarisers
19. Setting of this research
- A corpus of 65 scientific articles from JAIR was used
- Over 600,000 words in total
- They were in electronic format
- They contain author-produced summaries
- 2%, 3%, 5%, 6% and 10% summaries are produced
20. Evaluation metric
- Cosine similarity between the automatic extract and the human
produced abstract
- It would be very interesting to repeat the experiments using alternative evaluation metrics, e.g. ROUGE
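As an illustration, a minimal sketch of this metric in Python, assuming bag-of-words vectors with raw term counts (the exact tokenisation and weighting used in the experiments are not specified here):
```python
# A minimal sketch of the evaluation metric: cosine similarity between
# the bag-of-words vectors of an extract and a human abstract.
# Raw-count weighting and regex tokenisation are simplifying assumptions.
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a real setup might also stem and stop-list."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(text_a, text_b):
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```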
21. Extracts vs. abstracts
- Abstract: The main operations in Inductive Logic Programming (ILP) are generalization and specialization, which only make sense in a generality order.
- S16: Inductive Logic Programming (ILP) is a subfield of Logic Programming and Machine Learning that tries to induce clausal theories from given sets of positive and negative examples.
- S24: The two main operations in ILP for modification of a theory are generalization and specialization.
- S26: These operations only make sense within a generality order.
22. Extracts vs. abstracts
- (the same example, with the fragments shared by the abstract and the extracted sentences highlighted)
23. Extracts vs. abstracts (II)
- It is not possible to obtain 100% match between extracts and
abstracts
- There is an upper limit for extracts somewhere
- This upper limit is represented by the set of sentences which
maximise the similarity with human abstracts
24. Determining the upper limit
- Find the set of sentences which maximises the similarity with the human abstract
- More details in Orasan (2005)
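A hedged sketch of one way to approximate this upper limit: greedily add the sentence that most increases the similarity with the abstract. It reuses the cosine_similarity function from the sketch above; the actual procedure in Orasan (2005) may differ.
```python
# Greedy approximation of the upper limit: repeatedly add the sentence
# that most increases the extract's similarity with the human abstract.
# Illustrative only; exhaustive search over sentence sets is exponential.
def upper_limit_extract(sentences, abstract, max_sentences):
    chosen = []
    best_score = 0.0
    while len(chosen) < max_sentences:
        best = None
        for s in sentences:
            if s in chosen:
                continue
            score = cosine_similarity(" ".join(chosen + [s]), abstract)
            if score > best_score:
                best, best_score = s, score
        if best is None:      # no remaining sentence improves the similarity
            break
        chosen.append(best)
    # present the selected sentences in document order
    return [s for s in sentences if s in chosen]
```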
25. The upper limit
26. Baseline
- A very simple method which does not employ much knowledge
- The first and last sentence of each paragraph were used
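A minimal sketch of this baseline, assuming the document is already split into paragraphs and sentences:
```python
# Sketch of the baseline: keep the first and last sentence of each
# paragraph. Each paragraph is assumed to be a list of sentence strings.
def baseline_extract(paragraphs):
    extract = []
    for sentences in paragraphs:
        if sentences:
            extract.append(sentences[0])
        if len(sentences) > 1:
            extract.append(sentences[-1])
    return extract
```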
27. The upper and lower limit
28. Term-based summarisation
- One of the most popular summarisation methods
- It is rarely used on its own
- Assumes that the importance of a sentence can be determined on the basis of the importance of the words it contains
- Various methods can be used to determine the importance of
words
29. Term-frequency
- The importance of a word is determined by how frequent it
is
- Not very good for very frequent words such as articles and
prepositions
- A stop list can be used to filter out such words
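A minimal sketch of term-frequency sentence scoring with a stop list; the stop list here is a toy example:
```python
# Sketch of term-frequency sentence scoring: a sentence scores the sum
# of the document-wide frequencies of its words; stop words are filtered out.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is"}  # toy list

def tf_scores(sentences):
    words = [w for s in sentences for w in s.lower().split()
             if w not in STOP_WORDS]
    freq = Counter(words)
    return [sum(freq[w] for w in s.lower().split() if w not in STOP_WORDS)
            for s in sentences]
```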
30. TF*IDF
- Very popular method in information retrieval (IR) and automatic summarisation (AS)
- IDF = inverse document frequency
- A word which is frequent across a whole collection of documents cannot be important for a particular document, even if it is quite frequent in that document
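A sketch of TF*IDF weighting under the usual definition (term frequency times the log of the inverse document frequency); the exact variant used in the experiments is an assumption:
```python
# Sketch of TF*IDF term weighting: a word frequent in one document but
# rare across the collection gets a high weight.
import math
from collections import Counter

def tfidf_weights(document_tokens, collection):
    """document_tokens: list of tokens; collection: list of token lists
    (assumed to include the document itself, so doc_freq >= 1)."""
    tf = Counter(document_tokens)
    n_docs = len(collection)
    weights = {}
    for word, count in tf.items():
        doc_freq = sum(1 for doc in collection if word in doc)
        weights[word] = count * math.log(n_docs / doc_freq)
    return weights
```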
31.
32.
33. Indicating phrases
- Indicating phrases are groups of words which can indicate the importance or unimportance of a sentence
- They are usually meta-discourse markers
- E.g. "in this paper we present", "we conclude that", "for example", "we believe"
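A minimal sketch of how indicating phrases might be used to score sentences; the split of the example phrases into bonus and stigma lists, and the unit weights, are illustrative assumptions:
```python
# Sketch of indicating-phrase scoring: sentences containing a bonus
# phrase are boosted, sentences containing a stigma phrase are demoted.
# The phrase lists and the unit weights are illustrative assumptions.
BONUS_PHRASES = ["in this paper we present", "we conclude that"]
STIGMA_PHRASES = ["for example", "we believe"]

def indicating_phrase_score(sentence):
    s = sentence.lower()
    return (sum(1 for p in BONUS_PHRASES if p in s)
            - sum(1 for p in STIGMA_PHRASES if p in s))
```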
34.
35. More accurate word frequencies
- Words can be referred to by pronouns, which means that the concepts represented by these words do not get accurate frequency scores
- A pronoun resolution algorithm was employed to determine the antecedents of pronouns and obtain more accurate frequency scores for words
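A sketch of the frequency correction, assuming the pronoun resolver returns, for each pronoun, the head word of its antecedent (the data format is an assumption):
```python
# Sketch of frequency correction with anaphoric information: each
# resolved pronoun counts as one more mention of its antecedent's head
# word, instead of as a mention of the pronoun itself.
from collections import Counter

def corrected_frequencies(tokens, resolved_pronouns):
    """tokens: list of words; resolved_pronouns: dict mapping the index
    of a pronoun token to the head word of its antecedent (assumed to
    come from a pronoun resolution system)."""
    freq = Counter(tokens)
    for index, antecedent_head in resolved_pronouns.items():
        freq[antecedent_head] += 1   # credit the antecedent with the mention
        freq[tokens[index]] -= 1     # stop counting the pronoun itself
    return freq
```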
36. Mitkov's Anaphora Resolution System (MARS)
- Relies on a set of boosting and impeding indicators to determine the antecedent from a set of candidates:
  - Prefer: subjects, terms, closer candidates
  - Penalise: indefinite NPs, far-away candidates
- A third of the pronouns in the corpus were annotated with anaphoric information
- More in Mitkov, Evans and Orasan (2002)
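A toy sketch in the spirit of these indicators; the indicator set and the unit weights are illustrative, not the real MARS scores (see Mitkov, Evans and Orasan (2002) for the full system):
```python
# Toy sketch of choosing an antecedent with boosting and impeding
# indicators, in the spirit of MARS. Indicators and weights are
# illustrative assumptions, not the real system's scores.
def candidate_score(candidate):
    """candidate: dict with keys 'is_subject', 'is_term',
    'is_indefinite' and 'sentence_distance' (0 = same sentence)."""
    score = 0
    if candidate["is_subject"]:
        score += 1                            # boost subjects
    if candidate["is_term"]:
        score += 1                            # boost domain terms
    if candidate["is_indefinite"]:
        score -= 1                            # penalise indefinite NPs
    score -= candidate["sentence_distance"]   # prefer closer candidates
    return score

def resolve(candidates):
    return max(candidates, key=candidate_score)
```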
37.
38. Combination of modules
- Used a linear combination of the previous modules:
  - Term-based summariser enhanced with anaphora resolution
- The scores assigned by each module are normalised and each module is given a weight of 1
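A minimal sketch of the combination, assuming min-max normalisation of each module's scores (the normalisation method is not specified above):
```python
# Sketch of the linear combination: each module's sentence scores are
# normalised (here with min-max scaling, an assumption) and summed,
# every module receiving a weight of 1.
def normalise(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combined_scores(module_scores):
    """module_scores: one list of per-sentence scores per module,
    all over the same sentences."""
    normalised = [normalise(scores) for scores in module_scores]
    return [sum(column) for column in zip(*normalised)]
```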
39.
40. Discourse information
- Use a genetic algorithm to produce extracts which:
  - Maximise the score assigned by the combined summariser
  - Have consecutive sentences which feature the same entities
- Loosely implements Centering Theory
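A sketch of a fitness function such a genetic algorithm might optimise: reward the combined-summariser score plus entity overlap between consecutive sentences. The balance weight and the entity representation are assumptions, and the genetic machinery itself (population, crossover, mutation) is omitted.
```python
# Sketch of a fitness function for the genetic algorithm: reward the
# combined-summariser score of the selected sentences plus entity
# overlap between consecutive sentences (a loose rendering of Centering
# Theory). The balance weight is an illustrative assumption.
def entity_continuity(entity_sets):
    """entity_sets: one set of entities per selected sentence, in
    document order."""
    return sum(len(a & b) for a, b in zip(entity_sets, entity_sets[1:]))

def fitness(extract_indices, sentence_scores, sentence_entities, weight=1.0):
    score = sum(sentence_scores[i] for i in extract_indices)
    ordered = [sentence_entities[i] for i in sorted(extract_indices)]
    return score + weight * entity_continuity(ordered)
```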
41.
42. Conclusions
- It is possible to improve the accuracy of shallow automatic
summarisers by using additional linguistic information
- The linguistic information is relatively simple and easy to
obtain
- but things are not always the way we expect (see Orasan 2006)
- The methods are domain independent
43.
44. Thank you