The role of linguistic information for shallow language processing

Invited talk at Knowledge Engineering: Principles and Techniques (KEPT2007), Cluj-Napoca, Romania, June 2007

1. The Role of Linguistic Information for Shallow Language Processing

  • Constantin Orasan
  • Research Group in Computational Linguistics, University of Wolverhampton
  • http://www.wlv.ac.uk/~in6093/

2.

  • We need to be able to process language automatically:
    • To have better access to information
    • To interact better with computers
    • To have texts translated from one language to another
  • so why not replicate the way humans process language?

3. Process language in a similar manner to humans

  • natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language (Waltz, 1982)
  • Also referred to as deep processing

4. Deep vs. shallow linguistic processing

  • Deep processing: tries to build an elaborate representation of the document in order to understand it and make inferences
  • Shallow processing: extracts bits of information which could be useful for the task (e.g. the shallow surface meaning), but makes no attempt to understand the document

5. Purpose of this talk

  • To show that deep processing has limited applicability
  • To show that it is possible to improve the performance of shallow methods by adding linguistic information
  • Text summarisation is taken as an example

6. Structure

  • Introduction
  • FRUMP
  • Shallow processing for automatic summarisation
  • Evaluation
  • Conclusions

7. Automatic summarisation

  • Attempts to produce summaries by automatic means
  • Produces extracts:
    • extract and rearrange
    • uses units from the source verbatim
  • Produces abstracts:
    • understand and generate
    • rewords the information in the source

8. Automatic abstraction

  • Many methods try to replicate the way humans produce summaries
  • Very popular in the 1980s because it fitted the overall AI trend
  • The abstracts are quite good in terms of coherence and cohesion
  • The information tends to be kept in an intermediate representation

9. FRUMP

  • The most famous automatic abstracting system
  • Attempts to understand parts of the document
  • Uses 50 sketchy scripts
  • Discards information which is not relevant to the script
  • Words from the source are used to select the relevant script

10. Example of script

  • The ARREST script (a data-structure sketch follows this list):
    • The police go to where the suspect is
    • There is optional fighting between the suspect and the police
    • The suspect is apprehended
    • The suspect is taken to a police station
    • The suspect is charged
    • The suspect is incarcerated or released on bond
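
The talk describes sketchy scripts only informally. As an illustration, one can be pictured as an ordered list of expected events, each with its participants and an optional flag; all names and fields below are assumptions for this sketch, not FRUMP's actual representation.

```python
# A minimal sketch of a "sketchy script" as a plain data structure.
# Event names, role lists and the "optional" flag are illustrative
# assumptions; FRUMP's real conceptual-dependency encoding was richer.
ARREST_SCRIPT = [
    {"event": "police_arrive",   "roles": ["police", "suspect"], "optional": False},
    {"event": "fight",           "roles": ["police", "suspect"], "optional": True},
    {"event": "apprehension",    "roles": ["police", "suspect"], "optional": False},
    {"event": "take_to_station", "roles": ["police", "suspect"], "optional": False},
    {"event": "charge",          "roles": ["suspect"],           "optional": False},
    {"event": "incarcerate_or_release_on_bond",
     "roles": ["suspect"], "optional": False},
]
```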

11. System organisation

  • Relies on:
    • a PREDICTOR, which takes the current context and predicts the next events
    • a SUBSTANTIATOR, which verifies and fleshes out the predictions
  • If the PREDICTOR is wrong, the system backtracks
  • The SUBSTANTIATOR relies on textual information and inferences (a sketch of this loop follows)
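
A minimal, runnable sketch of the PREDICTOR/SUBSTANTIATOR control flow. The keyword-trigger matching is an assumption made purely for illustration (FRUMP worked over conceptual-dependency structures), and backtracking is only hinted at in a comment.

```python
# Each event: (name, trigger words, optional?). Illustrative values only.
ARREST_EVENTS = [
    ("police_arrive", {"police", "officers"}, False),
    ("fight",         {"fought", "struggled"}, True),   # optional event
    ("apprehension",  {"arrested", "apprehended"}, False),
    ("charge",        {"charged"}, False),
]

def predictor(events, position):
    """PREDICTOR: propose the next expected event, if any remain."""
    return events[position] if position < len(events) else None

def substantiator(event, text):
    """SUBSTANTIATOR: verify a predicted event against the text
    (naive keyword matching stands in for evidence and inference)."""
    name, triggers, optional = event
    return any(word in text.lower() for word in triggers)

def process_story(events, text):
    confirmed, position = [], 0
    prediction = predictor(events, position)
    while prediction is not None:
        if substantiator(prediction, text):
            confirmed.append(prediction[0])
        elif not prediction[2]:
            pass  # a required event failed: the real system would backtrack here
        position += 1
        prediction = predictor(events, position)
    return confirmed

print(process_story(ARREST_EVENTS,
                    "Police arrested the suspect, who was later charged."))
# -> ['police_arrive', 'apprehension', 'charge']
```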

12. The output

  • Example of summary:
    • A bomb explosion in a Philippines Airlines jet has killed the person who planted the bomb and injured 3 people.
  • The output can be in several languages
  • It is very coherent and brief

13. Limitations

  • It works very well when it can understand the text, but
  • Language is ambiguous, so it is common to misunderstand a text (e.g. Carter and Sadat embraced under a cherry tree in the White House garden, a symbolic gesture belying the differences between the two governments → MEETING script)

14. Limitations (II)

  • It can handle only scripts which are predefined
  • It can deal only with information which is encoded in the scripts
  • It can make inferences only about concepts it knows
  • It is domain dependent and cannot be easily adapted to other domains

15. Limitations (III)

  • Sometimes it misunderstands a script, with amusing results:
  • Input: Vatican City. The death of the Pope shakes the world. He passed away
  • Summary: Earthquake in the Vatican. One dead.

16.

  • natural language systems must not simply understand the shallow surface meaning of language, but must also be able to understand the deeper implications and inferences that a user is likely to intend and is likely to take from language (Waltz, 1982)
  • there seems to be no prospect for anything other than narrow-domain natural-language systems for the foreseeable future (Waltz, 1982)

17. Automatic extraction

  • Uses various shallow methods to determine which sentences are important
  • It is fairly domain independent
  • Extracts units (e.g. sentences, paragraphs) and usually presents them in the order they appear
  • The extracts are not very coherent, but they can give the gist of the text

18. Purpose of this research

  • Show how different types of linguistic information can be used to improve the quality of automatic summaries
  • Build automatic summarisers which rely on an increasing number of modules
  • Combine this information
  • Assess each of the summarisers

19. Setting of this research

  • A corpus of 65 scientific articles from JAIR was used
  • Over 600,000 words in total
  • They were available in electronic format
  • They contain author-produced summaries
  • Summaries at compression rates of 2%, 3%, 5%, 6% and 10% are produced

20. Evaluation metric

  • Cosine similarity between the automatic extract and the human-produced abstract (a minimal implementation is sketched below)
  • It would be very interesting to repeat the experiments using alternative evaluation metrics, e.g. ROUGE
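
A minimal implementation of the metric. The talk only states that cosine similarity between extract and abstract is used, so the whitespace tokenisation and raw term-frequency weighting here are assumptions.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity over raw term-frequency vectors.

    Tokenisation (lowercased whitespace split) and term-frequency
    weighting are assumptions; other weighting schemes are possible.
    """
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("the pope died", "the pope passed away"))  # ~0.577
```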

21. Extracts vs. abstracts

  • Human abstract:
  • The main operations in Inductive Logic Programming (ILP) are generalization and specialization, which only make sense in a generality order.
  • Extract:
  • S16: Inductive Logic Programming (ILP) is a subfield of Logic Programming and Machine Learning that tries to induce clausal theories from given sets of positive and negative examples.
  • S24: The two main operations in ILP for modification of a theory are generalization and specialization.
  • S26: These operations only make sense within a generality order.

23. Extracts vs. abstracts (II)

  • It is not possible to obtain a 100% match between extracts and abstracts
  • There is an upper limit for what extracts can achieve
  • This upper limit is represented by the set of sentences which maximises the similarity with the human abstract

24. Determining the upper limit

  • Find the set of sentences which maximises the similarity with the human abstract
  • Two approaches:
    • a greedy algorithm
    • a genetic algorithm
  • More details in Orasan (2005); the greedy approach is sketched after this list
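
A sketch of the greedy variant, using the cosine_similarity sketch above (or any other text-to-text score): keep adding the sentence that most improves similarity to the abstract until no sentence helps. The stopping rule and tie-handling are simplifying assumptions; Orasan (2005) gives the actual algorithms.

```python
def greedy_upper_limit(sentences, abstract, similarity):
    """Greedily select the sentence set that maximises similarity
    to the human abstract (a simplified sketch, not the exact
    procedure of Orasan (2005))."""
    selected, best = [], 0.0
    remaining = list(sentences)
    while remaining:
        # Score each candidate extended extract and keep the best.
        score, sentence = max(
            (similarity(" ".join(selected + [s]), abstract), s)
            for s in remaining)
        if score <= best:
            break  # no remaining sentence improves the similarity
        best = score
        selected.append(sentence)
        remaining.remove(sentence)
    return selected, best

sentences = ["ILP is a subfield of machine learning.",
             "The main operations are generalization and specialization.",
             "It rained yesterday."]
abstract = "The main operations in ILP are generalization and specialization."
print(greedy_upper_limit(sentences, abstract, cosine_similarity))
```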

25. The upper limit

26. Baseline

  • A very simple method which does not employ much knowledge
  • The first and last sentence of each paragraph were used (a sketch follows)
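
The baseline is trivial to state in code; this sketch assumes the document arrives as a list of paragraphs, each a list of sentences.

```python
def baseline_extract(paragraphs):
    """Baseline: the first and last sentence of each paragraph.

    `paragraphs` is assumed to be a list of sentence lists.
    """
    extract = []
    for sentences in paragraphs:
        if not sentences:
            continue
        extract.append(sentences[0])
        if len(sentences) > 1:  # avoid duplicating one-sentence paragraphs
            extract.append(sentences[-1])
    return extract
```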

27. The upper and lower limit

28. Term-based summarisation

  • One of the most popular summarisation methods
  • It is rarely used on its own
  • Assumes that the importance of a sentence can be determined on the basis of the importance of words it contains
  • Various methods can be used to determine the importance of words

29. Term-frequency

  • The importance of a word is determined by how frequent it is
  • Not very good for very frequent words such as articles and prepositions
  • A stop list can be used to filter out such words (a sketch follows)
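
A minimal sketch of term-frequency sentence scoring with a stop list. The tokeniser, the tiny stop list, and the summed-frequency sentence score are assumptions chosen for illustration; the talk does not fix these details.

```python
import string
from collections import Counter

# A deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and",
              "is", "are", "was", "it"}

def tokenise(sentence):
    """Lowercase and strip punctuation: a deliberately simple tokeniser."""
    return [w.strip(string.punctuation) for w in sentence.lower().split()]

def score_sentences(sentences):
    """Score each sentence by the summed frequency of its content words."""
    freq = Counter(w for s in sentences for w in tokenise(s)
                   if w not in STOP_WORDS)
    return [(sum(freq[w] for w in tokenise(s) if w not in STOP_WORDS), s)
            for s in sentences]

sentences = ["The police arrested the suspect.",
             "The suspect was charged.",
             "It rained."]
for score, s in sorted(score_sentences(sentences), reverse=True):
    print(score, s)
# 4 The police arrested the suspect.
# 3 The suspect was charged.
# 1 It rained.
```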