
ICT619 Intelligent Systems Topic 9: Natural Language Processing and Language Technology


What is natural language processing (NLP)?

  • An ideal goal for human-computer communication is the ability to communicate in a natural language
  • NLP grew as a sub-domain of AI and linguistics: the task of developing software capable of understanding information (commands, text) expressed in a natural language in order to achieve specific goals
  • Understanding natural languages is a challenging task for computers
    • Due to ambiguities, frequent use of context and the overall knowledge acquisition and use problem


Speech (voice) recognition and natural language processing

  • Speech recognition concerns understanding spoken commands or sentences from voice inputs
    • Example: Telstra’s directory assistance
  • A speech recognition system must first extract and recognise words from audio input
  • We might also like the system to be able to answer in speech; this requires speech generation as well
  • In NLP, input is already available in machine-readable form (eg words as Unicode text)
  • Future improvements in speech recognition will to some extent depend on progress in NLP


Speech Recognition – The state-of-the-art

  • 60-90% accuracy: good enough for general dictation
  • Speaker dependent: needs training
  • Cheap desktop software available
    • Example: IBM ViaVoice, Dragon NaturallySpeaking
  • Issues:
    • Isolated vs. continuous speech
    • Vocabulary size
    • Better speaker independence


Language Technology

  • Covers all areas related to NLP with a practical focus
  • Language technology is defined as: the application of knowledge about human language in computer-based solutions
  • Applications covered by language technology include:
    • Spoken language dialogue systems (speech recognition, some understanding, and speech generation)
    • Machine translation
    • Text summarisation
    • Information retrieval


Language Technology (cont’d)

  • The input to a language technology system may be provided through
    • speech recognition
    • optical character recognition (OCR)
    • handwriting recognition
  • and the output may be in the form of speech, tailored documents, or web pages.


Approaches to natural language processing

  • Main approaches
    • Keyword searching
    • Linguistic analysis
    • AI-based
    • ANN-based
    • Statistical analysis
  • Keyword searching systems
    • Early NLP systems, and some in use today, are based on keyword searching (pattern matching)


Keyword searching NLP systems

  • Selected keywords or phrases are searched for in the input sentence
  • The program responds with specific pre-stored responses based on the keywords or phrases
  • The program may actually construct a response based on a partial reply coupled with keywords and phrases from the input
  • No real understanding of the input is involved


Keyword searching NLP systems (cont’d)

  • The best-known example: the ELIZA program from MIT, mid-1960s
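
The keyword-searching idea can be illustrated with a minimal Python sketch. The rule table, the response templates and the function name `respond` are hypothetical illustrations and are not taken from ELIZA's actual script; the sketch only shows pre-stored responses, a partial reply completed with a phrase from the input, and a fallback when no keyword matches.

```python
import re

# Hypothetical keyword table: (pattern, response template).
# These rules are illustrative only; they are not ELIZA's real script.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\b(?:mother|father)\b", re.IGNORECASE), "Tell me more about your family."),
    (re.compile(r"\bcomputer", re.IGNORECASE), "Do computers worry you?"),
]
DEFAULT = "Please go on."  # used when no keyword is found


def respond(sentence: str) -> str:
    """Return a canned or partially constructed reply; no understanding is involved."""
    for pattern, template in RULES:
        match = pattern.search(sentence)
        if match:
            # Reuse a phrase from the input when the rule captured one.
            fragment = match.group(1).rstrip(".!?") if match.groups() else ""
            return template.format(fragment)
    return DEFAULT


if __name__ == "__main__":
    print(respond("I am sad about my job."))
    print(respond("Nice weather today"))
```

Anything outside the rule table falls through to the default reply, which is the inflexibility that limits keyword systems.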


Keyword systems

  • Limitations
    • Inflexible: really just reactive responses
    • Unable to cope with anything not in their keyword look-up tables
    • No knowledge modelling
  • Today’s more sophisticated NLP systems
    • Try to understand the content of language by doing syntactical, semantic and pragmatic analyses
    • May be able to do some conceptual modelling
    • Better able to maintain continuous dialogues
    • Attempt to cope with the ambiguity and other features common in natural language


Other approaches to NLP

  • Linguistic analysis approach
    • Based on encoding formal grammar rules for sentence-level processing
    • A linguistically-oriented system focuses on the syntax and semantics
  • AI-based systems
    • Focus on using world knowledge to understand language
    • One example of an AI-based NLP system is BORIS
      • written by Michael Dyer, a student of Roger Schank's
      • a story understanding program that reads a narrative and answers questions about it


AI-based NLP example - BORIS

Richard hadn’t heard from his college roommate Paul for years. Richard had borrowed money from Paul which was never paid back. But now he had no idea where to find his old friend. When a letter finally arrived from San Francisco, Richard was anxious to find out how Paul was.

Q: What happened to Richard at home?
BORIS: Richard got a letter from Paul.

Q: Who is Paul?
BORIS: Richard’s friend.

Q: Did Richard want to see Paul?
BORIS: Yes, Richard wanted to know how Paul was.

Q: Had Paul helped Richard?
BORIS: Yes, Paul lent money to Richard.

The BORIS system (from Roger Schank and Peter Childers, The Cognitive Computer).


Artificial neural networks based NLP

  • ANN-based systems
    • Use ANNs for processing language, particularly for lexical disambiguation
    • A neural net is trained to disambiguate by using context
    • Training presents units of 6 or so words containing the target word to be learned
  • Example: disambiguation of the word “bank” in “We got a bank loan to buy a house”
    • Two possible senses: money sense, river sense
    • Groups of co-occurring words (neighbourhoods):
      • Money sense: bank money loan branch fee robbery
      • River sense: bank river bridge erosion earth slope
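
The role of the neighbourhoods can be shown with a small Python sketch. It is not a neural network: it simply counts how many context words fall into each neighbourhood, as a stand-in for the context signal a trained net would learn. The two word sets come from the example above; the function name `disambiguate` and the overlap-count scoring are assumptions made for illustration.

```python
# Neighbourhoods taken from the example above.
NEIGHBOURHOODS = {
    "money": {"money", "loan", "branch", "fee", "robbery"},
    "river": {"river", "bridge", "erosion", "earth", "slope"},
}


def disambiguate(sentence, target="bank"):
    """Pick the sense whose neighbourhood shares most words with the context.

    Simple overlap counting (an assumption, not an ANN). Returns None when
    no context word appears in any neighbourhood.
    """
    words = {w.strip(".,!?\"'").lower() for w in sentence.split()}
    words.discard(target)
    scores = {sense: len(words & hood) for sense, hood in NEIGHBOURHOODS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate("We got a bank loan to buy a house"))
```

A trained network would weight context words rather than count them equally, but the input (a short window of words around the target) is the same.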


Statistical approach to NLP

  • Based on extracting statistically significant information (tags) from large corpora or bodies of text (millions of words) and using these as very general indexes to model parts or responses
  • Valuable because it does not require as much hand-modelling of knowledge, but acquires the tags automatically
  • Statistical methods are now receiving much attention, and more systems are likely to incorporate them in future.
  • Most NLP systems use a combination of the linguistic and AI approaches

Linguistic approach


Components of NLP systems

  • Five major elements: the parser, the lexicon, the semantic analyser, the knowledge base, and the generator


Components of NLP systems (cont’d)

  • A syntactical parser analyses the input sentence using the language's grammar or rules of syntax
  • The output produced is a structural description of the sentence, known as a parse tree
  • Some rules of syntax for English:
    • S = NP + VP
      S: sentence, NP: noun phrase, VP: predicate or verb phrase
    • The noun phrase can be more than a single noun:
      NP = D + ADJ + N
      D: determiner (eg, “a”, “this”), ADJ: adjective, N: main noun


Components of NLP systems (cont.)

  • The lexicon
    • An internal dictionary used to perform the syntactic and semantic analysis
    • Contains semantic and grammatical information (eg, part-of-speech) about words or word strings

Fig. An example parse tree for the sentence “Mary had a little lamb”
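
How a parser and a lexicon cooperate can be sketched in a few lines of Python. The rules S = NP + VP and NP = D + ADJ + N come from the slides; the extra rules NP = N and VP = V + NP, the five-word lexicon and all function names are assumptions added so that “Mary had a little lamb” can be parsed. The tree is represented as nested tuples.

```python
# Toy lexicon: word -> part of speech (illustrative assumption).
LEXICON = {"mary": "N", "had": "V", "a": "D", "little": "ADJ", "lamb": "N"}


def parse_np(tokens, i):
    """Try NP = D + ADJ + N, then the assumed rule NP = N, starting at token i."""
    tags = [LEXICON.get(t) for t in tokens[i:i + 3]]
    if tags == ["D", "ADJ", "N"]:
        tree = ("NP", ("D", tokens[i]), ("ADJ", tokens[i + 1]), ("N", tokens[i + 2]))
        return tree, i + 3
    if tags[:1] == ["N"]:
        return ("NP", ("N", tokens[i])), i + 1
    return None, i


def parse_vp(tokens, i):
    """Assumed rule VP = V + NP."""
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        np, j = parse_np(tokens, i + 1)
        if np is not None:
            return ("VP", ("V", tokens[i]), np), j
    return None, i


def parse_sentence(sentence):
    """S = NP + VP; returns a parse tree, or None if the rules do not apply."""
    tokens = sentence.lower().rstrip(".").split()
    np, i = parse_np(tokens, 0)
    if np is None:
        return None
    vp, j = parse_vp(tokens, i)
    if vp is None or j != len(tokens):
        return None
    return ("S", np, vp)


if __name__ == "__main__":
    print(parse_sentence("Mary had a little lamb"))
```

A realistic grammar has many more rules and usually yields several candidate trees; this sketch accepts exactly one structure.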


The semantic analyser and the knowledge base

  • The semantic analyser uses the parse tree and the knowledge base to try to determine what the sentence means
  • It creates another data structure that represents the meaning of the input sentences
  • It can also draw inferences from input statements using general knowledge in the KB
  • The semantic analyser's data structure and those in the KB should be in a common knowledge representation, such as KQML or Conceptual Graphs


The Generator

  • The generator uses the KB data structure created by the semantic analyser to create a usable output
  • The response depends in part on the pragmatics of the input language, eg greetings require greetings, questions require answers, commands require actions
  • The data structure can be used to initiate some action
    • eg the language system is a front-end to a DBMS: the generator writes commands in a query language to begin a search
  • Simple generators feed standard pre-stored output responses to the user based on the built meaning representation
  • More sophisticated generators construct an original response by instantiating templates based on models of language use
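
The two generator roles described above (writing a query for a DBMS and instantiating a response template) might look roughly like this in Python. The shape of the meaning structure, the template table and the function names `to_query` and `generate_reply` are all hypothetical; a real system would derive them from its knowledge representation.

```python
def to_query(meaning):
    """Turn a (hypothetical) meaning structure into a parameterised SQL query.

    Table and column names are assumed to come from a trusted lexicon;
    only the user-supplied value is passed as a parameter.
    """
    sql = (
        f"SELECT {meaning['attribute']} FROM {meaning['entity']} "
        f"WHERE {meaning['filter_column']} = ?"
    )
    return sql, (meaning["filter_value"],)


# Response templates keyed by the pragmatic category of the input (illustrative).
TEMPLATES = {
    "greeting": "Hello, {user}.",
    "answer": "The {attribute} of {subject} is {value}.",
}


def generate_reply(speech_act, **slots):
    """Instantiate the template that matches the kind of input received."""
    return TEMPLATES[speech_act].format(**slots)


if __name__ == "__main__":
    meaning = {
        "attribute": "salary",
        "entity": "employees",
        "filter_column": "name",
        "filter_value": "Smith",
    }
    print(to_query(meaning))
    print(generate_reply("answer", attribute="salary", subject="Smith", value="50000"))
```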


Applications of NLP - Natural language interfaces (NLI)

  • An NLP system can be the front-end of information systems to provide a more user-friendly interface
  • Eg, the command “List details of all files in this folder sorted by time of creation” is much friendlier than “ls -atl”, especially using voice input
  • An NLI processes sentences exchanged between a user and an application
  • The task is made easier by a restricted domain of discourse, eg, databases, whose highly restricted domain contains information on a single area of application
  • Interfaces for expert systems, operating systems and document retrieval systems are also being developed

Homer: A Language-using Agent

Source: Vere, S. & Bickmore, T. A Basic Agent. Computational Intelligence, 1990, 6, 4, 41-60.


Natural Language - Homer

STEVE> What is in front of you?
HOMER> A log.
STEVE> Do you own the log?
HOMER> No I don’t.
STEVE> The log belongs to you.
HOMER> Oh.
STEVE> Cows eat grass.
HOMER> I know.
STEVE> Do you own the log now?
HOMER> Yes I do.

TIM> Drop the package at the barge next Saturday at 9pm.
HOMER> OK.
TIM> Are you going to be at the pier next Saturday?
HOMER> Yes.
TIM> Why are you going to the pier?
HOMER> I’ll go to the pier in order to pick up the package.
TIM> What time are you going to pick it up?
HOMER> 8:56pm.
STEVE> Where will the package be next Saturday?
HOMER> At the barge.


Examples of commercial NL: Intellect

  • Intellect (Trinzic Corp.)
    • One of the most widely used natural language front-end interfaces available for mainframes
    • Designed for use with DBMS under IBM operating systems environments
    • In addition to allowing access to data in a database, Intellect allows creation of databases using natural language
    • The built-in lexicon may be modified to fit a particular application


Q&A (Symantec Corp.)

  • A basic file manager with a natural language front-end called “The Intelligent Assistant”
  • Parses common English input questions and converts them into queries that the file manager can understand
  • Paraphrases input requests to ensure full understanding of what the user wants
  • Eg, user input:
      Show the total 1992 sales for the Central Region
    Q&A Intelligent Assistant’s response:
      Shall I do the following?
      Create a report showing the amount of sales for the central region in 1992?
      Y(es) – Continue    N(o) – Cancel request
  • Symantec discontinued Q&A and then sold it to a German company called CAB GmbH.


Machine translation

  • Goal: to support translation of some language into a language other than the original
  • Applications include:
    • Desktop and web-based translation services
    • Spoken language translation services (eg phone-based)
  • Requirements:
    • Understanding the meaning of input sentences
    • This would involve a semantic analysis of the input using semantic knowledge
    • An automatic translation system is expected to be robust and not stop whenever it encounters an item it cannot understand


Machine translation (cont’d)

  • Current approaches use a transfer grammar
    • Input text → partial analysis → 1st intermediate representation of content (related to the source language)
    • 1st intermediate representation → transformation using a transfer grammar → 2nd intermediate representation (related to the target language)
    • 2nd intermediate representation → NL generator → text in target language
  • Machine translation as performed since the mid-1960s is not true “understanding” of text
  • By 1991, systems that could process sentences with limited vocabulary started appearing
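
The three transfer stages can be sketched as a toy Python pipeline. Everything concrete here is an assumption for illustration: the three-word English-to-French lexicon, the single reordering rule (determiner-adjective-noun becomes determiner-noun-adjective) and the function names. Unknown words are passed through untranslated, in line with the robustness requirement that the system should not stop on items it cannot handle.

```python
# Toy lexicons (illustrative assumptions, not a real MT resource).
SOURCE_LEXICON = {"the": "DET", "red": "ADJ", "car": "N"}
TRANSFER_LEXICON = {"the": "la", "red": "rouge", "car": "voiture"}


def analyse(text):
    """Partial analysis: tag each source word (1st intermediate representation)."""
    return [(SOURCE_LEXICON.get(w, "UNK"), w) for w in text.lower().split()]


def transfer(source_repr):
    """Apply the transfer rules to obtain the 2nd intermediate representation."""
    # Lexical transfer; unknown words are kept as they are.
    target = [(tag, TRANSFER_LEXICON.get(word, word)) for tag, word in source_repr]
    # Structural transfer: DET ADJ N -> DET N ADJ.
    if [tag for tag, _ in target] == ["DET", "ADJ", "N"]:
        target = [target[0], target[2], target[1]]
    return target


def generate_text(target_repr):
    """NL generator: linearise the target representation."""
    return " ".join(word for _, word in target_repr)


def translate(text):
    return generate_text(transfer(analyse(text)))


if __name__ == "__main__":
    print(translate("the red car"))
```

No stage builds a meaning representation, which matches the slide's point that this style of translation is not true understanding of the text.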


Current state-of-the-art of machine translation

  • Broad coverage MT systems already available on the Web with fast turnaround time and acceptable error rate
  • Higher accuracy achieved by domain-specific systems
    • For example, controlled language used in Caterpillar manuals
  • Machine translation products
    • Bowne Global Solution’s iTranslator: www.itranslator.com
    • Systran’s Babel Fish (used by AltaVista): www.systransoft.com


Current state-of-the-art of machine translation (cont’d)

  • An example: Systran’s Web-based Translator


Spoken language dialogue systems

  • Communicate with users via automatic speech recognition and text-to-speech interfaces
  • Mediate the user’s access to a back-end database
  • Examples:
    • Information services: stock quotes, timetables
    • Transaction services: banking, betting, flight reservations
  • Current technology has been claimed to be capable of reducing call centre costs from $75 to 18c a call
  • Some issues:
    • Telephony-based systems cannot afford a training period
    • Making a conversation too realistic falsely raises user expectations and can confuse the system


Spoken language dialog systems (cont’d)

  • More issues:
    • Error handling is a significant issue
    • Giving initiative to the user increases difficulty
  • Some relatively successful examples:
    • A Sydney taxi booking service (about 30% of cases have to go to human operators).
    • Telstra directory assistance service (15-20% accuracy but 15-20% of automation may be useful enough)
  • Spoken language dialog systems fielded applications:
    • Nuance (www.nuance.com)
    • ScanSoft/SpeechWorks (www.scansoft.com)
    • Philips (www.speech.philips.com)


Text processing

  • A number of different applications dealing with the processing of continuous text may be grouped together under this heading
  • Editing tools
    • Most common example: spelling and syntax (or grammar) checkers
    • Characterised by avoidance of deep semantic processing
  • Content extraction
    • Concerns extraction of specific information from texts
    • Examples: extraction of information related to a financial transaction from a bank telex, or of bibliographic information from research papers


Text processing (cont’d)

  • Content extraction (cont’d)
    • Requires deep semantic analysis which is aided by the restricted domain and a priori knowledge of the information to be extracted
    • Commercial systems exist for electronic mail processing, banking systems and automatic summary generation
    • Examples:
      • ATRANS from Cognitive Systems
      • DEAL-READER from Gecosys


Text processing (cont.)

  • Text summarisation
    • Objective: to produce a version of a document shorter than the original document
    • Applications of text summarisation are found in
      • Information browsing
      • Voice delivery of Web pages and email
  • Issues concerning text summarisation
    • Different kinds of summaries: indicative (what is it about?) vs informative (what is there of interest to the user?)
    • Real summarisation requires real understanding


Text summarisation state-of-the-art

  • Commercial systems work on a ‘sentence-extraction’ model
    • Sentences regarded as ‘important’ are extracted and put together
    • Importance of sentences decided on the basis of location, inclusion of key words, statistical information such as frequency
  • Current systems are relatively knowledge-free
    • Not based on real understanding of the text
  • Some text summarisation applications currently available:
    • CognIT’s CORPORUM (www.cognit.com)
    • INXight’s Summarizer (www.inxight.com)
    • MS Word’s summarisation tool
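
The sentence-extraction model can be illustrated with a short Python sketch. It scores each sentence by the average corpus frequency of its content words plus a bonus for being the first sentence (a crude location cue), then returns the top sentences in their original order. The stop-word list, the size of the bonus and the function name `summarise` are assumptions; commercial tools use their own, undisclosed scoring.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOPWORDS = {"a", "an", "the", "is", "was", "of", "for", "in", "to", "and", "between"}


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarise(text, n_sentences=2):
    """Extract the n highest-scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))

    def score(index):
        tokens = content_words(sentences[index])
        frequency_score = sum(freq[w] for w in tokens) / (len(tokens) or 1)
        location_bonus = 1.0 if index == 0 else 0.0  # favour the opening sentence
        return frequency_score + location_bonus

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    doc = ("Machine translation converts text between languages. "
           "The weather was nice. "
           "Translation systems use a transfer grammar for translation.")
    print(summarise(doc, 2))
```

The scoring never looks at meaning, so the sketch is knowledge-free in the same sense as the systems described above.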


Search and Information Retrieval

  • Ever increasing amount of information available worldwide, particularly on the Internet
  • Searching for and retrieving information relevant to a topic of interest is an active area of research and application.
  • Document retrieval (DR)
    • Also known as text retrieval
    • Involves retrieving text ranging from paragraph to book length for humans to read
    • DR may involve
      • searching well-maintained bibliographic databases
      • scanning hard disks for missing files
      • searching thousands of Web servers for natural language articles on a topic of interest


Search and Information Retrieval (cont’d)

  • Efficacy of a DR system is measured by
    • Precision: proportion of retrieved documents that are relevant, and
    • Recall: proportion of relevant documents that are retrieved
  • Retrieval depends on indexing: indicating what documents are about
  • Indexing requires an indexing language, a term vocabulary, and a method for constructing requests and document descriptions
  • Both controlled language indexing and the more sophisticated natural language indexing require NLP capabilities
  • Compact descriptions of a document’s significance may increase the efficiency of matching
  • Increasing both recall and precision is the fundamental goal of index languages
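
The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The following Python sketch uses hypothetical document identifiers; returning 0.0 when a set is empty is a convention chosen here to avoid division by zero, not something prescribed by the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Four documents retrieved, three relevant overall, two in common.
    p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving every document drives recall to 1 while precision collapses, which is why both measures are needed.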


Search and Information Retrieval Search and Information Retrieval (cont’d)(cont’d)

Current topics of interest in search and information retrieval include:

In a concept-based search, documents are characterised by relevant concepts and not just key words

For example, a search for ‘car’ should also retrieve documents on ‘automobiles’

Named entity recognition involves recognising names of people, places, organisations etc.

One person or organisation can be referred to by many name variants - e.g., John Howard, Mr. Howard, J.W. Howard, the PM

Many persons or organisations can share the same name - e.g., politician John Howard, actor John Howard
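The two named entity problems above (many variants for one entity, one name for many entities) can be pictured as lookups in an alias table. This is only a sketch: the table contents and the entity identifiers are hypothetical, and a real system would need context to choose among the candidates.

```python
# Hypothetical alias table: lower-cased surface form -> candidate entity ids.
ALIASES = {
    "john howard": ["john_howard_politician", "john_howard_actor"],
    "mr. howard": ["john_howard_politician", "john_howard_actor"],
    "j.w. howard": ["john_howard_politician"],
    "the pm": ["john_howard_politician"],
}


def candidate_entities(mention):
    """Entities that a surface form might refer to (shared-name problem)."""
    return ALIASES.get(mention.strip().lower(), [])


def variants_of(entity_id):
    """Surface forms that may refer to one entity (name-variant problem)."""
    return sorted(name for name, ids in ALIASES.items() if entity_id in ids)


print(candidate_entities("John Howard"))
print(variants_of("john_howard_politician"))
```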

Search and Information Retrieval (cont’d)

Search and Information Retrieval State-of-the-art

Current trend (e.g. Google) is to expand the search vocabulary by using thesauri (e.g., ‘car’ and ‘automobile’)

Linguistic analysis to identify phrases relevant to the initial query

Key phrases can be more useful than just key words

Can be used to expand an initial user query (Khan & Khor 2004)
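Thesaurus-based expansion of a query can be sketched as follows. The tiny thesaurus and the function name are assumptions made for illustration; this is not the method of any particular search engine, nor of Khan & Khor (2004).

```python
# Hypothetical thesaurus mapping a term to related terms.
THESAURUS = {
    "car": ["automobile"],
    "automobile": ["car"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the query terms plus their thesaurus entries, without repeats."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_terms = expand_query("used car prices")
print(expanded_terms)
```

With this expansion, a search for ‘car’ would also match documents that only mention ‘automobile’.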

Some current search and information retrieval applications:
Ultra Find: www.ultradesign.com/untrafind/ultrafind.html
Lotus Discovery Server: www.lotus.com/products/discserver.nsf

Smart text processing suites:
Inxight: www.inxight.com
Verity: wwwl.verity.com

Challenges faced by NLP

A good NLP system must be capable of handling common linguistic problems caused by ambiguities and the use of context

Prepositional phrase attachment

A sentence can often be analysed in more than one way, producing multiple parse trees for the sentence.

Example sentence: “John saw the boy in the park with a telescope” has 3 possible parses

Without contextual knowledge, it is not known whether John was looking through the telescope, the boy had a telescope, or the park had a telescope in it.
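The three readings correspond to three places where the phrase “with a telescope” can attach. The sketch below only enumerates those candidate sites; it is a deliberate simplification that keeps “in the park” fixed and does not build parse trees, and the function name is an assumption.

```python
def attachment_sites(verb, noun_heads):
    """Candidate heads that a trailing prepositional phrase could modify:
    the main verb, or any noun phrase head that precedes the phrase."""
    return [verb] + list(noun_heads)


# "John saw the boy in the park with a telescope"
sites = attachment_sites("saw", ["boy", "park"])
for site in sites:
    print(f"'with a telescope' modifies '{site}'")
```

Attachment to “saw” means John used the telescope, attachment to “boy” means the boy had it, and attachment to “park” means it was in the park.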

Challenges faced by NLP (cont’d)

Lexical ambiguity

When words have multiple meanings

A classic example:
Time flies like an arrow.
Fruit flies like a banana.

In the first case, “flies” is a verb and “like” is a preposition

In the second case, “flies” is a noun and “like” is a verb.
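One way to see why such sentences are hard is to list every part-of-speech assignment that a lexicon allows. The lexicon and tag names below are assumptions chosen for this sketch; a real tagger would use context to pick one sequence.

```python
from itertools import product

# Hypothetical lexicon: each word with the tags it may take.
LEXICON = {
    "time": ["NOUN", "VERB"],
    "fruit": ["NOUN"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "PREP"],
    "an": ["DET"],
    "a": ["DET"],
    "arrow": ["NOUN"],
    "banana": ["NOUN"],
}


def tag_sequences(sentence):
    """Return every tagging of the sentence that the lexicon permits."""
    words = sentence.lower().rstrip(".").split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


arrow_taggings = tag_sequences("Time flies like an arrow.")
banana_taggings = tag_sequences("Fruit flies like a banana.")
print(len(arrow_taggings), len(banana_taggings))
```

Even with this small lexicon the first sentence has 8 candidate taggings and the second has 4, only one of which is the intended reading in each case.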

Challenges faced by NLP (cont’d)

Anaphoric reference or pronoun resolution

Problem of figuring out what a pronoun refers to

Example:
(1) Give me the names of all managers and how much they earn.
(2) Mary went to see Jane. She was happy to see her.

In (1), easy to decide that “they” refers to the managers already mentioned

In (2), difficult to decide who “she” and “her” refer to - was Mary happy to see Jane, or was Jane happy to see Mary?
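The difference between the two examples can be illustrated by filtering candidate antecedents on simple agreement features. The feature labels are hand-assigned assumptions, and the "animate" requirement for the subject of "earn" is supplied by hand; this is a sketch of the idea, not a full resolution algorithm.

```python
# Hypothetical agreement features for a few pronouns.
PRONOUN_FEATURES = {
    "they": {"number": "plural"},
    "she": {"number": "singular", "gender": "female"},
    "her": {"number": "singular", "gender": "female"},
}


def candidate_antecedents(pronoun, mentions, extra=None):
    """Return mentions whose features match the pronoun (plus extra constraints)."""
    required = dict(PRONOUN_FEATURES[pronoun.lower()])
    required.update(extra or {})
    return [m["text"] for m in mentions
            if all(m.get(key) == value for key, value in required.items())]


# Hand-labelled mentions for sentences (1) and (2).
sentence_1 = [
    {"text": "names", "number": "plural", "animate": False},
    {"text": "managers", "number": "plural", "animate": True},
]
sentence_2 = [
    {"text": "Mary", "number": "singular", "gender": "female", "animate": True},
    {"text": "Jane", "number": "singular", "gender": "female", "animate": True},
]

# Assumption: the subject of "earn" must be animate.
they = candidate_antecedents("they", sentence_1, extra={"animate": True})
she = candidate_antecedents("she", sentence_2)
print(they, she)
```

In (1) a single candidate survives, while in (2) both Mary and Jane remain, so agreement alone cannot settle the reference.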

Challenges faced by NLP (cont’d)

Ellipsis

Sentences appearing to have parts missing

Example: John works in Personnel, Mary in Accounting.

“Mary in accounting” lacks a verb but is understandable using context of entire sentence

“Mary in accounting” is an elliptical form of “Mary works in accounting”.
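A very rough way to recover the missing part is to copy the verb of the first clause into the verbless second clause. The sketch below assumes a comma separates the clauses, that the first word of the fragment is its subject, and that a set of known verbs is supplied by hand; all of these are simplifying assumptions.

```python
def resolve_ellipsis(sentence, known_verbs):
    """Expand a verbless second clause using the verb of the first clause."""
    text = sentence.strip().rstrip(".")
    if "," not in text:
        return text
    full, fragment = [part.strip() for part in text.split(",", 1)]
    verb = next((w for w in full.split() if w.lower() in known_verbs), None)
    fragment_words = fragment.split()
    has_verb = any(w.lower() in known_verbs for w in fragment_words)
    if verb is None or has_verb or len(fragment_words) < 2:
        return fragment  # nothing to fill in
    # Assume the first word of the fragment is its subject.
    return " ".join([fragment_words[0], verb] + fragment_words[1:])


expanded = resolve_ellipsis("John works in Personnel, Mary in Accounting.", {"works"})
print(expanded)  # Mary works in Accounting
```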

Challenges faced by NLP (cont’d)

Quantifier scope

Quantifiers such as “all”, “every”, “some”, and “no” can be ambiguous

Example: Every employee does not like Mr Smith

Meaning - not a single employee likes Mr Smith
or - some do and some don’t.
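The two readings differ in whether the negation sits inside or outside the scope of “every”. The sketch below evaluates both over a toy model; the employee names and their preferences are hypothetical data. Strictly, the second reading only says that at least one employee does not like Mr Smith; that some others do is usually inferred.

```python
# Toy model (hypothetical data): does each employee like Mr Smith?
likes_smith = {"Ann": True, "Bob": False, "Cho": False}

# Reading 1: for every employee x, x does not like Mr Smith.
nobody_likes = all(not likes for likes in likes_smith.values())

# Reading 2: it is not the case that every employee likes Mr Smith.
not_everybody_likes = not all(likes_smith.values())

print(nobody_likes, not_everybody_likes)  # False True
```

Because Ann likes Mr Smith in this model, the sentence is false under the first reading and true under the second, which is why a system must choose a scope before it can answer.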

No current NLP system can handle all of these problems - no unrestricted NLP system yet

Yet some such as HOMER can handle the most common forms

REFERENCES

Germain, E., Introducing Natural Language Processing, AI Expert, August 1992, pp. 30-35.

Lewis, D.D., and Jones, K.S., Natural Language Processing for Information Retrieval, Communications of the ACM, Vol. 39, No. 1 (January 1996), pp. 92-100.

Turban, E., Decision Support and Expert Systems, Prentice Hall, Englewood Cliffs, New Jersey, 1995, pp. 242-257.

Thayse, A. (Editor), From Natural Language Processing to Logic for Expert Systems, John Wiley & Sons, 1991.

Cole, R., Zaenen A., & Zampolli (eds), Survey of the State of the Art in Human Language Technology, Cambridge University Press, 1998. Available on the web: http://cslu.cse.ogi.edu/HLTsurvey/

Dale, R., Language Technology: Applications and Techniques, Tutorial 2004, The 8th Pacific Rim Int. Conf. on Artificial Intelligence, Auckland, 9-13 August, 2004.

Khan, M.S., and Khor, S., “Automatic Query Expansion for Enhanced Web Document Retrieval”, Journal of the American Society for Information Science and Technology, Vol. 55, No. 1, 2004, pp. 29-40.