77
2006.04.25 - SLIDE 1 IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday 10:30 am - 12:00 pm Spring 2006 http://www.sims.berkeley.edu/academics/courses/ is240/s06/ Principles of Information Retrieval Lecture 27: NLP for IR

2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

  • View
    216

  • Download
    2

Embed Size (px)

Citation preview

Page 1: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 1IS 240 – Spring 2006

Prof. Ray Larson

University of California, Berkeley

School of Information Management & Systems

Tuesday and Thursday 10:30 am - 12:00 pm

Spring 2006http://www.sims.berkeley.edu/academics/courses/is240/s06/

Principles of Information Retrieval

Lecture 27: NLP for IR

Page 2: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 2IS 240 – Spring 2006

Mini-TREC

• Proposed Schedule– February 14-16 – Database and previous

Queries– March 2 – report on system acquisition and

setup– March 2, New Queries for testing…– April 20, Results due– April 25, Results and system rankings (sort of)– May 9, Group reports and discussion

Page 3: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 3IS 240 – Spring 2006

Results (with bad runs)

Page 4: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 4IS 240 – Spring 2006

Compared to others…

Page 5: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 5IS 240 – Spring 2006

However…

• Errors in result creation means that results are not really comparable yet…

• For those who didn’t…You are welcome to submit runs with:– 1000 results per query– All data in appropriate locations in result set

Page 6: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 6IS 240 – Spring 2006

Today

• Review– Web Search Processing– Parallel Architectures (Inktomi - Brewer)– Cheshire III Design – GRID-based DLs

• NLP for IR

• Text Summarization

Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

Page 7: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 7IS 240 – Spring 2006

Google

• Google maintains (probably) the worlds largest Linux cluster (over 15,000 servers)

• These are partitioned between index servers and page servers– Index servers resolve the queries (massively

parallel processing)– Page servers deliver the results of the queries

• Over 8 Billion web pages are indexed and served by Google

Page 8: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 8IS 240 – Spring 2006

Ranking: Link Analysis

• Assumptions:– If the pages pointing to this page are good,

then this is also a good page– The words on the links pointing to this page

are useful indicators of what this page is about

– References: Page et al. 98, Kleinberg 98

Page 9: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 9IS 240 – Spring 2006

Ranking: PageRank

• Google uses the PageRank• We assume page A has pages T1...Tn which

point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. d is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

• PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

• Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one

Page 10: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 10IS 240 – Spring 2006

PageRank

T2Pr=1Pr=1

T1Pr=.725Pr=.725

T6Pr=1Pr=1

T5Pr=1Pr=1

T4Pr=1Pr=1

T3Pr=1Pr=1

T7Pr=1Pr=1

T8T8Pr=2.46625Pr=2.46625

X1 X2

APr=4.2544375Pr=4.2544375

Note: these are not real PageRanks, since they include values >= 1

Page 11: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 11IS 240 – Spring 2006

Page 12: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 12IS 240 – Spring 2006

Page 13: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 13IS 240 – Spring 2006

Page 14: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 14IS 240 – Spring 2006

Page 15: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 15IS 240 – Spring 2006

Digital Library Grid Initiatives:Cheshire3 and the Grid

Ray R. LarsonUniversity of California, Berkeley

School of Information Management and Systems

Rob SandersonUniversity of Liverpool

Dept. of Computer Science

Thanks to Dr. Eric Yen and Prof. Michael Buckland for parts of this presentation

Presentation from DLF Forum April 2005

Page 16: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 16IS 240 – Spring 2006

Overview

• The Grid, Text Mining and Digital Libraries– Grid Architecture– Grid IR Issues

• Cheshire3: Bringing Search to Grid-Based Digital Libraries– Overview– Grid Experiments– Cheshire3 Architecture– Distributed Workflows

Page 17: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 17IS 240 – Spring 2006

Grid

mid

dlew

are

Chem

i cal

Eng i

neer

i ng

Applications

ApplicationToolkits

GridServices

GridFabric

Clim

ate

Data

Grid

Rem

ote

Com

putin

g

Rem

ote

Visu

aliza

tion

Colla

bora

torie

s

High

ene

rgy

phy

sics

Cosm

olog

y

Astro

phys

ics

Com

bust

ion

.….

Porta

ls

Rem

ote

sens

ors

..…Protocols, authentication, policy, instrumentation,Resource management, discovery, events, etc.

Storage, networks, computers, display devices, etc.and their associated local services

Grid Architecture -- (Dr. Eric Yen, Academia Sinica,

Taiwan.)

Page 18: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 18IS 240 – Spring 2006

Chem

i cal

Eng i

neer

i ng

Applications

ApplicationToolkits

GridServices

GridFabric

Grid

mid

dlew

are

Clim

ate

Data

Grid

Rem

ote

Com

putin

g

Rem

ote

Visu

aliza

tion

Colla

bora

torie

s

High

ene

rgy

phy

sics

Cosm

olog

y

Astro

phys

ics

Com

bust

ion

Hum

anitie

sco

mpu

ting

Digi

tal

Libr

arie

s

Porta

ls

Rem

ote

sens

ors

Text

Min

ing

Met

adat

am

anag

emen

t

Sear

ch &

Retri

eval …

Protocols, authentication, policy, instrumentation,Resource management, discovery, events, etc.

Storage, networks, computers, display devices, etc.and their associated local services

Grid Architecture (ECAI/AS Grid Digital Library

Workshop)

Bio-

Med

ical

Page 19: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 19IS 240 – Spring 2006

Grid-Based Digital Libraries

• Large-scale distributed storage requirements and technologies

• Organizing distributed digital collections• Shared Metadata – standards and

requirements• Managing distributed digital collections• Security and access control• Collection Replication and backup• Distributed Information Retrieval issues

and algorithms

Page 20: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 20IS 240 – Spring 2006

Grid IR Issues

• Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (I.e. speed)

• Very large-scale distribution of resources is a challenge for sub-second retrieval

• Different from most other typical Grid processes, IR is potentially less computing intensive and more data intensive

• In many ways Grid IR replicates the process (and problems) of metasearch or distributed search

Page 21: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 21IS 240 – Spring 2006

Cheshire3 Overview

• XML Information Retrieval Engine – 3rd Generation of the UC Berkeley Cheshire system,

as co-developed at the University of Liverpool.– Uses Python for flexibility and extensibility, but

imports C/C++ based libraries for processing speed– Standards based: XML, XSLT, CQL, SRW/U, Z39.50,

OAI to name a few.– Grid capable. Uses distributed configuration files,

workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes.

– Free and Open Source Software. (GPL Licence)– http://www.cheshire3.org/ (under development!)

Page 22: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 22IS 240 – Spring 2006

Cheshire3 Server Overview

API

INDEXING

T R RX E AS C NL O ST R F D O R M S

SEARCH

P HR AO NT DO LC EO RL

DB API

REMOTESYSTEMS

(any protocol)

XMLCONFIG

& MetadataINFO

INDEXES

LOCAL DB

STAFF UI

CONFIG

NETWORK

RESULTSETS

SCAN

USERINFOC

ONFIG&CONTROL

ACCESSINFO

AUTHENTICATION

CLUSTERING

Native calls

Z39.50SOAPOAI

JDBC

Fetch IDPut ID

OpenURL

APACHE

INTERFACE

SERVERCONTROL

UDDIWSRP

SRW

Normalization

ClientUser/

Clients

OGIS

Cheshire3 SERVER

Page 23: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 23IS 240 – Spring 2006

Cheshire3 Grid Tests

• Running on an 30 processor cluster in Liverpool using PVM (parallel virtual machine)

• Using 16 processors with one “master” and 22 “slave” processes we were able to parse and index MARC data at about 13000 records per second

• On a similar setup 610 Mb of TEI data can be parsed and indexed in seconds

Page 24: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 24IS 240 – Spring 2006

SRB and SDSC Experiments

• We are working with SDSC to include SRB support

• We are planning to continue working with SDSC and to run further evaluations using the TeraGrid server(s) through a “small” grant for 30000 CPU hours– SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes,

each with dual 1.5 GHz Intel® Itanium® 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GBs) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network.

• Planned large-scale test collections include NSDL, the NARA repository, CiteSeer and the “million books” collections of the Internet Archive

Page 25: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 25IS 240 – Spring 2006

Cheshire3 Object Model

UserStore

User

ConfigStoreObject

Database

Query

Record

Transformer

Records

ProtocolHandler

Normaliser

IndexStore

Terms

ServerDocument

Group

Ingest ProcessDocuments

Index

RecordStore

Parser

Document

Query

ResultSet

DocumentStore

Document

PreParserPreParserPreParser

Extracter

Page 26: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 26IS 240 – Spring 2006

Cheshire3 Data Objects

• DocumentGroup: – A collection of Document objects (e.g. from a file, directory, or

external search)

• Document:– A single item, in any format (e.g. PDF file, raw XML string,

relational table)

• Record:– A single item, represented as parsed XML

• Query:– A search query, in the form of CQL (an abstract query language for

Information Retrieval)

• ResultSet:– An ordered list of pointers to records

• Index:– An ordered list of terms extracted from Records

Page 27: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 27IS 240 – Spring 2006

Cheshire3 Process Objects

• PreParser: – Given a Document, transform it into another Document (e.g.

PDF to Text, Text to XML)

• Parser:– Given a Document as a raw XML string, return a parsed Record

for the item.

• Transformer:– Given a Record, transform it into a Document (e.g. via XSLT,

from XML to PDF, or XML to relational table)

• Extracter:– Extract terms of a given type from an XML sub-tree (e.g. extract

Dates, Keywords, Exact string value)

• Normaliser:– Given the results of an extracter, transform the terms,

maintaining the data structure (e.g. CaseNormaliser)

Page 28: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 28IS 240 – Spring 2006

Cheshire3 Abstract Objects

• Server: – A logical collection of databases

• Database:– A logical collection of Documents, their

Record representations and Indexes of extracted terms.

• Workflow:– A 'meta-process' object that takes a workflow

definition in XML and converts it into executable code.

Page 29: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 29IS 240 – Spring 2006

Workflow Objects

• Workflows are first class objects in Cheshire3 (though not represented in the model diagram)

• All Process and Abstract objects have individual XML configurations with a common base schema with extensions

• We can treat configurations as Records and store in regular RecordStores, allowing access via regular IR protocols.

Page 30: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 30IS 240 – Spring 2006

Workflow References

• Workflows contain a series of instructions to perform, with reference to other Cheshire3 objects

• Reference is via pseudo-unique identifiers … Pseudo because they are unique within the current context (Server vs Database)

• Workflows are objects, so this enables server level workflows to call database specific workflows with the same identifier

Page 31: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 31IS 240 – Spring 2006

Distributed Processing

• Each node in the cluster instantiates the configured architecture, potentially through a single ConfigStore.

• Master nodes then run a high level workflow to distribute the processing amongst Slave nodes by reference to a subsidiary workflow

• As object interaction is well defined in the model, the result of a workflow is equally well defined. This allows for the easy chaining of workflows, either locally or spread throughout the cluster.

Page 32: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 32IS 240 – Spring 2006

Workflow Example1

<subConfig id=“buildWorkflow”><objectType>workflow.SimpleWorkflow</objectType><workflow> <log>Starting Load</log> <object type=“recordStore” function=“begin_storing”/> <object type=“database” function=“begin_indexing”/> <for-each> <object type=“workflow” ref=“buildSingleWorkflow”> </for-each> <object type=“recordStore” function=“commit_storing”/> <object type=“database” function=“commit_indexing”/> <object type=“database” function=“commit_metadata”/></workflow></subConfig>

Page 33: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 33IS 240 – Spring 2006

Workflow Example2

<subConfig id=“buildSingleWorkflow”><objectType>workflow.SimpleWorkflow</objectType><workflow> <object type=“workflow” ref=“PreParserWorkflow”/> <try> <object type=“parser” ref=“NsSaxParser”/> </try> <except> <log>Unparsable Record</log> <raise/> </except> <object type=“recordStore” function=“create_record”/> <object type=“database” function=“add_record”/> <object type=“database” function=“index_record”/> <log>Loaded Record</log></workflow></subConfig>

Page 34: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 34IS 240 – Spring 2006

Workflow Standards

• Cheshire3 workflows do not conform to any standard schema

• Intentional:– Workflows are specific to and dependent on the

Cheshire3 architecture– Replaces the distribution of lines of code for

distributed processing– Replaces many lines of code in general

• Needs to be easy to understand and create• GUI workflow builder coming (web and

standalone)

Page 35: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 35IS 240 – Spring 2006

External Integration

• Looking at integration with existing cross-service workflow systems, in particular Kepler/Ptolemy

• Possible integration at two levels:– Cheshire3 as a service (black box) ... Identify

a workflow to call.– Cheshire3 object as a service (duplicate

existing workflow function) … But recall the access speed issue.

Page 36: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 36IS 240 – Spring 2006

Conclusions

• Scalable Grid-Based digital library services can be created and provide support for very large collections with improved efficiency

• The Cheshire3 IR and DL architecture can provide Grid (or single processor) services for next-generation DLs

• Available as open source via:http://cheshire3.sourceforge.net orhttp://www.cheshire3.org/

Page 37: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 37IS 240 – Spring 2006

Today

• Natural Language Processing and IR– Based on Papers in Reader and on

• David Lewis & Karen Sparck Jones “Natural Language Processing for Information Retrieval” Communications of the ACM, 39(1) Jan. 1996

• Text summarization: Lecture from Ed Hovy (USC)

Page 38: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 38IS 240 – Spring 2006

Natural Language Processing and IR

• The main approach in applying NLP to IR has been an attempt to address

– Phrase usage vs individual terms

– Search expansion using related terms/concepts

– Attempts to automatically exploit or assign controlled vocabularies

Page 39: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 39IS 240 – Spring 2006

NLP and IR

• Much early research showed that (at least in the restricted test databases tested)– Indexing documents by individual terms

corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)

– Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

Page 40: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 40IS 240 – Spring 2006

NLP and IR

• Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods– E.g. Use of syntactic role relations between

terms has shown no improvement in performance over “bag of words” approaches

Page 41: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 41IS 240 – Spring 2006

General Framework of NLP

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 42: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 42IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 43: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 43IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 44: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 44IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

S

NP

P-N

John

VP

V

run

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 45: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 45IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

S

NP

P-N

John

VP

V

runPred: RUN Agent:John

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 46: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 46IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

S

NP

P-N

John

VP

V

runPred: RUN Agent:John

John is a student.He runs.

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 47: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 47IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Domain AnalysisAppelt:1999

Tokenization

Part of Speech Tagging

Term recognition(Ananiadou)

Inflection/Derivation

Compounding

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 48: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 48IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 49: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 49IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge Incomplete Lexicons

Open class words TermsTerm recognitionNamed Entities Company names Locations Numerical expressions

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 50: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 50IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Grammar Syntactic Coverage Domain Specific Constructions Ungrammatical Constructions

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 51: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 51IS 240 – Spring 2006

Syntactic Analysis

General Framework of NLP

Morphological andLexical Processing

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge Interpretation Rules

PredefinedAspects of

Information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 52: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 52IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 53: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 53IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Most words in Englishare ambiguous in terms of their part of speeches. runs: v/3pre, n/plu clubs: v/3pre, n/plu and two meanings

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 54: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 54IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Structural Ambiguities

Predicate-argument Ambiguities

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 55: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 55IS 240 – Spring 2006

Structural Ambiguities

(1)Attachment Ambiguities John bought a car with large seats. John bought a car with $3000.

(2) Scope Ambiguities

young women and men in the room

(3)Analytical Ambiguities Visiting relatives can be boring.

The manager of Yaxing Benz, a Sino-German joint ventureThe manager of Yaxing Benz, Mr. John Smith

John bought a car with Mary.$3000 can buy a nice car.

Semantic Ambiguities(1)

Semantic Ambiguities(2)

Every man loves a woman.

Co-reference Ambiguities

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 56: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 56IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Structural Ambiguities

Predicate-argument Ambiguities

CombinatorialExplosion

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 57: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 57IS 240 – Spring 2006

Note:

Ambiguities vs Robustness

More comprehensive knowledge: More Robust

big dictionariescomprehensive grammar

More comprehensive knowledge: More ambiguities

Adaptability: Tuning, LearningSlides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 58: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 58IS 240 – Spring 2006

Framework of IE

IE as compromise NLP

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 59: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 59IS 240 – Spring 2006

Syntactic Analysis

General Framework of NLP

Morphological andLexical Processing

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge Interpretation Rules

PredefinedAspects of

Information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 60: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 60IS 240 – Spring 2006

Syntactic Analysis

General Framework of NLP

Morphological andLexical Processing

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge Interpretation Rules

PredefinedAspects of

Information

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 61: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 61IS 240 – Spring 2006

Techniques in IE

(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted

(2) Ambiguities: Ignoring irrelevant ambiguities Simpler NLP techniques

(4) Adaptation Techniques: Machine Learning, Trainable systems

(3) Robustness: Coping with Incomplete dictionaries (open class words) Ignoring irrelevant parts of sentences

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 62: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 62IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Anaysis

Context processingInterpretation

Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names

Domain specific rules: <Word><Word>, Inc. Mr. <Cpt-L>. <Word>Machine Learning: HMM, Decision TreesRules + Machine Learning

Part of Speech TaggerFSA rulesStatistic taggers

95 %

F-Value90

DomainDependent

Local ContextStatistical Bias

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 63: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 63IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Anaysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 64: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 64IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Anaysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 65: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 65IS 240 – Spring 2006

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

Page 66: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 66IS 240 – Spring 2006

Using NLP

• Strzalkowski (in Reader)

Text NLP represDbasesearch

TAGGERNLP: PARSER TERMS

Page 67: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 67IS 240 – Spring 2006

Using NLP

INPUT SENTENCEThe former Soviet President has been a local hero ever sincea Russian tank invaded Wisconsin.

TAGGED SENTENCEThe/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

Page 68: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 68IS 240 – Spring 2006

Using NLP

TAGGED & STEMMED SENTENCEthe/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per

Page 69: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 69IS 240 – Spring 2006

Using NLP

PARSED SENTENCE

[assert

[[perf [have]][[verb[BE]]

[subject [np[n PRESIDENT][t_pos THE]

[adj[FORMER]][adj[SOVIET]]]]

[adv EVER]

[sub_ord[SINCE [[verb[INVADE]]

[subject [np [n TANK][t_pos A]

[adj [RUSSIAN]]]]

[object [np [name [WISCONSIN]]]]]]]]]

Page 70: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 70IS 240 – Spring 2006

Using NLP

EXTRACTED TERMS & WEIGHTS

President 2.623519 soviet 5.416102

President+soviet 11.556747 president+former 14.594883

Hero 7.896426 hero+local 14.314775

Invade 8.435012 tank 6.848128

Tank+invade 17.402237 tank+russian 16.030809

Russian 7.383342 wisconsin 7.785689

Page 71: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 71IS 240 – Spring 2006

Same Sentence, different sys

INPUT SENTENCEThe former Soviet President has been a local hero ever sincea Russian tank invaded Wisconsin.

TAGGED SENTENCE (using uptagger from Tsujii)The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.

Page 72: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 72IS 240 – Spring 2006

Same Sentence, different sys

CHUNKED Sentence (chunkparser – Tsujii)(TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (. .) ) )

Page 73: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 73IS 240 – Spring 2006

Same Sentence, different sys

Enju ParserROOT ROOT ROOT ROOT -1 ROOT been be VBN VB 5been be VBN VB 5 ARG1 President president NNP NNP 3been be VBN VB 5 ARG2 hero hero NN NN 8a a DT DT 6 ARG1 hero hero NN NN 8a a DT DT 11 ARG1 tank tank NN NN 13local local JJ JJ 7 ARG1 hero hero NN NN 8The the DT DT 0 ARG1 President president NNP NNP 3former former JJ JJ 1 ARG1 President president NNP NNP 3Russian russian JJ JJ 12 ARG1 tank tank NN NN 13Soviet soviet NNP NNP 2 MOD President president NNP NNP 3invaded invade VBD VB 14 ARG1 tank tank NN NN 13invaded invade VBD VB 14 ARG2 Wisconsin wisconsin NNP NNP 15has have VBZ VB 4 ARG1 President president NNP NNP 3has have VBZ VB 4 ARG2 been be VBN VB 5since since IN IN 10 MOD been be VBN VB 5since since IN IN 10 ARG1 invaded invade VBD VB 14ever ever RB RB 9 ARG1 since since IN IN 10

Page 74: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 74IS 240 – Spring 2006

NLP & IR

• Indexing– Use of NLP methods to identify phrases

• Test weighting schemes for phrases

– Use of more sophisticated morphological analysis

• Searching– Use of two-stage retrieval

• Statistical retrieval• Followed by more sophisticated NLP filtering

Page 75: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 75IS 240 – Spring 2006

NPL & IR

• Lewis and Sparck Jones suggest research in three areas– Examination of the words, phrases and sentences

that make up a document description and express the combinatory, syntagmatic relations between single terms

– The classificatory structure over document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching

– Using NLP-based methods for searching and matching

Page 76: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 76IS 240 – Spring 2006

NLP & IR Issues

• Is natural language indexing using more NLP knowledge needed?

• Or, should controlled vocabularies be used

• Can NLP in its current state provide the improvements needed

• How to test

Page 77: 2006.04.25 - SLIDE 1IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday

2006.04.25 - SLIDE 77IS 240 – Spring 2006

NLP & IR

• New “Question Answering” track at TREC has been exploring these areas– Usually statistical methods are used to

retrieve candidate documents– NLP techniques are used to extract the likely

answers from the text of the documents