
Alessandro Moschitti
Department of Computer Science and Information Engineering, University of Trento
Email: [email protected]

Natural Language Processing and Information Retrieval

Course Description

Course Schedule

  Lectures:
  Tuesday, 8:30 - 10:30
  Thursday, 10:30 - 12:30
  Room 107
  Some lectures in the lab

  Office hours:
  My office on the third floor
  Monday from 14:00 to 15:30
  Sending an email in advance is recommended

Syllabus

  Introduction to Information Retrieval (IR): Boolean retrieval, Vector Space Model, feature vectors, document/passage retrieval, search engines, relevance feedback & query expansion, document filtering and categorization, flat and hierarchical clustering, Latent Semantic Analysis, web crawling and the Google algorithm.

  Statistical Machine Learning: kernel methods, classification, clustering, ranking, re-ranking and regression, with hints on practical machine learning.

Syllabus

  Performance Evaluation: performance measures, performance estimation, cross-validation, held-out and n-fold cross-validation.

  Statistical Natural Language Processing:
  Sequence labeling: POS-tagging, Named Entity Recognition and normalization.
  Syntactic parsing: shallow and deep constituency parsing, dependency syntactic parsing.
  Shallow semantic parsing: predicate-argument structures, SRL of FrameNet and PropBank, relation extraction (supervised and semi-supervised).
  Discourse parsing: coreference resolution and discourse connective classification.

Syllabus

  Joint NLP and IR applications:
  Deep linguistic analysis for Question Answering: QA tasks (open, restricted, factoid, non-factoid), NLP representation, question answering workflow, QA pipeline, question classification and QA reranking.
  Fine-grained opinion mining: automatic review classification, deep opinion analysis, automatic product extraction and review, reputation/social media analysis.

Lab

  Search Engines

  Automated Text Categorization

  Syntactic Parsing and Named Entity Recognition

  Question Classification (Question Answering)

Where to study?

  Course slides at http://disi.unitn.it/moschitti/teaching.html (NLP-IR section)

  Books - IR:
  Modern Information Retrieval. Ricardo A. Baeza-Yates and Berthier Ribeiro-Neto. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999. ISBN 020139829X.
  IIR: Introduction to Information Retrieval. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. Cambridge University Press, 2008.

Where to study?

  Books - NLP:
  Foundations of Statistical Natural Language Processing. Christopher D. Manning and Hinrich Schütze. MIT Press, Cambridge, MA, May 1999.
  Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Second Edition. Daniel Jurafsky and James H. Martin.

Where to study?

  Course slides at http://disi.unitn.it/moschitti/teaching.html (NLP-IR section)

  Slides of IIR available at: http://informationretrieval.org


Motivation

  Why NLP and IR?

  IR studies methods to search and retrieve information
  Basic models are based on words
  Mostly statistics-based

  NLP studies automatic approaches to understanding and generating language
  Uses complex structures: syntax and semantics
  Traditionally logic-based, but nowadays mostly statistical too

Motivation

  IR has been very successful
  Google Inc.
  AltaVista, born in 1995

  NLP has been pretty much unsuccessful for commercial purposes

  So why use NLP?

Motivations

  Let us ask:
  Who is the President of the United States?

  (yes) The president of the United States is Barack Obama
  (no) Glenn F. Tilton is President of United Airlines

Motivations

  TREC has taught us that this model is too weak

  Consider a more complex task, e.g., a Jeopardy! quiz show question:
  "When hit by electrons, a phosphor gives off electromagnetic energy in this form"
  Solutions: photons/light

  What are the most similar fragments retrieved by a search engine?

Motivations

  This shows that:
  Word matching is not enough
  Structure is required

  What kind of structures do we need?
  How do we carry out structural similarity?
  Still not a completely solved problem, but…

Today

  Basic Concepts of Search Engines

Unstructured (text) vs. structured (database) data in 1996

[Bar chart comparing data volume and market cap for unstructured vs. structured data]

Unstructured (text) vs. structured (database) data in 2006

[Bar chart comparing data volume and market cap for unstructured vs. structured data]

Unstructured data in 1650

  Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

  One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. But:
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the word Romans near countrymen) are not feasible
  Ranked retrieval (best documents to return) is not possible

Term-document incidence
(1 if play contains word, 0 otherwise)

             Antony &   Julius   The
             Cleopatra  Caesar   Tempest  Hamlet  Othello  Macbeth
  Antony         1         1        0        0       0        1
  Brutus         1         1        0        1       0        0
  Caesar         1         1        0        1       1        1
  Calpurnia      0         1        0        0       0        0
  Cleopatra      1         0        0        0       0        0
  mercy          1         0        1        1       1        1
  worser         1         0        1        1       1        0

Query: Brutus AND Caesar but NOT Calpurnia

Incidence vectors

  So we have a 0/1 vector for each term.

  To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) ➨ bitwise AND.

  110100 AND 110111 AND 101111 = 100100.
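This can be sketched directly in Python, using integers as bit vectors; the values below are read off the incidence table above (plays in table order):

    # Incidence vectors: one bit per play, in the order Antony & Cleopatra,
    # Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
    brutus    = 0b110100
    caesar    = 0b110111
    calpurnia = 0b010000

    mask = 0b111111  # six plays
    answer = brutus & caesar & ~calpurnia & mask
    print(format(answer, "06b"))  # 100100 -> Antony & Cleopatra, Hamlet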

Answers to query

  Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.

  Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.


Bigger corpora

  Consider N = 1M documents, each with about 1K terms.

  Avg 6 bytes/term, incl. spaces/punctuation ⇒ 6GB of data in the documents.

  Say there are m = 500K distinct terms among these.

Can't build the matrix

  A 500K x 1M matrix has half a trillion 0's and 1's.

  But it has no more than one billion 1's (at most 1K terms per document x 1M documents), so the matrix is extremely sparse.

  What's a better representation? We only record the 1 positions.

Why?

Inverted index

  For each term T, we must store a list of all documents that contain T.

  Do we use an array or a list for this?

  Brutus    → 2 4 8 16 32 64 128
  Caesar    → 1 2 3 5 8 13 21 34
  Calpurnia → 13 16

  What happens if the word Caesar is added to document 14?

Inverted index

  Linked lists generally preferred to arrays:
  Dynamic space allocation
  Insertion of terms into documents is easy
  Space overhead of pointers

  Dictionary    Postings lists
  Brutus      → 2 4 8 16 32 64 128
  Caesar      → 1 2 3 5 8 13 21 34
  Calpurnia   → 13 16

  Each entry in a postings list is a posting; postings are sorted by docID (more later on why).

Inverted index construction

  Documents to be indexed:  Friends, Romans, countrymen.

  Tokenizer → token stream:  Friends Romans Countrymen

  Linguistic modules → modified tokens:  friend roman countryman

  Indexer → inverted index:
    friend     → 2 4
    roman      → 1 2
    countryman → 13 16

  More on these steps later.

Indexer steps

  Sequence of (modified token, document ID) pairs.

  Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

  Sort by terms. This is the core indexing step.

Indexer steps: Dictionary & Postings

  Multiple term entries in a single document are merged.

  Split into Dictionary and Postings.

  Document frequency information is added. Why frequency? Will discuss later.
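These steps can be sketched in Python on the two toy documents above (lowercasing stands in for the linguistic modules; the tokenizer regex is a simplification):

    import re
    from collections import defaultdict

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    # 1) Sequence of (modified token, docID) pairs
    pairs = [(tok.lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in re.findall(r"[A-Za-z']+", text)]

    # 2) Sort by term, then docID: the core indexing step
    pairs.sort()

    # 3) Merge duplicate (term, docID) entries; split into dictionary and postings
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)

    # The dictionary keeps each term with its document frequency
    dictionary = {term: len(plist) for term, plist in postings.items()}
    print(dictionary["caesar"], postings["caesar"])  # 2 [1, 2]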

Where do we pay in storage?

  Dictionary: terms and counts

  Postings: pointers and the lists of docIDs

  Later in the course: How do we index efficiently? How much storage do we need?

The index we just built

  How do we process a query? (Today's focus)

  Later: what kinds of queries can we process?

Query processing: AND

  Consider processing the query: Brutus AND Caesar

  Locate Brutus in the Dictionary; retrieve its postings.
  Locate Caesar in the Dictionary; retrieve its postings.
  "Merge" the two postings:

  Brutus → 2 4 8 16 32 64 128
  Caesar → 1 2 3 5 8 13 21 34

The merge

  Walk through the two postings simultaneously, in time linear in the total number of postings entries:

  Brutus → 2 4 8 16 32 64 128
  Caesar → 1 2 3 5 8 13 21 34
  Result → 2 8

  If the list lengths are x and y, the merge takes O(x+y) operations.

  Crucial: postings sorted by docID.

Intersecting two postings lists (a “merge” algorithm)
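The slide's algorithm, sketched in Python (a two-pointer walk over docID-sorted lists; the function name is ours):

    def intersect(p1, p2):
        """Intersect two docID-sorted postings lists in O(x+y) time."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([2, 4, 8, 16, 32, 64, 128],
                    [1, 2, 3, 5, 8, 13, 21, 34]))  # [2, 8]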

Boolean queries: Exact match

  The Boolean retrieval model lets users pose any query that is a Boolean expression:
  Boolean queries use AND, OR and NOT to join query terms
  Views each document as a set of words
  Is precise: a document matches the condition or it does not

  Primary commercial retrieval tool for 3 decades.

  Professional searchers (e.g., lawyers) still like Boolean queries:
  You know exactly what you're getting.

Example: WestLaw (http://www.westlaw.com/)

  Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)

  Tens of terabytes of data; 700,000 users

  Majority of users still use Boolean queries

  Example query:
  What is the statute of limitations in cases involving the federal tort claims act?
  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
  /3 = within 3 words, /S = in same sentence

Example: WestLaw (http://www.westlaw.com/)

  Another example query:
  Requirements for disabled people to be able to access a workplace
  disabl! /p access! /s work-site work-place (employment /3 place)

  Note that SPACE is disjunction, not conjunction!

  Long, precise queries; proximity operators; incrementally developed; not like web search

  Professional searchers often like Boolean search:
  Precision, transparency and control

  But that doesn't mean such queries actually work better…

Boolean queries: More general merges

  Exercise: Adapt the merge for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar

  Can we still run through the merge in time O(x+y), or what can we achieve?
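A possible solution sketch for the first query, in Python under the same assumption of docID-sorted lists (keep the docs in p1 that the p2 pointer skips over):

    def and_not(p1, p2):
        """p1 AND NOT p2 for docID-sorted postings lists, in O(x+y) time."""
        answer = []
        i = j = 0
        while i < len(p1):
            if j == len(p2) or p1[i] < p2[j]:
                answer.append(p1[i])  # docID appears only in p1
                i += 1
            elif p1[i] == p2[j]:
                i += 1                # excluded by the NOT
                j += 1
            else:
                j += 1                # advance p2
        return answer

    print(and_not([2, 4, 8, 16, 32, 64, 128],
                  [1, 2, 3, 5, 8, 13, 21, 34]))  # [4, 16, 32, 64, 128]

Brutus OR NOT Caesar is harder: its answer contains every document outside Caesar's postings, so the merge is linear in the collection size rather than in the postings sizes.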


Merging

  What about an arbitrary Boolean formula?

  (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

  Can we always merge in "linear" time? Linear in what?

  Can we do better?

Query optimization

  What is the best order for query processing?

  Consider a query that is an AND of t terms.

  For each of the t terms, get its postings, then AND them together.

  Brutus    → 2 4 8 16 32 64 128
  Caesar    → 1 2 3 5 8 16 21 34
  Calpurnia → 13 16

  Query: Brutus AND Calpurnia AND Caesar

Query optimization example

  Process in order of increasing freq: start with the smallest set, then keep cutting further.

  Brutus    → 2 4 8 16 32 64 128
  Caesar    → 1 2 3 5 8 13 21 34
  Calpurnia → 13 16

  This is why we kept freq in the dictionary.

  Execute the query as (Calpurnia AND Brutus) AND Caesar.

More general optimization

  E.g., (madding OR crowd) AND (ignoble OR strife)

  Get freq's for all terms.

  Estimate the size of each OR by the sum of its freq's (conservative).

  Process in increasing order of OR sizes.

Exercise

  Recommend a query processing order for:

  (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

  Term          Freq
  eyes          213312
  kaleidoscope   87009
  marmalade     107913
  skies         271658
  tangerine      46653
  trees         316812
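Applying the previous slide's heuristic to this table, a short Python sketch:

    freq = {"eyes": 213312, "kaleidoscope": 87009, "marmalade": 107913,
            "skies": 271658, "tangerine": 46653, "trees": 316812}

    ors = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

    # Estimate each OR's size by the sum of its term freq's, then sort ascending
    for a, b in sorted(ors, key=lambda p: freq[p[0]] + freq[p[1]]):
        print(f"({a} OR {b}): {freq[a] + freq[b]}")
    # (kaleidoscope OR eyes): 300321
    # (tangerine OR trees): 363465
    # (marmalade OR skies): 379571

So process (kaleidoscope OR eyes) first, then (tangerine OR trees), then (marmalade OR skies).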

Query processing exercises

  If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?

  Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?

  Hint: Begin with the case of a Boolean formula query: in this, each query term appears only once in the query.

Exercise

  Try the search feature at http://www.rhymezone.com/shakespeare/

  Write down five search features you think it could do better

What's ahead in IR? Beyond term search

  What about phrases? E.g., Stanford University

  Proximity: Find Gates NEAR Microsoft.
  Need the index to capture position information in docs.

  Zones in documents: Find documents with (author = Ullman) AND (text contains automata).

Evidence accumulation

  1 vs. 0 occurrences of a search term
  2 vs. 1 occurrences
  3 vs. 2 occurrences, etc.
  Usually more seems better

  Need term frequency information in docs

Ranking search results

  Boolean queries give inclusion or exclusion of docs.

  Often we want to rank/group results:
  Need to measure proximity from query to each doc.
  Need to decide whether docs presented to the user are singletons, or a group of docs covering various aspects of the query.

IR vs. databases: Structured vs. unstructured data

  Structured data tends to refer to information in "tables"

  Employee   Manager   Salary
  Smith      Jones     50000
  Chang      Smith     60000
  Ivy        Smith     50000

  Typically allows numerical range and exact match (for text) queries, e.g.,
  Salary < 60000 AND Manager = Smith.

Unstructured data

  Typically refers to free-form text

  Allows:
  Keyword queries, including operators
  More sophisticated "concept" queries, e.g., find all web pages dealing with drug abuse

  Classic model for searching text documents

Semi-structured data

  In fact almost no data is "unstructured"

  E.g., this slide has distinctly identified zones such as the Title and Bullets

  Facilitates "semi-structured" search such as:
  Title contains data AND Bullets contain search

  … to say nothing of linguistic structure

More sophisticated semi-structured search

  Title is about Object Oriented Programming AND Author something like stro*rup
  where * is the wild-card operator

  Issues:
  How do you process "about"?
  How do you rank results?

  The focus of XML search (IIR chapter 10)

Clustering, classification and ranking

  Clustering: Given a set of docs, group them into clusters based on their contents.

  Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.

  Ranking: Can we learn how to best order a set of documents, e.g., a set of search results?

The web and its challenges

  Unusual and diverse documents

  Unusual and diverse users, queries, information needs

  Beyond terms, exploit ideas from social networks: link analysis, clickstreams...

  How do search engines work? And how can we make them better?

More sophisticated information retrieval

  Cross-language information retrieval

  Question answering

  Summarization

  Text mining

  …

Vector Spaces

Definition (1)

  A set V is a vector space over a field F (for example, the field of real or of complex numbers) if, given

  an operation vector addition defined in V, denoted v + w (where v, w ∈ V), and

  an operation, scalar multiplication in V, denoted a * v (where v ∈ V and a ∈ F),

  the following properties hold for all a, b ∈ F and u, v, and w ∈ V:

  v + w belongs to V. (Closure of V under vector addition)

  u + (v + w) = (u + v) + w (Associativity of vector addition in V)

  There exists a neutral element 0 in V, such that for all elements v in V, v + 0 = v (Existence of an additive identity element in V)

Definition (2)

  For all v in V, there exists an element w in V, such that v + w = 0 (Existence of additive inverses in V)

  v + w = w + v (Commutativity of vector addition in V)

  a * v belongs to V (Closure of V under scalar multiplication)

  a * (b * v) = (ab) * v (Associativity of scalar multiplication in V)

  If 1 denotes the multiplicative identity of the field F, then 1 * v = v (Neutrality of one)

  a * (v + w) = a * v + a * w (Distributivity with respect to vector addition.)

  (a + b) * v = a * v + b * v (Distributivity with respect to field addition.)

An example of Vector Space

  For all n, Rn forms a vector space over R, with component-wise operations.

  Let V be the set of all n-tuples, [v1,v2,v3,...,vn] where vi is a member of R={real numbers}

  Let the field be R, as well

  Define Vector Addition: For all v, w, in V, define v+w=[v1+w1,v2+w2,v3+w3,...,vn+wn]

  Define Scalar Multiplication: For all a in F and v in V, a*v=[a*v1,a*v2,a*v3,...,a*vn]

  Then V is a Vector Space over R.
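As a quick illustration, these component-wise operations in Python (plain lists standing in for n-tuples; the helper names are ours):

    def v_add(v, w):
        """Component-wise vector addition in R^n."""
        return [vi + wi for vi, wi in zip(v, w)]

    def s_mul(a, v):
        """Scalar multiplication in R^n."""
        return [a * vi for vi in v]

    v, w = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    print(v_add(v, w))    # [5.0, 7.0, 9.0]
    print(s_mul(2.0, v))  # [2.0, 4.0, 6.0]
    # Spot-check one axiom, distributivity: a*(v+w) == a*v + a*w
    print(s_mul(2.0, v_add(v, w)) == v_add(s_mul(2.0, v), s_mul(2.0, w)))  # True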

Linear dependency

  Vectors v1, …, vn are linearly dependent if α1 v1 + … + αn vn = 0 for some α1, …, αn not all zero.

  If they are linearly independent, any y = α1 v1 + … + αn vn has a unique expression.

  If αi ≥ 0 and the αi sum to 1, the combination is called a convex combination.

Normed Vector Spaces

  Given a vector space V over a field K, a norm on V is a function from V to R:
  it associates each vector v in V with a real number, ||v||

  The norm must satisfy the following conditions, for all a in K and all u and v in V:
  1. ||v|| ≥ 0, with equality if and only if v = 0
  2. ||av|| = |a| ||v||
  3. ||u + v|| ≤ ||u|| + ||v||

  A useful consequence of the norm axioms is the inequality
  ||u ± v|| ≥ | ||u|| - ||v|| |
  for all vectors u and v

Inner Product Spaces

  Let V be a vector space, u, v, and w vectors in V, and c a constant.

  Then an inner product ( , ) on V is a function with domain consisting of pairs of vectors and range the real numbers, satisfying the following properties:

  1. (u, u) ≥ 0, with equality if and only if u = 0
  2. (u, v) = (v, u)
  3. (u + v, w) = (u, w) + (v, w)
  4. (cu, v) = (u, cv) = c(u, v)

Example

  Let V be the vector space of all continuous functions on an interval [a, b], with the standard + and *. Then define an inner product by the definite integral:

  (f, g) = ∫_a^b f(x) g(x) dx

  The four properties follow immediately from the analogous properties of the definite integral.
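A numeric sketch of this inner product in Python (trapezoidal approximation of the integral; the interval [0, 1] and the functions are just illustrative choices):

    import math

    def inner(f, g, a=0.0, b=1.0, n=1000):
        """Approximate (f, g) = integral of f(x)*g(x) over [a, b]."""
        h = (b - a) / n
        ys = [f(a + i * h) * g(a + i * h) for i in range(n + 1)]
        return h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))

    f, g = math.sin, math.cos
    print(inner(f, g), inner(g, f))  # symmetry: (f, g) == (g, f)
    print(inner(f, f) >= 0)          # positivity: True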

Inner Product Properties

  (v, 0) = 0

  If (v, u) = 0, v and u are called orthogonal

  Schwarz Inequality: (v, u)^2 ≤ (v, v) (u, u)

  The classical scalar product is the component-wise product:
  (x1, x2, …, xn) · (y1, y2, …, yn) = x1 y1 + x2 y2 + … + xn yn

  The norm induced by the inner product: ||v|| = √(v, v)

  cos(u, v) = (u, v) / (||u|| ||v||)

Projection

  From cos(x, w) = (x · w) / (||x|| ||w||)

  it follows that

  ||x|| cos(x, w) = (x · w) / ||w|| = x · (w / ||w||)

  i.e., the norm of x times the cosine between x and w: the projection of x on w.

Similarity Metrics

  The simplest distance for a continuous m-dimensional instance space is Euclidean distance.

  The simplest distance for an m-dimensional binary instance space is Hamming distance (the number of feature values that differ).

  Cosine similarity is typically the most effective.
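These three measures, sketched in plain Python (vectors as equal-length lists):

    import math

    def euclidean(x, y):
        """Euclidean distance between two m-dimensional vectors."""
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    def hamming(x, y):
        """Number of feature values that differ between two binary vectors."""
        return sum(xi != yi for xi, yi in zip(x, y))

    def cosine(x, y):
        """Cosine similarity: inner product over the product of the norms."""
        dot = sum(xi * yi for xi, yi in zip(x, y))
        return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                      math.sqrt(sum(yi * yi for yi in y)))

    print(euclidean([1.0, 0.0], [0.0, 1.0]))  # 1.414...
    print(hamming([1, 0, 1], [1, 1, 0]))      # 2
    print(cosine([1.0, 2.0], [2.0, 4.0]))     # 1.0 (same direction)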

The Vector Space Model (VSM)

  Dimensions (terms): Berlusconi, Bush, Totti

  d1 (Politics):  Bush declares war. Berlusconi gives support
  d2 (Sport):     Wonderful Totti in yesterday's match against Berlusconi's Milan
  d3 (Economics): Berlusconi acquires Ibrahimović before elections

  q1: Berlusconi supports
  q2: Totti match

  [Plot: d1, d2, d3, q1 and q2 as vectors in the three-term space]

Summary of VSM

  VSM (Salton, 1989)

  Features are dimensions of a vector space (the dot product in this space is a linear kernel).

  Documents and queries are vectors of feature weights.

  d is retrieved for q if d · q > th_q (a query-dependent threshold)
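A toy end-to-end sketch of this retrieval rule in Python on the three documents above (raw term counts as feature weights; the pre-normalized tokens and the threshold value are illustrative assumptions):

    from collections import Counter

    docs = {
        "d1": "bush declare war berlusconi give support",
        "d2": "wonderful totti yesterday match against berlusconi milan",
        "d3": "berlusconi acquire ibrahimovic before election",
    }
    q = "berlusconi support"

    def vec(text):
        """Raw term-frequency vector (tokens assumed already normalized)."""
        return Counter(text.split())

    def dot(u, v):
        return sum(u[t] * v[t] for t in u)

    th_q = 1  # illustrative threshold
    qv = vec(q)
    for name, text in docs.items():
        score = dot(qv, vec(text))
        print(name, score, "retrieved" if score > th_q else "not retrieved")
    # d1 scores 2 (berlusconi + support) and is retrieved;
    # d2 and d3 score 1 (berlusconi only) and fall below th_q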