WordNet and other NLP tools

Course G22.2580 - Web Search Engines 3/9/2011

Wei Xu xuwei@cs.nyu.edu

WordNet – What is it?

WordNet®
▪ a large lexical database of English
▪ a combination of dictionary and thesaurus
▪ created and maintained by the Cognitive Science Lab of Princeton University
▪ designed to establish the connections between words

WordNet

http://wordnet.princeton.edu/

WordNet – some concepts

WORDnet
4 types of Parts of Speech (POS)
▪ Noun, Verb, Adjective, Adverb

Synset
▪ the smallest unit in WordNet
▪ a synonym set
▪ represents a specific meaning of a word

WordNet – some concepts

wordNET
Synsets are connected to one another through semantic and lexical relations

Types of relations (based on POS)
▪ hypernym (kind-of): ‘vehicle’ is a hypernym of ‘car’
▪ hyponym (kind-of): ‘car’ is a hyponym of ‘vehicle’
▪ holonym (part-of): ‘building’ is a holonym of ‘window’
▪ meronym (part-of): ‘window’ is a meronym of ‘building’
▪ similar to: ‘smart’ is similar to ‘intelligent’
▪ antonym: ‘smart’ is an antonym of ‘unintelligent’
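The paired relations above (hypernym/hyponym, holonym/meronym) are inverses of each other, which can be sketched with a small in-memory graph. This is only a toy illustration, not the WordNet API: `ToyLexicon`, `INVERSE`, `add` and `related` are hypothetical names, and plain words stand in for synsets to keep the example short.

```python
from collections import defaultdict

# Every relation is stored together with its inverse, so recording one
# direction makes the other direction queryable as well.
INVERSE = {
    "hypernym": "hyponym",
    "hyponym": "hypernym",
    "holonym": "meronym",
    "meronym": "holonym",
    "similar_to": "similar_to",
    "antonym": "antonym",
}


class ToyLexicon:
    """Hypothetical stand-in for a lexical database (words instead of synsets)."""

    def __init__(self):
        # relations[relation_name][word] -> set of related words
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, word, relation, other):
        """Record that `other` is a <relation> of `word`, plus the inverse link."""
        self.relations[relation][word].add(other)
        self.relations[INVERSE[relation]][other].add(word)

    def related(self, word, relation):
        """Return the words linked to `word` by `relation`, sorted."""
        return sorted(self.relations[relation][word])


lexicon = ToyLexicon()
lexicon.add("car", "hypernym", "vehicle")          # 'vehicle' is a hypernym of 'car'
lexicon.add("window", "holonym", "building")       # 'building' is a holonym of 'window'
lexicon.add("smart", "similar_to", "intelligent")
lexicon.add("smart", "antonym", "unintelligent")

print(lexicon.related("vehicle", "hyponym"))   # inverse of the hypernym link
print(lexicon.related("building", "meronym"))  # inverse of the holonym link
```

In WordNet itself the nodes are synsets rather than words, so a word with several meanings takes part in several separate sets of relations.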

WordNet – some concepts

[Figure: diagram with hypernym and hyponym labels]

WordNet – interfaces

Unix-style manual
Web Interfaces
Local Interfaces/APIs
▪ Java
▪ Perl
▪ C#

http://wordnet.princeton.edu/wordnet/related-projects/#web

Stemming

Definition: the process of removing suffixes from words to get their base or root form

Example: ‘fishing’, ‘fished’, ‘fish’, ‘fisher’ → ‘fish’
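The suffix-removal idea can be sketched with a very small rule-based stemmer. This is a toy illustration under simplifying assumptions, not the Porter algorithm or any library's stemmer: the suffix list, the `min_stem_len` guard and the name `naive_stem` are all made up for the example.

```python
# Hypothetical suffix list, chosen only to cover the slide's example words.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least `min_stem_len` letters."""
    word = word.lower()
    # Try longer suffixes first so 'ing' wins over shorter candidates.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


words = ["fishing", "fished", "fish", "fisher"]
stems = [naive_stem(w) for w in words]
print(stems)
```

Rules this crude misfire quickly (for example, 'sing' is left alone only because of the length guard), which is why practical stemmers apply ordered rule sets with extra conditions.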

NLP techniques

Tokenization
▪ The process of breaking a stream of text up into “words” and punctuation marks.
Sentence Splitting
Part of Speech Tagging

Example:

He/PRP 's/VBZ at/IN peace/NN with/IN the/DT house/NN and/CC could/MD stay/VB there/RB indefinitely/RB ./.
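A rough sketch of the tokenization step and of reading the word/TAG notation used in the example. The regular expression is an assumption made for this sentence only (it splits clitics such as 's off the preceding word); it is not how the Stanford tagger tokenizes, and `tokenize` and `parse_tagged` are hypothetical helper names.

```python
import re

# Match, in order: an apostrophe clitic ('s), a run of word characters,
# or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"'\w+|\w+|[^\w\s]")


def tokenize(text):
    """Break raw text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def parse_tagged(tagged):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    # rsplit on the last slash so a token such as './.' still yields ('.', '.').
    return [tuple(item.rsplit("/", 1)) for item in tagged.split()]


sentence = "He's at peace with the house and could stay there indefinitely."
tagged = (
    "He/PRP 's/VBZ at/IN peace/NN with/IN the/DT house/NN and/CC "
    "could/MD stay/VB there/RB indefinitely/RB ./."
)

tokens = tokenize(sentence)
pairs = parse_tagged(tagged)
print(tokens)
print(pairs)
```

The pattern is deliberately simple: it would also treat the opening quote of a quoted word as a clitic, and it does not handle contractions like "don't" the way treebank-style tokenizers do.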

NLP techniques (cont’d)

Named Entity Recognition
▪ The process of labeling sequences of words which are the names of things, such as person, company, and location names.

Example:

Jim bought 300 shares of Acme Corp. in 2006.

<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300 shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>.
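The inline markup in the example can be read back into (type, text) pairs with a short regular expression. This sketch only parses already-annotated text; it does not perform recognition, and the pattern assumes the flat, non-nested ENAMEX/TIMEX tags shown above. `extract_entities` and `strip_markup` are hypothetical helper names.

```python
import re

# Matches <ENAMEX TYPE="...">text</ENAMEX> and <TIMEX TYPE="...">text</TIMEX>;
# the \1 backreference requires the closing tag to match the opening tag name.
ENTITY_PATTERN = re.compile(r'<(ENAMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>')


def extract_entities(marked_up):
    """Return (entity_type, entity_text) pairs in order of appearance."""
    return [(m.group(2), m.group(3)) for m in ENTITY_PATTERN.finditer(marked_up)]


def strip_markup(marked_up):
    """Remove the entity tags, leaving the original sentence."""
    return ENTITY_PATTERN.sub(lambda m: m.group(3), marked_up)


annotated = (
    '<ENAMEX TYPE="PERSON">Jim</ENAMEX> bought 300 shares of '
    '<ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in '
    '<TIMEX TYPE="DATE">2006</TIMEX>.'
)

entities = extract_entities(annotated)
plain = strip_markup(annotated)
print(entities)
print(plain)
```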

NLP tools

Stanford POS tagger
▪ http://nlp.stanford.edu/software/tagger.shtml
Stanford NER
▪ http://nlp.stanford.edu/software/CRF-NER.shtml
GATE
▪ http://gate.ac.uk/
JET
▪ http://cs.nyu.edu/grishman/jet/license.html
▪ http://www.cs.nyu.edu/courses/spring10/G22.2590-001/schedule.html