Gestione Avanzata dell'Informazione
Part A - Full-Text Information Management
Text Operations
Contents
• Document Processing
• Thesaurus
• Other text operations
• Hands-on with Python and NLTK
GAvI - Full-Text Information Management: Text Operations
Introduction
} Improving the precision of the documents retrieved
  } With document preprocessing
  } By building a thesaurus
} Reference example:
  … He said that the chairs were enough …
Document Preprocessing
Document preprocessing is a procedure which transforms a document into a set of index terms

Text operations for document preprocessing:
} Lexical analysis of the text: treating digits, hyphens, punctuation marks, and the case of letters
} Elimination of stopwords: filtering out words that are useless for retrieval purposes
} Stemming of the remaining words: dealing with the syntactic variations of query terms
} Selection of index terms: determining the terms to be used as index terms
} Construction of term categorization structures: allowing the expansion of the original query with related terms
Lexical Analysis of the Text
} Process of converting a stream of characters into a stream of tokens
  } Token: a group of characters with collective significance
} Produces candidate index terms
} Ways to implement a lexical analyser:
  } Use a lexical-analyser generator (e.g. the UNIX tool lex)
  } Write an ad hoc lexical analyser by hand
  } Write a lexical analyser by hand as a finite state machine
} Ref. example: "He said that the chairs were enough"
  He | said | that | the | chairs | were | enough
[Figure: a sample transition diagram of a DFA (Deterministic Finite-state Automaton) for tokenization, with states 0-8 and transitions labelled Space, Letter, '(', ')', '&', '|', '^', end-of-string, and other]
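As an illustration of the hand-written approach, a minimal character-class tokenizer might look as follows. This is a sketch, not code from the slides: runs of letters become tokens and every other character acts as a separator.

```python
def tokenize(text):
    """Minimal hand-written lexical analyser: runs of letters become tokens."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("He said that the chairs were enough"))
# ['He', 'said', 'that', 'the', 'chairs', 'were', 'enough']
```

A real lexical analyser would also handle digits, hyphens, and the other special cases discussed next.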
What counts as a token? Four particular cases:
Digits
} Usually not good index terms because of their vagueness
} However, digits in terms such as B6 (the vitamin) are significant
} Requires some advanced lexical analysis procedure
  Ex) 510B.C., 4105-1201-2310-2213, 2000/2/12, …
Hyphens
} Breaking up hyphenated words might be useful
  Ex) state-of-the-art → state of the art (Good!)
  But: MS-DOS → MS DOS (???)
} Requires adopting a general rule and specifying the exceptions on a case-by-case basis
Punctuation Marks
} Are removed entirely
  Ex) 510B.C → 510BC
} If the query contains '510B.C', removing the dot from both the query term and the documents will not affect retrieval performance
} Requires the preparation of a list of exceptions
  Ex) val.id → valid (???)
The Case of Letters
} Convert all the text to either lower or upper case
} Part of the semantics might be lost
  Ex) Korea University → korea university (???)
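The hyphen, punctuation, and case rules above can be sketched as a small normalizer. This is illustrative only; the exception list is hypothetical, and a production system would need much richer rules.

```python
HYPHEN_EXCEPTIONS = {"ms-dos"}  # hypothetical exception list: keep these hyphenated terms intact

def normalize(term):
    """Sketch of the hyphen/punctuation/case rules described above."""
    t = term.lower()                       # case folding
    if "-" in t and t not in HYPHEN_EXCEPTIONS:
        t = t.replace("-", " ")            # break up hyphenated words
    return t.replace(".", "")              # strip punctuation, e.g. 510B.C -> 510bc

print(normalize("state-of-the-art"))  # state of the art
print(normalize("MS-DOS"))            # ms-dos
print(normalize("510B.C"))            # 510bc
```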
Elimination of Stopwords
} Basic concept
  } Filtering out words with very low discrimination value
    Ex) a, the, this, that, where, when, …
} Advantages
  } A very frequent word is useless for retrieval purposes
  } Reduces the size of the indexing structure considerably
} Ways to filter stoplist words from an input token stream
  } Examine the lexical analyser output and remove any stopwords
    } A standard list-searching problem: search trees, binary search on an ordered array, and hashing can be used
  } Remove stopwords as part of the lexical analysis
} Ref. example:
  He said that the chairs were enough
  → said chairs were enough
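Among the list-searching options above, hashing is the simplest to sketch: a Python set gives constant-time membership tests. The stoplist below is a tiny illustrative one, not a standard list.

```python
STOPWORDS = {"a", "an", "he", "is", "of", "that", "the", "this"}  # tiny illustrative stoplist

def remove_stopwords(tokens):
    # membership test on a hash set: one of the list-searching options above
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("He said that the chairs were enough".split()))
# ['said', 'chairs', 'were', 'enough']
```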
Stemming
} Basic idea: provide searchers with ways of finding morphological variants of search terms
  Ex) the query 'stemming' also searches for 'stemmed' and 'stem'
} What is the "stem"?
  } The portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes)
    Ex) 'connect' is the stem of the variants 'connected', 'connecting', 'connection', 'connections'
} The effect of stemming at indexing time
  } Reduces variants of the same root word to a common concept
  } Reduces the size of the indexing structure
} Ref. example:
  said chairs were enough
  → say chair enough be
Stemming strategies
Table Lookup
} Store a table of all index terms and their stems
} Simply look up the stem of a word
} Advantages: fast and easy
} Disadvantages: depends on stem data for the whole language and requires considerable storage space
Successor Variety
} The successor variety of a string is the number of different characters that follow it in a set of words in some body of text
  Ex) Text word: READABLE
  Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE
  R – 3, RE – 2, REA – 1, READ – 3, …
} In a large body of text, the successor variety of a substring usually decreases as each character is added, until a segment boundary (stem) is reached
  } "read" is a segment (or stem)
} Different strategies exist to segment a word; for instance, in the complete-word method a segment must be a complete word in the corpus
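The successor-variety counts in the READABLE example can be reproduced directly; this minimal sketch checks each prefix against the slide's corpus.

```python
CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, corpus):
    """Number of distinct characters that follow `prefix` in the corpus words."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

for p in ["R", "RE", "REA", "READ"]:
    print(p, successor_variety(p, CORPUS))
# R 3, RE 2, REA 1, READ 3
```

The drop to 1 at "REA" followed by the jump back to 3 at "READ" is the segment-boundary signal the method looks for.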
N-grams
} Identification of digrams and trigrams and computation of the bigrams (or trigrams) in common
  Ex) statistics & statistical: 6 common bigrams (out of 7 and 8 unique bigrams respectively), so the similarity is (2*6)/(7+8) = 0.8
} Similarity matrix of terms
} Clustering of terms
Affix Removal
} Remove suffixes and/or prefixes from terms, leaving a stem, and possibly transform the resulting stem
} It is based on a set of rules
  Ex) Rules for plural ⇒ singular:
    "ies", but not "eies" or "aies" ⇒ "y"
    "es", but not "aes", "ees", or "oes" ⇒ "e"
    "s", but not "ss" or "us" ⇒ NULL
} The most popular stemmer: Porter
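The n-gram similarity above is the Dice coefficient over sets of unique bigrams; a minimal sketch reproducing the statistics/statistical value:

```python
def bigrams(word):
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice_similarity(w1, w2):
    """Dice coefficient over the sets of unique bigrams of the two words."""
    b1, b2 = bigrams(w1), bigrams(w2)
    return 2 * len(b1 & b2) / (len(b1) + len(b2))

print(dice_similarity("statistics", "statistical"))  # 0.8
```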
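The plural-to-singular rules listed above can be sketched as ordered suffix rewrites; this is an illustration of that rule set only, not a full stemmer.

```python
def plural_to_singular(word):
    """Apply the plural -> singular suffix rules, in order, with their exceptions."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                # "es" -> "e"
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]                # "s" -> NULL
    return word

print(plural_to_singular("policies"))  # policy
print(plural_to_singular("chairs"))    # chair
print(plural_to_singular("glass"))     # glass (the "ss" exception)
```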
Judging stemmers
} Correctness
  } Possible incorrectness:
    } Overstemming: too much is removed
    } Understemming: too little is removed
} Retrieval effectiveness
  } Stemming can affect retrieval performance (in the majority of cases, positively)
  } The effect of stemming depends on the nature of the vocabulary
  } There are little differences between the retrieval effectiveness of different full stemmers
} Compression performance
  } 5 stemmers on 4 data sets: compression from 26.1% to 47.5%
} There is controversy about the benefits of stemming
  } For this reason some Web search engines do not adopt any stemming algorithm (Google does)
Index Term Selection
} Not all words are equally significant for representing the semantics of a document
Manual Selection
} Selection of index terms is usually done by a specialist
Automatic Selection of Index Terms
} Most of the semantics is carried by the nouns (noun selection)
} Use of parsers or taggers
} Ref. example:
  say chair enough be
  → chair
Contents
• Document Processing
• Thesaurus
• Other text operations
• Hands-on with Python and NLTK
What is a "Thesaurus"?
} A list of important words in a given domain of knowledge and, for each word in this list, a set of related words, such as common variations derived from a synonymity relationship
} Examples of thesauri
  } Roget's Thesaurus (published in 1852), WordNet, the INSPEC thesaurus
} Thesauri can be constructed
  } manually
  } automatically, from a collection of documents or by merging existing thesauri
Motivation for using a thesaurus
} Using a controlled vocabulary for
  } Indexing
    } Normalization of indexing concepts
    } Reduction of noise
    } Identification of indexing terms with a clear semantics
  } Searching
    } To assist users with proper query formulation
    } To provide classified hierarchies that allow the broadening and narrowing of the current query request
} Particularly important for specific domains (e.g. medicine)
  } For general domains the usefulness is not clear
  } However, search engines usually provide directories to reduce the space to be searched
Thesauri
} Thesaurus index term
  } Used to denote a concept, which is the basic semantic unit
  } Can be individual words, groups of words, or phrases
    E.g.) Building, Teaching, Ballistic Missiles, Body Temperature
  } Frequently, it is necessary to complement a thesaurus entry with a definition or an explanation
    E.g.) Seal (marine animal), Seal (documents)
} Thesaurus term relationships
  } Mostly composed of synonyms and near-synonyms
  } BT (Broader Term), NT (Narrower Term), RT (Related Term)
  } BT and NT can be identified in a fully automatic manner
  } Dealing with RT is much harder, because it depends on the specific context and particular needs
Use a thesaurus?
} Consequences of manual selection
  } Time consuming
  } The person using the retrieval system has to be familiar with the thesaurus
  } Thesauri are sometimes incoherent
} Consequences of automatic selection
  } Computationally too expensive in real-world settings
  } Coverage
  } Language dependence
  } Need for WSD techniques
  } The resulting representations may be too explicit to deal with the vagueness of a user's information need
} Alternative: a document is simply represented as the unstructured set of words appearing in it: bag of words
WordNet thesaurus
} WordNet Ontology - http://wordnet.princeton.edu/
  } It provides concepts from many domains
  } It can be easily extended to languages other than English
  } It presents relations between concepts which are easy to understand and use
WordNet: Lexical Database
[Diagram: the word "camera" has sense1 (photographic camera) and sense2 (television camera); words sharing a sense are synonymous, while a word with several senses is polysemous]
WordNet: Semantic Relations
[Diagram: Toaster is-a Kitchen Appliance (hypernymy); Optical Lens is part-of Camera (meronymy); Slow and Fast are values of Speed (is-value-of)]
WordNet: Semantic Relations
Relation (POS)            | Meaning               | Examples
Synonymy (N, V, Adj, Adv) | Same sense            | (camera, photographic camera), (mountain climbing, mountaineering), (fast, speedy)
Antonymy (Adj, Adv)       | Opposite              | (fast, slow), (buy, sell)
Hypernymy (N)             | Is-A                  | (camera, photographic equipment), (mountain climbing, climb)
Meronymy (N)              | Part                  | (camera, optical lens), (camera, view finder)
Troponymy (V)             | Manner                | (buy, subscribe), (sell, retail)
Entailment (V)            | Doing X entails doing Y | (buy, pay), (sell, give)
WordNet: Hierarchy
[Diagram of hypernymy (Is-A) relations: flash (photography sense) → photographic equipment → equipment → instrumentation; flash (lamp sense) → lamp → device → instrumentation]
Word Similarity
} Synonymy: a binary relation
  } Two words are either synonymous or not
} Similarity (or distance): a looser metric
  } Two words are more similar if they share more features of meaning
} We often distinguish word similarity from word relatedness
  } Similar words: near-synonyms
  } Related words: can be related in any way
    } car, bicycle: similar
    } car, gasoline: related, not similar
Why word similarity?
} Information retrieval
} Question answering
} Machine translation
} Natural language generation
} Language modeling
} Automatic essay grading
} Plagiarism detection
} Document clustering
} Word sense disambiguation
} …
Two classes of similarity measures
} Path-based measures
  } E.g., are words "nearby" in the hypernym hierarchy?
} Information-content measures
  } Do words have similar distributional contexts?
Path-based similarities
} Two concepts (senses/synsets) are similar if they
  } are near each other in the thesaurus hierarchy
  } have a short path between them
  } concepts have a path of length 1 to themselves
} Path Distance Similarity: based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy
} Wu-Palmer Similarity: based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node)
} …
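Path-distance similarity can be sketched on a toy taxonomy. The hierarchy below is hypothetical (a simplified fragment, not the real WordNet graph), and the formula follows the common convention sim = 1/(1 + shortest is-a path length), as in NLTK's path_similarity.

```python
# Hypothetical mini-taxonomy (child -> hypernym); not the real WordNet hierarchy
HYPERNYM = {
    "cat": "feline", "feline": "carnivore", "carnivore": "placental",
    "mouse": "rodent", "rodent": "placental",
}

def path_to_root(concept):
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def path_similarity(c1, c2):
    """1 / (1 + length of the shortest is-a path between the two concepts)."""
    p1 = path_to_root(c1)
    depth2 = {c: i for i, c in enumerate(path_to_root(c2))}
    shortest = min(i + depth2[c] for i, c in enumerate(p1) if c in depth2)
    return 1 / (1 + shortest)

print(path_similarity("cat", "mouse"))  # path length 5 via "placental" -> 1/6
```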
Information-content similarities
} The similarity between two senses is related to their common information
} The more two senses have in common, the more similar they are
} P(c): the probability that a randomly selected word in a corpus is an instance of concept c
} Information content:
  IC(c) = -log P(c)
} Resnik similarity: based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node)
} Measures common information as:
  sim_resnik(c1, c2) = -log P( LCS(c1, c2) )
} Other IC similarities:
  } Jiang-Conrath
  } Lin
  } …
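The IC and Resnik formulas above can be sketched with made-up concept counts. Everything here is illustrative: the counts, taxonomy, and concept names are hypothetical, chosen only to make IC(c) = -log P(c) and sim_resnik = IC(LCS) concrete.

```python
import math

# Hypothetical corpus counts: occurrences of each concept or any of its descendants
COUNTS = {"entity": 1000, "animal": 300, "feline": 40, "cat": 25,
          "canine": 70, "dog": 60}
HYPERNYM = {"cat": "feline", "feline": "animal", "dog": "canine",
            "canine": "animal", "animal": "entity"}
TOTAL = COUNTS["entity"]

def ic(concept):
    """Information content: IC(c) = -log P(c)."""
    return -math.log(COUNTS[concept] / TOTAL)

def ancestors(concept):
    chain = [concept]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def resnik(c1, c2):
    """Resnik similarity: IC of the least common subsumer of the two concepts."""
    common = set(ancestors(c2))
    lcs = next(c for c in ancestors(c1) if c in common)
    return ic(lcs)

print(resnik("cat", "dog"))  # IC("animal") = -log(0.3)
```

Rarer subsumers carry more information, so two concepts meeting low in the hierarchy score higher than two meeting at "entity" (whose IC is 0).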
Contents
• Document Processing
• Thesaurus
• Other text operations
• Hands-on with Python and NLTK
Other Text Operations
} In document preprocessing
  } Parsers
  } Taggers
  } Word Sense Disambiguation
} Improving efficiency
  } Text compression
Parsers and Taggers
} A syntactic parser is a tool that assigns a syntactic structure to any sentence in the language. It identifies:
  } Its parts, labeling each
  } The part of speech of every word
  } Possibly, semantic classes and functional classes
} It is based on a grammar
} The correctness of available general-purpose parsers does not exceed 90%
} The most promising approach is the statistical one, which is based on large bodies of machine-readable textual data
  } Treebanks: the Brown Corpus, the LOB Corpus, and the Penn Treebank project
} Popular parsers: Link Grammar Parser, Apple Pie Parser, Cass
} Taggers: only assign to each word the part of speech it assumes in the context
  } Correctness 95-99% with reasonable efficiency
  } Popular taggers: CLAWS, QTAG
Word Sense Disambiguation
} WSD involves the association of a given word in a text with a definition or meaning
} Two steps:
  } Determine all the different senses
  } Assign each occurrence of a word to the appropriate sense
} First step: adoption of a dictionary or thesaurus
  } The results depend on the adopted resource
} Second step:
  } Analysis of the context
    } Bag-of-words approach: the context is a window of words next to the term to disambiguate
    } Relational information approach: along with the bag of words, other information, such as word distances, is also extracted
  } Use of external knowledge sources
An approach for Word Sense Disambiguation
} WordNet is used as the reference thesaurus
} Noun sense disambiguation
  } The context for each noun is the set of the other nouns in the sentence
} Intuition
  } If two polysemous words are similar, then their common concepts provide information about the most suitable meanings
} For each noun N, for each WordNet sense sN of N:
  } Compute the confidence CsN in choosing sN as the sense of N
  } Select the sense with the highest confidence
} The confidence CsN in choosing sN as the sense of N is influenced by
  } The similarity between sN and all the senses of the nouns in the context
  } The frequency of that sense
} The similarity between two senses is influenced by
  } Their distance in the hypernymy hierarchy of WordNet (see path-based similarities)
  } The distance between the involved words
[Diagram: a fragment of the WordNet hypernym hierarchy; Cat (meaning 1) under Feline/felid, Carnivore, Placental mammal; Mouse (meaning 1) under Rodent, Placental mammal; the path between them, through Placental mammal, has length 5]
  len(cat#1, mouse#1) = 5
  sim(cat#1, mouse#1) = 1.856
Example: "The cat is hunting the mouse"
The highest confidence among the meanings of "cat" is obtained by the one in the hierarchy above; the same holds for "mouse".
→ Meaning of cat: #1
→ Meaning of mouse: #1
Beyond WSD: Structural Disambiguation
[Diagram: a schema with root "movie" and attributes "title" (10 meanings), "awards" (3 meanings), "director" (5 meanings), "plot" (4 meanings); the intended sense of "plot" is "the story that is told in a movie"]
} The information given by such sense annotations allows users to clearly state the meaning of the terms
} However, manual annotation is often …
Beyond WSD: Structural Disambiguation
[Diagram: the sense of the term "plot" in schema A (movie, title, awards, director, plot) is chosen by comparing the contexts of its candidate senses with the schema:
  1. "a secret scheme to do something…" → related terms: scheme, governor, game, … (context B)
  2. "the story that is told in a movie…" → related terms: story, movie, characters, … (context C)
  sim(A, B) = low, sim(A, C) = high → the selected sense is plot#2]
Contents
• Document Processing
• Thesaurus
• Other text operations
• Hands-on with Python and NLTK
NLTK
} Suite of classes for several NLP tasks
  } Parsing, POS tagging, classifiers, …
} Several text-processing utilities and corpora
  } Brown, Penn Treebank corpus, …
} Library for the Python language
NLTK
l Text material
  § Raw text
  § Annotated text
l Tools
  § Part-of-speech taggers
  § Semantic analysis
  § …
l Resources
  § WordNet, treebanks
Linguistic Tasks
} Part-of-Speech Tagging
} Parsing
} WordNet
} Named Entity Recognition
} Information Retrieval
} Sentiment Analysis
} Document Clustering
} Topic Segmentation
} Authoring
} Machine Translation
} Summarization
} Information Extraction
} Spoken Dialog Systems
} Natural Language Generation
} Word Sense Disambiguation
Installing NLTK
l Download
  § http://nltk.org/
l Install
  § Windows: right-click and run the three installers for Python, PyYAML, and NLTK
  § Mac OS X: similar, but some terminal work is required for PyYAML
l Download the NLTK data
  >>> import nltk
  >>> nltk.download()
NLTK: Example Modules
} nltk.token: processing individual elements of text, such as words or sentences
} nltk.probability: modeling frequency distributions and probabilistic systems
} nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags
} nltk.parser: high-level interface for parsing texts
} nltk.chartparser: a chart-based implementation of the parser interface
} nltk.chunkparser: a regular-expression-based surface parser
NLTK
Ok, now let’s try something!
NLTK - Tokenization
} import nltk
} text = "This is a test"
} tokens = nltk.word_tokenize(text)
} print tokens
['This', 'is', 'a', 'test']
NLTK - POS Tagging
} print nltk.pos_tag(tokens)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]
NLTK - Stopwords removal & Stemming
} from nltk.corpus import stopwords
} wnl = nltk.WordNetLemmatizer()
} for t in tokens:
      if not t in stopwords.words('english'):
          print wnl.lemmatize(t)
This
test
NLTK - Other Stemmers
} porter = nltk.PorterStemmer()
} print [porter.stem(t) for t in tokens]
} lancaster = nltk.LancasterStemmer()
} print [lancaster.stem(t) for t in tokens]
} …
NLTK - WordNet
} from nltk.corpus import wordnet as wn
} print wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
} dog = wn.synset('dog.n.01')
} print dog.definition   # (NLTK 3.0): print dog.definition()
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
NLTK - WordNet
} print dog.examples   # (NLTK 3.0): print dog.examples()
['the dog barked all night']
} print dog.hypernyms()
[Synset('domestic_animal.n.01'), Synset('canine.n.02')]
NLTK - Path-Distance Similarity
} cat = wn.synset('cat.n.01')
} computer = wn.synset('computer.n.01')
} print dog.path_similarity(cat)
0.2
} print dog.path_similarity(computer)
0.0909090909091
NLTK - Wu-Palmer Similarity
} print dog.wup_similarity(cat)
0.857142857143
} print dog.wup_similarity(computer)
0.444444444444
NLTK - Resnik Similarity
} from nltk.corpus import wordnet_ic
} brown_ic = wordnet_ic.ic('ic-brown.dat')
} print dog.res_similarity(cat, brown_ic)
7.91166650904
} print dog.res_similarity(computer, brown_ic)
1.53183374322
NLTK - WordNet Morphy
} Look up forms not in WordNet, with the help of Morphy
} print wn.morphy('denied')
deny
} print wn.morphy('examples')
example
A (very) simple WSD algorithm with NLTK
} Let's try a simple algorithm for the disambiguation of terms, exploiting one of the term-similarity formulas
} Basic idea:
  } Given a list of terms (for instance, the keywords of a document)
  } Disambiguate each term t_i in the list by exploiting the context provided by the other terms
  } For each sense s_ti of t_i, compute a score (confidence) by:
    } Considering each term t_j in the context (t_i ≠ t_j)
    } Adding to the score the similarity between s_ti and the sense s_tj of t_j which is the most similar to s_ti
  } The sense s_ti with the highest score will be the most probable one
A (very) simple WSD algorithm with NLTK

def disambiguateTerms(terms):
    for t_i in terms:                          # t_i is the target term
        selSense = None
        selScore = 0.0
        for s_ti in wn.synsets(t_i, wn.NOUN):
            score_i = 0.0
            for t_j in terms:                  # t_j is a term in t_i's context window
                if (t_i == t_j): continue
                bestScore = 0.0
                for s_tj in wn.synsets(t_j, wn.NOUN):
                    tempScore = s_ti.wup_similarity(s_tj)
                    if (tempScore > bestScore): bestScore = tempScore
                score_i = score_i + bestScore
            if (score_i > selScore):
                selScore = score_i
                selSense = s_ti
        if (selSense is not None):
            print t_i, ": ", selSense, ", ", selSense.definition
            print "Score: ", selScore
        else:
            print t_i, ": --"

Paste and try! http://pastebin.com/A7sYM04x
A (very) simple WSD algorithm with NLTK
} from nltk.corpus import wordnet as wn
  …
} print "\nCAT-MOUSE:"
} disambiguateTerms(["cat", "mouse"])
CAT-MOUSE:
cat : Synset('cat.n.01') , feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats
Score: 0.814814814815
mouse : Synset('mouse.n.01') , any of numerous small rodents typically resembling diminutive rats having pointed snouts and small ears on elongated bodies with slender usually hairless tails
Score: 0.814814814815
A (very) simple WSD algorithm with NLTK
} print "\nCOMPUTER-MOUSE:"
} disambiguateTerms(["computer", "mouse"])
COMPUTER-MOUSE:
computer : Synset('computer.n.01') , a machine for performing calculations automatically
Score: 0.777777777778
mouse : Synset('mouse.n.04') , a hand-operated electronic device that controls the coordinates of a cursor on your computer screen as you move it around on a pad; on the bottom of the device is a ball that rolls on the surface of the pad
Score: 0.777777777778
NLTK, Python - Some more links
} Simple Python Porter stemming implementation
  https://bitbucket.org/mchaput/stemming/src/ec98e232f984/stemming/porter.py
} Python documentation
  http://python.org/doc/
} NLTK documentation
  http://nltk.org/api/nltk.html
  http://nltk.googlecode.com/svn/trunk/doc/
} "Natural Language Processing with Python" book
  http://nltk.org/book
Another interesting library
} Topia term extract (for Python)
  http://pypi.python.org/pypi/topia.termextract/
  } determines important terms within a given piece of content
  } uses linguistic tools such as Part-Of-Speech (POS) tagging and some simple statistical analysis to determine the terms and their strength
  } full source available
A visual way to explore a thesaurus
} Visuwords
  } http://www.visuwords.com
Bibliography
} R. Baeza-Yates, B. Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley
} W. B. Frakes, R. Baeza-Yates, "Information Retrieval", Prentice Hall
} R. Martoglia, "Facilitate IT-Providing SMEs in Software Development: a Semantic Helper for Filtering and Searching Knowledge", in proc. of SEKE 2011
} F. Mandreoli, R. Martoglia, P. Tiberio, "EXTRA: A System for Example-based Translation Assistance", in Machine Translation, Volume 20, Number 3, 2006
} F. Mandreoli, R. Martoglia, E. Ronchetti, "Versatile Structural Disambiguation for Semantic-aware Applications", in proc. of ACM CIKM 2005
} P. Resnik, "Using Information Content to Evaluate Semantic Similarity in a Taxonomy", IJCAI 1995
} P. Resnik, "Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language", JAIR 11, 95-130, 1999