
1

SIMS 290-2: Applied Natural Language Processing

Marti Hearst
October 25, 2004

2

Next Few Classes

This week: lexicons and ontologies
Today:

WordNet’s structure, computing term similarity

Wed: Guest lecture: Prof. Charles Fillmore on FrameNet

Next week: Enron labeling in class
The entire assignment will be due on Nov 15

Following week: Question-Answering

3

Text Categorization Assignment

Great job, you learned a lot!
– Comparing to a baseline
– Selecting features
– Comparing relative usefulness of features
– Training, testing, cross-validation

I learned a lot too! (from your results)
(I'll send you your feedback today)

4

Text Categorization Assignment

Features
– Boosting the weights of terms in the subject line is helpful.
– Stemming does help in some circumstances (it often works well with SVM, for example), but not always.

– Counter-intuitively, stemming can increase the number of features in our implementation, because it increases how many terms pass the minimum-document-occurrence cutoff.

– An example of the Porter stemmer usefully preserving a distinction that stemming would otherwise hide: it converts "gaseous" to "gase" rather than "gas", so "gas" as motorcycle fuel (in the motorcycles group) is not conflated with "gaseous" (in the science group). A quick check is sketched below.
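A minimal sketch of such a check, assuming NLTK's PorterStemmer (not necessarily the implementation used for the assignment; exact stems vary across Porter versions):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["gas", "gaseous"]:
        # Print each word next to its stem to see whether the two conflate.
        print(word, "->", stemmer.stem(word))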

5

Text Categorization Assignment

Features
– Tokens beyond the default alphabetic terms are helpful, perhaps in part because they capture domain-name information, but also because they capture technical terms.
– It's probably best to use the Weka feature selector to tell you what *kind* of features are performing well, but not to select those features for exclusive use.
– I'm surprised that no one tried bigrams or noun-noun compounds as features.

6

Text Categorization Assignment

Feature Weighting
– Tf.idf: almost everyone who tried it found it better than raw term frequency (there were exceptions). Binary feature weights with a minimum document-count threshold can be a good substitute.
– An interesting variation on tf.idf is to compute it in a class-based manner (a sketch follows below):
  – Weight terms higher that occur in only one class vs. the others.
  – A couple of students tried this and got good results on the diverse comparison, but less good results on the homogeneous one. This makes sense, since the measure would not help as much in distinguishing similar newsgroups that share many terms.
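A minimal sketch of the class-based variation (hypothetical function and names, not any student's implementation): weight a term by how few classes it occurs in, analogous to idf computed over classes rather than documents.

    import math
    from collections import Counter, defaultdict

    def class_based_weights(docs_by_class):
        """docs_by_class maps a class label to a list of tokenized documents."""
        # Per-class document frequency for each term.
        df = defaultdict(Counter)
        for label, docs in docs_by_class.items():
            for doc in docs:
                for term in set(doc):
                    df[label][term] += 1
        n_classes = len(docs_by_class)
        vocab = {term for counts in df.values() for term in counts}
        weights = {}
        for term in vocab:
            classes_with_term = sum(1 for counts in df.values() if counts[term])
            # Like idf, but over classes: maximal when a term occurs in only
            # one class, zero when it occurs in every class.
            weights[term] = math.log(n_classes / classes_with_term)
        return weights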

7

Text Categorization Assignment

Classifiers
– Naïve Bayes Multinomial was a clear winner.
– SVM worked well most of the time, but not as well as NBM.
– Naïve Bayes seemed to be more robust to unseen information; the kernel estimator seems to improve on the default Naïve Bayes settings.
– VotedPerceptron worked very well, but it only does binary classification, so people who found it did very well on the diverse set could not transfer it to the homogeneous one.

8

Today

– Lexicons, Semantic Nets, and Ontologies
– The Structure of WordNet
– Computing Similarities
– Automatic Acquisition of New Terms

9

Lexicons, Semantic Nets, and Ontologies

Lexicons are (typically) word lists augmented with some subset of:
– Parts of speech
– Different word senses
– Synonyms

Semantic Nets
– Include links to other terms: IS-A, Part-Of, etc.
– Sometimes this term is used for what I call ontologies

Ontologies
– Represent concepts and relationships among concepts
– Language independent (in principle)
– Sometimes include inference rules
– Different from the definition in philosophy:

– The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality

10
Adapted from slide by W. Ceusters, www.landc.be

One approach to linking ontologies and lexicons

[Diagram: a Formal Domain Ontology is linked, through the Cassandra Linguistic Ontology, to per-language Lexicon and Grammar pairs (Language A, Language B); proprietary terminologies (MedDRA, ICD, SNOMED, ICPC, others) map onto the formal domain ontology.]

11
Adapted from slide by W. Ceusters, www.landc.be

Example Ontological Relation Types

[Diagram of spatial relation types, including: HAS-PARTIAL-SPATIAL-OVERLAP, IS-TOPO-INSIDE-OF, IS-GEO-INSIDE-OF, IS-INSIDE-CONVEX-HULL-OF, IS-PARTLY-IN-CONVEX-HULL-OF, IS-OUTSIDE-CONVEX-HULL-OF, HAS-DISCONNECTED-REGION, HAS-EXTERNAL-CONNECTING-REGION, HAS-DISCRETED-REGION, HAS-TANGENTIAL-SPATIAL-PART, HAS-NON-TANGENTIAL-SPATIAL-PART, IS-SPATIAL-EQUIVALENT-OF, IS-TANGENTIAL-SPATIAL-PART-OF, IS-NON-TANGENTIAL-SPATIAL-PART-OF, HAS-PROPER-SPATIAL-PART, IS-PROPER-SPATIAL-PART-OF, HAS-SPATIAL-PART, IS-SPATIAL-PART-OF, HAS-OVERLAPPING-REGION, HAS-CONNECTING-REGION, HAS-SPATIAL-POINT-REFERENCE]

12
Adapted from slide by W. Ceusters, www.landc.be

Example of applying an ontology: joint anatomy

– joint HAS-HOLE joint space
– joint capsule IS-OUTER-LAYER-OF joint
– meniscus IS-INCOMPLETE-FILLER-OF joint space; IS-TOPO-INSIDE joint capsule; IS-NON-TANGENTIAL-MATERIAL-PART-OF joint
– joint IS-CONNECTOR-OF bone X; IS-CONNECTOR-OF bone Y
– synovia IS-INCOMPLETE-FILLER-OF joint space
– synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space

This doesn’t include the linguistic side

13
Adapted from slide by W. Ceusters, www.landc.be

Linking Lexicons and Ontologies

[Diagram: modeling "having a healthcare phenomenon". Concepts: Generalised Possession, Healthcare phenomenon, Human, Patient, Patient at risk, Risk Factor, Patient at risk for osteoporosis, Risk factor for osteoporosis, Osteoporosis. Relations: IS-A, Has-possessor, Has-possessed, Is-possessor-of, Has-Healthcare-phenomenon, Is-Risk-Factor-Of.]

14
Adapted from slide by W. Ceusters, www.landc.be

Linking different lexicons

[Diagram: terms from several sources connected through a common concept layer. Nodes: MESH-2001 "Seizures", MESH-2001 "Convulsions", Snomed-RT "Convulsion", Snomed-RT "Seizure", L&C Convulsion, L&C Seizure, L&C Health crisis, L&C Epileptic convulsion. Links: IS-A, IS-narrower-than, Has-CCC.]

15

WordNet

– A big lexicon with properties of a semantic net
– Started as a language project by Dr. George Miller and Dr. Christiane Fellbaum at Princeton
– First became available in 1990
– Now on version 2.0

16

WordNet

Huge amounts of research (and many products) use it.

17

WordNet Relations

Original core relations:
– Synonymy
– Polysemy
– Metonymy
– Hyponymy/Hyperonymy
– Meronymy
– Antonymy

New, useful additions for NLP:
– Glosses
– Links between derivationally and semantically related noun/verb pairs
– Domain/topical terms
– Groups of similar verbs

Others on the way:
– Disambiguation of terms in glosses
– Topical clustering

18

Synonymy

Different ways of expressing related concepts
Examples:

cat, feline, Siamese cat

Synonyms are almost never truly substitutable:

– Used in different contexts
– Have different implications

– This is a point of contention.

19

Polysemy

Most words have more than one sense.
Homonymy: same word, different meaning
– bank (river)
– bank (financial)
Polysemy: different senses of the same word
– That dog has floppy ears.
– She has a good ear for jazz.
– bank (financial) has several related senses: the building, the institution, the notion of where money is stored

20

Metonymy

Use one aspect of something to stand for the whole

– The building stands for the institution of the bank.
– Newscast: "The White House released new figures today."
– Waitperson: "The ham sandwich spilled his drink."

21

Hyponymy/Hyperonymy

IS-A relation
Related to superordinate- and subordinate-level categories
– hyponym(robin, bird)
– hyponym(bird, animal)
– hyponym(emu, bird)

A is a hypernym of B if B is a type of A.
A is a hyponym of B if A is a type of B.

22

Meronymy

Part-of relation
– part-of(beak, bird)
– part-of(bark, tree)

Transitive conceptually, but not lexically:
– The knob is a part of the door.
– The door is a part of the house.
– ? The knob is a part of the house ?

23

Antonymy

Lexical opposites
– antonym(large, small)
– antonym(big, small)
– antonym(big, little)
– but not (large, little)

Many antonymous relations can be reliably detected by looking for statistical correlations in large text collections (Justeson & Katz 91).

24

Using WordNet in Python

from wordnet import *
from wntools import *


26

More Readable Output

27

Using WordNet to Determine Similarity

The "meet" function in the Python WordNet tool finds the closest common parent of two terms.
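The pywordnet package shown above is long obsolete; a rough modern equivalent of meet (assuming NLTK with its WordNet corpus downloaded) is lowest_common_hypernyms:

    from nltk.corpus import wordnet as wn

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    # Closest common parent of the two concepts in the IS-A hierarchy.
    print(dog.lowest_common_hypernyms(cat))   # [Synset('carnivore.n.01')]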

28

Similarity by Path Length

Count the edges (IS-A links) between two concepts and scale.

Leacock and Chodorow, 1998:
    lch(c1, c2) = -log( length(c1, c2) / (2 * max-depth) )

Wu and Palmer, 1994:
    wup(c1, c2) = 2 * depth(lcs(c1, c2)) / ( depth(c1) + depth(c2) )
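Both measures ship with NLTK (a reimplementation, not the Python tool from the slides); a minimal sketch, again assuming the WordNet corpus is downloaded:

    from nltk.corpus import wordnet as wn

    dog = wn.synset('dog.n.01')
    cat = wn.synset('cat.n.01')
    print(dog.lch_similarity(cat))   # Leacock-Chodorow: scaled path length
    print(dog.wup_similarity(cat))   # Wu-Palmer: depths relative to the lcs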

29

Problems with Path Length

– The lengths of the paths are irregular across the hierarchies.
– Words that should be in the same hierarchy might not be.
– How can we relate terms that are not in the same hierarchies?

The "tennis problem":
– Player
– Racquet
– Ball
– Net
are all in separate hierarchies. WordNet is working on developing such linkages.


34

Similarity by Information Content

IC is estimated from a corpus of text (Resnik, 1995):
    IC(concept) = -log(P(concept))

– Specific concept: high IC (pitchfork)
– General concept: low IC (instrument)

To estimate it:
– Count occurrences of "concept": given a word, increment the count of all concepts associated with that word
  – increment bank as financial institution and also as river shore
  – assume that senses occur uniformly, lacking evidence to the contrary (e.g., sense-tagged text)
– Counts propagate up the hierarchy (a sketch follows below)
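A minimal sketch of the counting scheme (hypothetical helpers, not Resnik's code): each word occurrence is split uniformly over its noun senses, and each sense's share is also credited to all of its hypernyms.

    import math
    from collections import Counter
    from nltk.corpus import wordnet as wn

    def concept_counts(tokens):
        """Estimate concept frequencies from a list of word tokens."""
        counts = Counter()
        for word, freq in Counter(tokens).items():
            synsets = wn.synsets(word, pos=wn.NOUN)
            if not synsets:
                continue
            share = freq / len(synsets)           # uniform over senses
            for s in synsets:
                counts[s] += share
                ancestors = set()                 # propagate up the hierarchy
                for path in s.hypernym_paths():
                    ancestors.update(path[:-1])   # each path ends at s itself
                for ancestor in ancestors:
                    counts[ancestor] += share
        return counts

    def information_content(counts, concept, root):
        # IC(concept) = -log P(concept); assumes the concept was observed.
        return -math.log(counts[concept] / counts[root])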

35

Information Content as Similarity

Resnik, 1995:
    res(c1, c2) = IC(lcs(c1, c2))

Jiang and Conrath, 1997:
    jcn(c1, c2) = 1 / ( IC(c1) + IC(c2) - 2 * res(c1, c2) )

Lin, 1998:
    lin(c1, c2) = 2 * res(c1, c2) / ( IC(c1) + IC(c2) )

All of these (and more!) are implemented in a Perl package called SenseRelate (Pedersen et al.):
http://wn-similarity.sourceforge.net/
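NLTK also ships these three measures with precomputed IC files (a sketch; assumes the wordnet and wordnet_ic corpora have been downloaded):

    from nltk.corpus import wordnet as wn, wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')    # IC counts from the Brown corpus
    dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
    print(dog.res_similarity(cat, brown_ic))    # Resnik
    print(dog.jcn_similarity(cat, brown_ic))    # Jiang-Conrath
    print(dog.lin_similarity(cat, brown_ic))    # Lin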

36

Rearranging WordNet

– Try to fix the top-level hierarchies
– Parse the glosses for more information
– eXtended WordNet project: http://xwn.hlt.utdallas.edu/

37

Augmenting WordNet

Lexico-syntactic Patterns (Hearst 92, 97)

38

Augmenting WordNet

Lexico-syntactic Patterns (Hearst 92, 97)
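The canonical pattern is "NP0 such as NP1, NP2, ... (and|or) NPn", suggesting hyponym(NPi, NP0) for each listed NP. A crude regex sketch over raw text (real implementations match over parsed noun phrases, and this toy version can overrun the end of the list):

    import re

    # "X such as A, B, and C" suggests hyponym(A, X), hyponym(B, X), ...
    SUCH_AS = re.compile(r"(\w+)\s+such\s+as\s+([\w\s,]+)")
    ITEM_SEP = re.compile(r",?\s*\b(?:and|or)\b\s+|\s*,\s*")

    def hyponym_candidates(text):
        """Yield (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
        for match in SUCH_AS.finditer(text):
            hypernym = match.group(1)
            for item in ITEM_SEP.split(match.group(2)):
                item = item.strip()
                if item:
                    yield (item, hypernym)

    # >>> list(hyponym_candidates("authors such as Herrick, Goldsmith, and Shakespeare"))
    # [('Herrick', 'authors'), ('Goldsmith', 'authors'), ('Shakespeare', 'authors')]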


42

Acquisition using the Web

Towards Terascale Knowledge Acquisition, Pantel and Lin ’04
– Use a co-occurrence model and a huge collection (the Web) to find similar terms.
– Input: a cluster of related words.
– Feature vectors are computed for each word:
  – e.g., the context "catch ___"
  – compute mutual information between the word and the context (a sketch follows below)
– "Average" the features for each class to create a grammatical template for each class.
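A minimal sketch of the mutual-information step (illustrative only; Pantel and Lin compute this over parser-derived features at web scale):

    import math

    def pmi(word_context_count, word_count, context_count, total):
        """Pointwise mutual information between a word and a context feature:
        pmi(w, c) = log( P(w, c) / (P(w) * P(c)) ), from co-occurrence counts.
        """
        p_wc = word_context_count / total
        p_w = word_count / total
        p_c = context_count / total
        return math.log(p_wc / (p_w * p_c))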

43

Acquisition using the Web

Use this template to find new examples of this class of terms (but it makes many errors)

44

Next Time

FrameNet
– A background paper is on the class website.
– (Not required to read it beforehand.)

45

Acquisition using the Web