Language and Information LIS 610 November 6, 2002 Nina Wacholder [email protected]. edu


Language and Information

LIS 610

November 6, 2002

Nina Wacholder

[email protected]

Language and Information 11/06/02 Nina Wacholder 2

Agenda

Role of language in information science

Current research: Human Computer Interaction with Electronic Indexes and Index Terms


Textual information

• Information conveyed by alphabets, digits and punctuation

• Organized into meaningful units recognized by some group of people


Other techniques for conveying information

• Spoken language

• Gesture

• Facial expression

• Sound

• Images (drawings, photographs …)


Language

• Uniquely human

• Learned

• Conventional


Understanding language is hard

• Expresses complex concepts

• Ambiguity – words, phrases and sentences have more than one meaning

• Synonymy – different words, phrases and sentences have the same meaning


Complex concepts

• Pencil

• Face

• Directions to Alexander Library

• Theory of relativity

• U.S. election law


Synonymy

child, kid, adolescent, baby

flammable, inflammable

I was walking up the street that day.

I was walking down the street that day.

Moxie wrote that report.

That report was written by Moxie.


Ambiguity – semantic

Bat

Make a bed

Moxie ate potatoes with a fork.

Moxie ate potatoes with fish.


Ambiguity – structural (syntactic)

Red airplane terminal

• [[red airplane] terminal]

• [red [airplane terminal]]

Moxie saw Toxie in the park with a telescope

• Moxie saw [Toxie in the park with a telescope]

• Moxie [saw] Toxie in the park [with a telescope]
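
The two bracketings of "Red airplane terminal" are the two ways of grouping a three-word sequence into nested pairs. As a minimal sketch, the Python below lists every such grouping for a word sequence. The function name `bracketings` is hypothetical, and a real parser would use grammar rules to accept or reject groupings rather than enumerate all of them.

```python
def bracketings(words):
    """Return every binary bracketing of a word sequence as a string."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Try every split point, then bracket the left and right parts recursively.
    for i in range(1, len(words)):
        for left in bracketings(words[:i]):
            for right in bracketings(words[i:]):
                results.append(f"[{left} {right}]")
    return results


for b in bracketings(["red", "airplane", "terminal"]):
    print(b)
```

A four-word phrase already has five possible groupings, so the number of structural readings grows quickly with phrase length.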


Natural language processing (NLP)

• Natural language

• Computer language


The NLP controversy: rules vs. statistics


NLP by rule

Lexicon (vocabulary)

• Det: a

• ProperName: Moxie

• Noun: report

• Verb: wrote

Syntactic rules

• NounPhrase[a report] → Det[a] Noun[report]

• NounPhrase[Moxie] → ProperName[Moxie]

• VerbPhrase[wrote a report] → Verb[wrote] NounPhrase[a report]

• Sentence[Moxie wrote a report] → NounPhrase[Moxie] VerbPhrase[wrote a report]
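
The lexicon and syntactic rules on this slide can be run as a very small program. The sketch below is only an illustration: the names `LEXICON`, `RULES` and `parse` are hypothetical, and the greedy bottom-up reduction it uses is a simplification that works for this toy grammar, not a general parsing algorithm.

```python
# Lexicon: each word maps to its category, as listed on the slide.
LEXICON = {"a": "Det", "Moxie": "ProperName", "report": "Noun", "wrote": "Verb"}

# Each rule rewrites a sequence of categories as a single parent category.
RULES = [
    ("NounPhrase", ("Det", "Noun")),
    ("NounPhrase", ("ProperName",)),
    ("VerbPhrase", ("Verb", "NounPhrase")),
    ("Sentence", ("NounPhrase", "VerbPhrase")),
]


def parse(sentence):
    """Reduce a sentence bottom-up; return the remaining (category, text) nodes."""
    # Start with one node per word (an unknown word raises KeyError).
    nodes = [(LEXICON[word], word) for word in sentence.split()]
    changed = True
    while changed:
        changed = False
        for parent, rhs in RULES:
            n = len(rhs)
            for i in range(len(nodes) - n + 1):
                if tuple(cat for cat, _ in nodes[i:i + n]) == rhs:
                    # Replace the matched nodes with one node for the parent.
                    text = " ".join(t for _, t in nodes[i:i + n])
                    nodes[i:i + n] = [(parent, text)]
                    changed = True
                    break
            if changed:
                break  # Restart from the first rule after every reduction.
    return nodes


print(parse("Moxie wrote a report"))
```

If the input is a sentence of this grammar, the result is a single `Sentence` node. Anything else leaves several unreduced nodes.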


NLP by statistics

• Luhn (1958)

• tf*idf (Salton and Buckley 1988)

• Maximum entropy (Berger, Della Pietra and Della Pietra 1996)
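
To make tf*idf concrete, here is a minimal sketch of one common variant: raw term frequency multiplied by the log of inverse document frequency. It is not necessarily the exact weighting scheme in the cited paper. The function name `tf_idf` and the three toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (raw tf times log idf)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection, not data from the study.
docs = ["asbestos workers cancer", "cancer deaths", "lung cancer"]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document, such as "cancer" here, gets weight zero, while a term confined to one document gets the highest weight.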


Information-access tasks with significant natural language component

• Information retrieval

• Information extraction

• Automatic summarization

• Question answering


Sparck Jones (2001)

• Task core vs. task context

• Information retrieval: 30-40% accuracy for systems in natural environment

• Information extraction: 50% for core systems

• Automatic summarization: no sound basis for core evaluation


Evaluation of Head Sorting Mechanism

Wacholder, Klavans and Evans (2000)

Task

compare domain-independent, corpus-independent methods for automatic identification of terms to represent a document or collection of documents

Methods for term identification

Head-sorted NPs (HS) (Wacholder 1998)

Keywords (KW)

Technical Terms (TT) (Justeson and Katz 1995)
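
As a rough illustration of what head sorting produces, the sketch below groups noun phrases under their head and sorts the groups. It assumes the head is simply the last word of the phrase, which is a simplification and not the cited method. The function name `head_sort` and the input list are hypothetical.

```python
from collections import defaultdict


def head_sort(noun_phrases):
    """Group noun phrases under their head, assumed here to be the last word."""
    groups = defaultdict(list)
    for phrase in noun_phrases:
        head = phrase.split()[-1]
        groups[head].append(phrase)
    # Sort heads alphabetically, and sort the phrases within each head group.
    return {head: sorted(groups[head]) for head in sorted(groups)}


phrases = ["lung cancer", "asbestos workers", "cancer", "160 workers", "workers"]
for head, group in head_sort(phrases).items():
    print(head, "->", group)
```

Sorting by head places related phrases such as "workers" and "asbestos workers" next to each other, which an alphabetical sort on the first word would not do.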


Examples of terms identified by indexing method

Keywords                 Head-sorted NPs            Technical terms
asbestos/asbestosis      workers                    cancer deaths
worker/workers/worked    asbestos workers           lung cancer
cancer                   160 workers                kent cigarette
death                    cancer                     dr. talcott
make                     lung cancer                cigarette filter
lorillard                asbestos                   u.s.
fiber                    cancer causing asbestos
dr.                      lung cancer deaths
…                        …


Ranking of terms by cumulative percentage

[Figure: line chart of cumulative percentage (y-axis, 0 to 1) against rating (x-axis, 0.5 to 5), with one series each for KWD, TT and SNP]


Ranking by cumulative number of terms

1 = best; 5 = worst

Number of terms ranked at or better than

Method    2    3    4    5
KW       27   75  124  166
HS       41   96  132  160
TT       15   21   21   21
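
The cumulative counts above can be turned into cumulative shares per method, which makes the three methods easier to compare. This is a sketch under one assumption: since 5 is the worst rating, the count at rating 5 is taken as the total number of rated terms for that method. The names `CUMULATIVE_COUNTS` and `cumulative_share` are hypothetical.

```python
# Cumulative counts copied from the table: terms ranked at or better than each rating.
CUMULATIVE_COUNTS = {
    "KW": {2: 27, 3: 75, 4: 124, 5: 166},
    "HS": {2: 41, 3: 96, 4: 132, 5: 160},
    "TT": {2: 15, 3: 21, 4: 21, 5: 21},
}


def cumulative_share(counts):
    """Divide each cumulative count by the total (assumed to be the count at 5)."""
    total = counts[5]
    return {rating: n / total for rating, n in counts.items()}


for method, counts in CUMULATIVE_COUNTS.items():
    shares = cumulative_share(counts)
    print(method, {rating: round(share, 2) for rating, share in shares.items()})
```

Read this way, all TT terms are rated 3 or better, but there are only 21 of them, compared with 160 or more for the other two methods.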


Summary of results

Head-sorted terms

• mixed quality terms

• good document coverage

Technical terms

• high quality terms

• poor document coverage

Keywords

• low quality terms

• good document coverage


ISATC Pilot Project

Nina Wacholder, PI

PhD Students: Lu Liu, Mark Sharp, Peng Song, Xiaojun Yuan


Research question

Null hypothesis: Properties of index terms do not affect information seeker’s selection of terms

What properties of index terms affect the selection of terms?

What effects do these properties have?


Material

Text

• Rice, McCreadie and Chang (2001)

Index terms

• Head sorted terms (Wacholder 1998)

• Technical terms (Justeson and Katz 1995)

• Human index terms


Experimental Searching and Browsing Interface (ESBI)

http://www.scils.rutgers.edu/cgi-bin/indexer.cg


Initial results


Future work

Further analysis of experimental data

• Compare subjects by type (e.g., undergraduate, MLIS)

• Effectiveness of searches (i.e., did they get the right answer)

• Overlap of words in index terms with words in question

• …

Evaluation of ESBI interface

Comparison of additional techniques for identifying terms

Use of different texts