Automatic Indexing


  • Automatic Indexing
    Hsin-Hsi Chen

  • Indexing
    indexing: assign identifiers to text items
    assign: manual vs. automatic indexing
    identifiers:
      objective vs. nonobjective text identifiers (cataloging rules define, e.g., author names, publisher names, dates of publication, ...)
      controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules)
      single-term vs. term-phrase

  • Two Issues
    Issue 1: indexing exhaustivity
      exhaustive: assign a large number of terms
      nonexhaustive
    Issue 2: term specificity
      broad terms (generic): cannot distinguish relevant from nonrelevant items
      narrow terms (specific): retrieve relatively fewer items, but most of them are relevant

  • Parameters of retrieval effectiveness
    Recall
    Precision
    Goal: high recall and high precision

  • Retrieved vs. relevant items (contingency table)

                        Relevant items    Nonrelevant items
    Retrieved part            a                   b
    Not retrieved             c                   d

    Recall = a / (a + c); Precision = a / (a + b)
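The two measures follow directly from the four contingency cells; a minimal sketch (the counts used below are hypothetical):

```python
# Recall and precision from the contingency cells:
# a = retrieved and relevant, b = retrieved and nonrelevant,
# c = not retrieved but relevant, d = not retrieved and nonrelevant.
def recall(a, c):
    # Fraction of all relevant items that were retrieved.
    return a / (a + c) if a + c else 0.0

def precision(a, b):
    # Fraction of all retrieved items that are relevant.
    return a / (a + b) if a + b else 0.0

print(recall(30, 20))     # 0.6  (30 of 50 relevant items retrieved)
print(precision(30, 10))  # 0.75 (30 of 40 retrieved items relevant)
```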

  • A Joint Measure
    F-score:
      F = (beta^2 + 1) * P * R / (beta^2 * R + P)
    beta is a parameter that encodes the relative importance of recall and precision:
      beta = 1: equal weight
      beta > 1: precision is more important
  • Choices of Recall and Precision
    Both recall and precision vary from 0 to 1.
    In principle, the average user wants to achieve both high recall and high precision.
    In practice, a compromise must be reached because simultaneously optimizing recall and precision is not normally achievable.

  • Choices of Recall and Precision (Continued)
    Particular choices of indexing and search policies have produced variations in performance ranging from 0.8 precision and 0.2 recall to 0.1 precision and 0.8 recall.
    In many circumstances, recall and precision both in the range 0.5 to 0.6 are more satisfactory for the average user.

  • Term-Frequency Considerations
    Function words
      for example, "and", "or", "of", "but"
      the frequencies of these words are high in all texts
    Content words
      words that actually relate to document content
      varying frequencies in the different texts of a collection
      indicate term importance for content

  • A Frequency-Based Indexing Method
    1. Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words.
    2. Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di.
    3. Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
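The three steps can be sketched in a few lines; the stop list, threshold, and sample document below are illustrative, not the slides' actual data:

```python
# Sketch of the frequency-based indexing method: stop-list filtering,
# term-frequency counting, and threshold selection.
from collections import Counter

STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "is", "are", "for"}

def index_terms(document_text, threshold=1):
    words = [w.lower().strip(".,") for w in document_text.split()]
    # Step 1: drop function words; Step 2: count term frequencies.
    tf = Counter(w for w in words if w not in STOP_LIST)
    # Step 3: keep only terms whose frequency exceeds the threshold T.
    return {term: freq for term, freq in tf.items() if freq > threshold}

doc = "retrieval systems and retrieval models are the core of retrieval research"
print(index_terms(doc))  # {'retrieval': 3}
```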

  • Discussion
    High-frequency terms favor recall.
    High precision requires the ability to distinguish individual documents from each other.
    A high-frequency term is good for precision only when its frequency is not equally high in all documents.

  • Inverse Document Frequency
    Inverse document frequency (IDF) for term Tj:
      idfj = log(N / dfj)
    where dfj (document frequency of term Tj) is the number of documents in which Tj occurs, and N is the collection size.
    Terms that fulfill both recall and precision occur frequently in individual documents but rarely in the remainder of the collection.

  • New Term Importance Indicator
    weight wij of a term Tj in a document Di:
      wij = tfij x idfj = tfij x log(N / dfj)
    Procedure:
      Eliminate common function words.
      Compute the value of wij for each term Tj in each document Di.
      Assign to the documents of a collection all terms with sufficiently high (tf x idf) factors.
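A minimal sketch of the wij = tfij x log(N / dfj) weighting; the toy three-document collection is hypothetical:

```python
# tf-idf weighting over a toy collection.
import math
from collections import Counter

docs = [["information", "retrieval", "systems"],
        ["database", "systems"],
        ["information", "theory"]]
N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for d in docs for term in set(d))

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

w = tfidf(docs[0])
# "retrieval" appears in 1 of 3 documents, so its weight is log 3.
print(round(w["retrieval"], 2))  # 1.1
```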

  • Term-Discrimination Value
    Useful index terms distinguish the documents of a collection from each other.
    Document space:
      two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together
      when a high-frequency term without discrimination power is assigned, it increases the document space density

  • A Virtual Document Space
    [Figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator]

  • Good Term Assignment
    When a term is assigned to the documents of a collection, the few items to which the term is assigned will be distinguished from the rest of the collection.
    This should increase the average distance between the items in the collection and hence produce a document space less dense than before.

  • Poor Term Assignment
    A high-frequency term is assigned that does not discriminate between the items of a collection.
    Its assignment will render the documents more similar.
    This is reflected in an increase in document space density.

  • Term Discrimination Value
    Definition: dvj = Q - Qj
    where Q and Qj are the space densities before and after the assignment of term Tj.
    dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term.
  • Document frequency vs. discrimination value
      Low frequency:    dvj = 0
      Medium frequency: dvj > 0
      High frequency:   dvj < 0
  • Another Term Weighting
    wij = tfij x dvj
    Compared with idfj:
      idfj decreases steadily with increasing document frequency
      dvj increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger

  • Document Centroid
    Issue: efficiency problem (N(N-1) pairwise similarities)
    Document centroid C = (c1, c2, c3, ..., ct) with
      cj = (1/N) * sum over k of dkj
    where dkj is the j-th term in document k.
    Space density: Q = (1/N) * sum over k of sim(C, Dk)
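With the centroid, the density needs only N similarity computations instead of N(N-1) pairwise comparisons. A sketch assuming cosine similarity and toy binary document vectors (both are illustrative choices):

```python
# Centroid-based space density: one similarity per document.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(docs):
    n = len(docs)
    # Centroid: component-wise mean of the document vectors.
    centroid = [sum(col) / n for col in zip(*docs)]
    return sum(cosine(centroid, d) for d in docs) / n

# A tightly clustered collection is denser than a spread-out one.
tight = [[1, 1, 0], [1, 1, 0], [1, 1, 1]]
spread = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(space_density(tight) > space_density(spread))  # True
```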

  • Discussion
    dvj and idfj: global properties of terms in a document collection
    Ideal term characteristics: occurrence differences between the relevant and nonrelevant items of a collection

  • Probabilistic Term Weighting
    Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant items of a collection
    Definition: consider a collection of document vectors of the form x = (x1, x2, x3, ..., xt)

  • Probabilistic Term Weighting
    Pr(x|rel), Pr(x|nonrel): occurrence probabilities of item x in the relevant and nonrelevant document sets
    Pr(rel), Pr(nonrel): the items' a priori probabilities of relevance and nonrelevance
    Further assumptions: terms occur independently in relevant documents; terms occur independently in nonrelevant documents.

  • Derivation Process

  • Given a document D = (d1, d2, ..., dt), the retrieval value of D is
      log( Pr(D|rel) Pr(rel) / (Pr(D|nonrel) Pr(nonrel)) )
    where di is the weight of term xi in D.

  • Assume each di is either 0 or 1:
      0: the i-th term is absent from D
      1: the i-th term is present in D
    pi = Pr(xi=1|rel),    1 - pi = Pr(xi=0|rel)
    qi = Pr(xi=1|nonrel), 1 - qi = Pr(xi=0|nonrel)

  • The retrieval value contributed by each term Tj present in a document (i.e., dj = 1) is the term relevance weight:
      trj = log( pj (1 - qj) / (qj (1 - pj)) )

  • New Term Weighting
    term-relevance weight of term Tj: trj
    indexing value of term Tj in document Di: wij = tfij * trj
    Issue: it is necessary to characterize both the relevant and nonrelevant documents of a collection.
      how to find a representative document sample?
      feedback information from retrieved documents?

  • Estimation of Term-Relevance
    Little is known in advance about the relevance properties of terms.
    The occurrence probability of a term in the nonrelevant documents, qj, is approximated by its occurrence probability in the entire collection: qj = dfj / N.
    The occurrence probabilities of the terms in the small number of relevant items are set to a constant value: pj = 0.5 for all j.

  • With these estimates, trj = log( (N - dfj) / dfj ); when N is sufficiently large, N - dfj is approximately N, so trj is approximately log( N / dfj ) = idfj.
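A small numerical check of this collapse (the df and N values are arbitrary toy numbers):

```python
# With p_j = 0.5 and q_j = df_j / N, the term-relevance weight
# tr_j = log(p_j(1-q_j) / (q_j(1-p_j))) reduces to log((N - df_j) / df_j),
# which approaches idf_j = log(N / df_j) for large N.
import math

def term_relevance(df, N, p=0.5):
    q = df / N
    return math.log((p * (1 - q)) / (q * (1 - p)))

def idf(df, N):
    return math.log(N / df)

# For a rare term in a large collection the two are nearly identical.
print(round(term_relevance(5, 100000), 3), round(idf(5, 100000), 3))  # 9.903 9.903
```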

  • Estimation of Term-Relevance
    Estimate the number of relevant items rj in the collection that contain term Tj as a function of the known document frequency dfj of the term Tj:
      pj = rj / R
      qj = (dfj - rj) / (N - R)
    where R is an estimate of the total number of relevant items in the collection.

  • Term Relationships in Indexing
    Single-term indexing:
      single terms are often ambiguous
      many single terms are either too specific or too broad to be useful
    Complex text identifiers:
      subject experts and trained indexers
      linguistic analysis algorithms, e.g., NP chunkers
      term-grouping or term-clustering methods

  • Tree-Dependence Model
    Only certain dependent term pairs are actually included; the other term pairs and all higher-order term combinations are disregarded.
    Example: sample term-dependence tree

  • (children, school), (school, girls), (school, boys)
    (children, girls), (children, boys)
    (school, girls, boys), (children, achievement, ability)

  • Term Classification (Clustering)

  • Term Classification (Clustering)
    Column part: group terms whose corresponding column representations reveal similar assignments to the documents of the collection.
    Row part: group documents that exhibit sufficiently similar term assignments.

  • Linguistic Methodologies
    Indexing phrases: nominal constructions including adjectives and nouns
    Assign syntactic class indicators (i.e., parts of speech) to the words occurring in document texts.
    Construct word phrases from sequences of words exhibiting certain allowed syntactic markers (noun-noun and adjective-noun sequences).

  • Term-Phrase Formation
    Term phrase: a sequence of related text words that carries a more specific meaning than the single terms, e.g., "computer science" vs. "computer"
  • Simple Phrase-Formation Process
    The principal phrase component (phrase head): a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value.
    The other components of the phrase: medium- or low-frequency terms with stated co-occurrence relationships with the phrase head.
    Common function words are not used in the phrase-formation process.

  • An Example
    "Effective retrieval systems are essential for people in need of information."
    are, for, in, and of: common function words
    system, people, and information: phrase heads

  • The Formatted Term-Phrases
    effective retrieval systems / essential people / need information*
    *: phrases assumed to be useful for content identification
    (2/5, 5/12)

  • The Problems
    A phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases.
    Additional syntactic criteria for phrase heads and phrase components may provide further control in phrase formation.

  • Additional Term-Phrase Formation Steps
    Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun and noun-noun sequences (but not, e.g., adverb-adjective or adverb-noun).
    The phrase elements are all chosen from within the same syntactic unit, such as subject phrase, object phrase, and verb phrase.

  • Consider Syntactic Units
    "effective retrieval systems are essential for people in need of information"
    subject phrase: effective retrieval systems
    verb phrase: are essential
    object phrase: people in need of information

  • Phrases within Syntactic Components
    Adjacent phrase heads and components within syntactic components:
      retrieval systems*
      people need
      need information*
    Phrase heads and components co-occurring within syntactic components:
      effective systems
    [subj effective retrieval systems] [vp are essential] for [obj people need information]
    (2/3)

  • Problems
    More stringent phrase-formation criteria produce fewer phrases, both good and bad, than less stringent methodologies.
    Prepositional phrase attachment, e.g., "The man saw the girl with the telescope."
    Anaphora resolution, e.g., "He dropped the plate on his foot and broke it."

  • Problems (Continued)
    Any phrase-matching system must be able to deal with the problems of:
      synonym recognition
      differing word orders
      intervening extraneous words
    Example: "retrieval of information" vs. "information retrieval"

  • Equivalent Phrase Formulation
    Base form: text analysis system
    Variants:
      system analyzes the text
      text is analyzed by the system
      system carries out text analysis
      text is subjected to system analysis
    Related term substitution:
      text: documents, information items
      analysis: processing, transformation, manipulation
      system: program, process

  • Thesaurus-Group Generation
    Thesaurus transformation:
      broadens index terms whose scope is too narrow to be useful in retrieval
      a thesaurus must assemble groups of related specific terms under more general, higher-level class indicators
  • Sample Classes of Roget's Thesaurus

  • Methods of Thesaurus Construction
    dij: value of term Tj in document Di
    sim(Tj, Tk): similarity measure between Tj and Tk, computed from their column vectors (possible measures include, e.g., cosine and Dice coefficients)

  • Term-classification strategies
    single-link: each term must have a similarity exceeding a stated threshold value with at least one other term in the same class; produces fewer, much larger term classes
    complete-link: each term has a similarity to all other terms in the same class that exceeds the threshold value; produces a large number of small classes
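The two criteria can be contrasted with a toy greedy merging procedure; the term names, similarity matrix, and threshold below are all hypothetical:

```python
# Single-link vs. complete-link term clustering over a term-term
# similarity matrix.
sim = {
    ("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.2,
    ("a", "d"): 0.1, ("b", "d"): 0.1, ("c", "d"): 0.1,
}
terms = ["a", "b", "c", "d"]

def s(x, y):
    return 1.0 if x == y else sim.get((x, y), sim.get((y, x), 0.0))

def cluster(threshold, complete=False):
    classes = [[t] for t in terms]
    merged = True
    while merged:
        merged = False
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                pairs = [s(x, y) for x in classes[i] for y in classes[j]]
                # Single-link: one pair above threshold suffices;
                # complete-link: every pair must exceed the threshold.
                ok = min(pairs) > threshold if complete else max(pairs) > threshold
                if ok:
                    classes[i] += classes.pop(j)
                    merged = True
                    break
            if merged:
                break
    return sorted(sorted(c) for c in classes)

# Single-link chains a-b-c into one class; complete-link keeps c apart
# because sim(a, c) = 0.2 does not exceed the threshold.
print(cluster(0.5))                 # [['a', 'b', 'c'], ['d']]
print(cluster(0.5, complete=True))  # [['a', 'b'], ['c'], ['d']]
```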

  • The Indexing Prescription (1)
    1. Identify the individual words in the document collection.
    2. Use a stop list to delete the function words from the texts.
    3. Use a suffix-stripping routine to reduce each remaining word to word-stem form.
    4. For each remaining word stem Tj in document Di, compute wij.
    5. Represent each document Di by Di = (T1, wi1; T2, wi2; ...; Tt, wit)

  • Word Stemming
    effectiveness --> effective --> effect
    picnicking --> picnic (not picnick)

  • Some Morphological Rules
    Restore a silent e after suffix removal from certain words, to produce "hope" from "hoping" rather than "hop".
    Delete certain doubled consonants after suffix removal, so as to generate "hop" from "hopping" rather than "hopp".
    Use a final y for an i in forms such as "easier", so as to generate "easy" instead of "easi".
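A simplified sketch of these repair rules applied after naive "-ing"/"-er" stripping; the suffix list and the C-V-C heuristic for restoring the silent e are assumptions for illustration, not the slides' actual stemmer:

```python
# Naive suffix stripping followed by the three morphological repairs.
VOWELS = set("aeiou")

def strip_suffix(word):
    for suffix in ("ing", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Rule: undouble a final consonant (hopping -> hopp -> hop).
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS:
                return stem[:-1]
            # Rule: turn a final i into y (easier -> easi -> easy).
            if stem.endswith("i"):
                return stem[:-1] + "y"
            # Rule: restore a silent e after consonant-vowel-consonant
            # (hoping -> hop -> hope).
            if (stem[-1] not in VOWELS and stem[-2] in VOWELS
                    and stem[-3] not in VOWELS):
                return stem + "e"
            return stem
    return word

print(strip_suffix("hoping"))   # hope
print(strip_suffix("hopping"))  # hop
print(strip_suffix("easier"))   # easy
```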

  • The Indexing Prescription (2)
    1. Identify individual text words.
    2. Use a stop list to delete common function words.
    3. Use automatic suffix stripping to produce word stems.
    4. Compute term-discrimination values for all word stems.
    5. Use thesaurus class replacement for all low-frequency terms with discrimination values near zero.
    6. Use a phrase-formation process for all high-frequency terms with negative discrimination values.
    7. Compute weighting factors for complex indexing units.
    8. Assign to each document single terms, term phrases, and thesaurus classes, with weights.

  • Query vs. Document
    Differences:
      query texts are short
      fewer terms are assigned to queries
      the occurrence frequency of query terms rarely exceeds 1
    Q = (wq1, wq2, ..., wqt) where wqj: inverse document frequency
    Di = (di1, di2, ..., dit) where dij: term frequency * inverse document frequency

  • Query vs. Document
    When non-normalized documents are used, longer documents with more assigned terms have a greater chance of matching particular query terms than do the shorter document vectors; a length-normalized similarity measure compensates for this.
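One common normalization is the cosine similarity between query and document vectors (a standard choice; the slide's exact formulas were not preserved, and the vectors below are illustrative):

```python
# Cosine similarity between sparse query and document vectors; the
# document norm in the denominator penalizes long documents.
import math

def cosine_sim(query, doc):
    terms = set(query) | set(doc)
    dot = sum(query.get(t, 0.0) * doc.get(t, 0.0) for t in terms)
    nq = math.sqrt(sum(w * w for w in query.values()))
    nd = math.sqrt(sum(w * w for w in doc.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"retrieval": 1.2, "systems": 0.8}
short_doc = {"retrieval": 2.0, "systems": 1.0}
long_doc = {"retrieval": 2.0, "systems": 1.0, "x": 5.0, "y": 5.0, "z": 5.0}
# Both documents match the same query terms equally, but the longer
# document is penalized by its larger norm.
print(cosine_sim(q, short_doc) > cosine_sim(q, long_doc))  # True
```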

  • Relevance Feedback
    Terms present in previously retrieved documents that have been identified as relevant to the user's query are added to the original formulations.
    The weights of the original query terms are altered by replacing the inverse-document-frequency portion of the weights with term-relevance weights, obtained by using the occurrence characteristics of the terms in the previously retrieved relevant and nonrelevant documents of the collection.

  • Relevance Feedback
    Q = (wq1, wq2, ..., wqt)
    Di = (di1, di2, ..., dit)
    The new query may take the form
      Q' = a{wq1, wq2, ..., wqt} + b{wqt+1, wqt+2, ..., wqt+m}
    The weights of the newly added terms Tt+1 to Tt+m may consist of a combined term-frequency and term-relevance weight.
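The a/b reformulation can be sketched directly; the coefficients and vectors below are illustrative, and a single relevant document's term weights stand in for the feedback set:

```python
# Query reformulation: keep original terms (scaled by a) and add terms
# from a retrieved relevant document (scaled by b).
def feedback_query(query, relevant_doc_terms, a=1.0, b=0.5):
    new_q = {t: a * w for t, w in query.items()}
    for t, w in relevant_doc_terms.items():
        # New terms are added; terms already in the query are boosted.
        new_q[t] = new_q.get(t, 0.0) + b * w
    return new_q

q = {"retrieval": 1.0}
rel = {"retrieval": 0.4, "indexing": 0.8}
nq = feedback_query(q, rel)
print(round(nq["retrieval"], 2), round(nq["indexing"], 2))  # 1.2 0.4
```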

  • Final Indexing
    1. Identify individual text words.
    2. Use a stop list to delete common words.
    3. Use suffix stripping to produce word stems.
    4. Replace low-frequency terms with thesaurus classes.
    5. Replace high-frequency terms with phrases.
    6. Compute term weights for all single terms, phrases, and thesaurus classes.
    7. Compare query statements with document vectors.
    8. Identify some retrieved documents as relevant and some as nonrelevant to the query.

  • Final Indexing (Continued)
    9. Compute term-relevance factors based on the available relevance assessments.
    10. Construct new queries with added terms from the relevant documents, and term weights based on combined frequency and term-relevance weights.
    11. Return to step 7: compare query statements with document vectors.

  • Summary of expected effectiveness of automatic indexing
    Basic single-term automatic indexing: baseline
    Use of a thesaurus to group related terms in the given topic area: +10% to +20%
    Use of automatically derived term associations obtained from joint term assignments found in sample document collections: 0% to -10%
    Use of automatically derived term phrases obtained by using co-occurring terms found in the texts of sample collections: +5% to +10%
    Use of one iteration of relevance feedback to add new query terms extracted from previously retrieved relevant documents: +30% to +60%

  • Fast Statistical Parsing of Noun Phrases for Document Indexing
    Chengxiang Zhai
    Laboratory for Computational Linguistics, Carnegie Mellon University
    (ANLP97, pp. 312-319)

  • Phrases for Document Indexing
    Indexing by single words
      single words are often ambiguous and not specific enough for accurate discrimination of documents
      e.g., "bank terminology" vs. "terminology bank"
    Indexing by phrases
      syntactic phrases are almost always more specific than single words
    Indexing by both single words and phrases

  • No significant improvement?
    Fagan, Joel L., Experiments in Automatic Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-syntactic Methods, Ph.D. thesis, Cornell University, 1987.
    Lewis, D., Representation and Learning in Information Retrieval, Ph.D. thesis, University of Massachusetts, 1991.
    Many syntactic phrases have very low frequency and tend to be over-weighted by normal weighting methods.

  • Author's points
    A larger document collection may increase the frequency of most phrases, and thus alleviate the problem of low frequency.
    The phrases are used only to supplement, not replace, the single words for indexing.
    The new issue: how to parse gigabytes of text in practically feasible time (133MHz DEC Alpha workstation, 8 hours/GB, 20 hours of training with 1GB of text).

  • Experiment Design
    CLARIT commercial retrieval system:
    {original document set} --> CLARIT NP Extractor --> {raw noun phrases} --> Statistical NP Parser, Phrase Extractor --> {indexing term set} --> CLARIT Retrieval Engine

  • Different Indexing Units
    Example: [[[heavy construction] industry] group] (WSJ90)
    single words: heavy, construction, industry, group
    head-modifier pairs: heavy construction, construction industry, industry group
    full noun phrase: heavy construction industry group

  • Different Indexing Units (Continued)
    WD-SET: single words only (no phrases; baseline)
    WD-HM-SET: single words + head-modifier pairs
    WD-NP-SET: single words + full NPs
    WD-HM-NP-SET: single words + head-modifier pairs + full NPs

  • Result Analysis
    Collection: TIPSTER Disk 2 (250MB)
    Queries: TREC-5 ad hoc topics (251-300)
    Relevance feedback: top 10 documents returned from the initial retrieval
    Evaluation measures:
      total number of relevant documents retrieved
      highest level of precision over all the points of recall (initial precision)
      average precision

  • Effects of phrases (with feedback, TREC-5)

    Experiment        Recall        Init Prec   Avg Prec
    WD-SET            0.56 (597)    0.4546      0.2208
    WD-HM-SET         0.60 (638)    0.5162      0.2402
      inc over WD-SET   +7%           +14%        +9%
    WD-NP-SET         0.58 (613)    0.5373      0.2564
      inc over WD-SET   +4%           +18%        +16%
    WD-HM-NP-SET      0.63 (666)    0.4747      0.2285
      inc over WD-SET   +13%          (-4%)       (-3%)

    Total relevant documents: 1064

  • Summary
    When only one kind of phrase is used to supplement the single words, each can lead to a great improvement in precision.
    When the two kinds of phrases are combined, the effect is a greater improvement in recall rather than precision.
    How to combine and weight different phrases effectively becomes an important issue.

  • A Corpus-Based Statistical Approach to Automatic Book Indexing
    Jyun-Sheng Chang, Tsung-Yih Tseng, Ying Cheng, Huey-Chyun Chen, Shun-Der Cheng, Sur-Jin Ker, and John S. Liu
    (ANLP92, pp. 147-151)

  • Generating Indices
    1. Word segmentation
    2. Part-of-speech tagging
    3. Finding noun phrases

  • Example of Problem Description
    Segmentation, tagging, and noun-phrase finding applied to four Chinese sentences [Chinese characters omitted]; the resulting tag sequences:
      P/Q/CL/LOC/CTM/NC/NC
      P/D/Q/CL/NC/LOC/LOC/LOC/V/CTM/NC
      NP/ADV/V/NC/CTM/NC
      P/NC/CTM/NC/CTM/NC/NC

  • Word Segmentation
    Given a Chinese sentence, segment the sentence into words.
    [Chinese example sentence omitted]

  • Segmentation as a Constraint Satisfaction Problem
    Given a sequence of Chinese characters C1, C2, ..., Cn, assign break/continue to each place Xi between two adjacent characters Ci and Ci+1.
    Notation: break = ">", continue = "="
    [Example label sequence omitted]

  • Detail specification
    For each sequence of characters Ci, ..., Cj which is a Chinese word in the dictionary or a surname-name:
      if j = i, then put (>, >) in Ki-1,i
      if j > i, then put (>, =) in Ki-1,i, and (=, =) in Ki,i+1, ..., Kj-1,j

  • Example
    A 13-character sequence; dictionary lookup of candidate words [Chinese characters omitted] yields the candidate (break, continue) label pairs:
      K0: (>,>)         K7:  (>,>), (=,>), (>,=)
      K1: (>,=)         K8:  (>,>), (=,>), (>,=)
      K2: (=,>), (=,=)  K9:  (>,>), (=,>), (>,=)
      K3: (=,>)         K10: (>,>), (=,>), (>,=)
      K4: (>,>)         K11: (>,>)
      K5: (>,>)         K12: (>,>), (>,=)
      K6: (>,>), (>,=)  K13: (=,>)

  • Differences from English IR
    Data analysis issues
      media: syllable structure in speech data
      code and character: GB / BIG5 conversion; rigid semantics in characters
      word: simple in word stemming and spelling, hard in word segmentation and proper-noun identification
    (Dr. L.F. Chien, 1996)

  • Differences from English IR (Continued)
    Interface issues
      input: eager for speech and OCR input
      query: need searching for approximate terms; rigid information in natural-language queries (NLQ); hard to find proper nouns in NLQ

  • Differences from English IR (Continued)
    Indexing and searching issues
      index: hard to use a word-level, complete index like an inverted file; workable to use a character-level, filtering index like signatures (Chien, 1995)
      searching: needs multiple-stage searching; needs best match at the term level

  • Segmentation Problem
    Segmentation is a serious problem in processing Chinese sentences. (Hsin-Hsi Chen, 1996)

  • Strategies
    Dictionary lookup supplemented by other special strategies:
      the-longest-word-the-first
      the number of words
      ...
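The longest-word-first strategy can be sketched as greedy maximum matching against a dictionary; since the slides' Chinese examples were not preserved, the lexicon below uses placeholder letters standing in for characters:

```python
# Greedy longest-match-first ("the-longest-word-the-first") segmentation.
def segment(chars, lexicon, max_len=4):
    words, i = [], 0
    while i < len(chars):
        # Try the longest dictionary match starting at position i;
        # fall back to a single character if nothing matches.
        for k in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(chars[i : i + k])
            if k == 1 or candidate in lexicon:
                words.append(candidate)
                i += k
                break
    return words

lexicon = {"AB", "ABC", "CD"}
print(segment(list("ABCDE"), lexicon))  # ['ABC', 'D', 'E']
```

Note the known weakness of this heuristic: a greedy longest match at one position can block a better segmentation later, which is one reason the slides mention supplementary strategies such as minimizing the number of words.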