140
BIBLIOGRAPHY 129 Y AROWSKY ,DAVID. 1992. Word sense disambiguation using statistical models of Roget’s categories trained on large corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, 454–460, Nantes, France. YOUMANS,GILBERT. 1991. A new tool for discourse analysis: The vocabulary-management profile. Language 67.763–789.

BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 129

YAROWSKY, DAVID . 1992. Word sensedisambiguationusingstatisticalmodelsof Roget’scategoriestrainedon large corpora. In Proceedings of the Fourteenth InternationalConference on Computational Linguistics, 454–460,Nantes,France.

YOUMANS, GILBERT. 1991.A new tool for discourseanalysis:Thevocabulary-managementprofile. Language 67.763–789.

Page 2: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 128

THOMPSON, R. H., & B. W. CROFT. 1989. Supportfor browsing in an intelligent textretrieval system.International Journal of Man [sic] -Machine Studies 30.639–668.

TOMBAUGH, J.,A. LICKORISH, & P. WRIGHT. 1987.Multi-window displaysfor readersoflengthytexts. International Journal of Man [sic] -Machine Studies 26.597–615.

TUFTE, EDWARD. 1983. The visual display of quantitative information. Chelshire,CT:GraphicsPress.

VAN DIJK, TEUN A. 1980.Macrostructures. Hillsdale,N.J.: LawrenceErlbaumAssociates.

——. 1981.Studies in the pragmatics of discourse. TheHague:MoutonPublishers.

VAN RIJSBERGEN, C. J. 1979.Information retrieval. London:Butterworths.

VOORHEES, ELLEN M. 1985. The cluster hypothesisrevisited. In Proceedings ofACM/SIGIR, 188–196.

WALKER, MARILYN. 1991.Redundancy in collaborativedialogue.In AAAI Fall Symposiumon Discourse Structure in Natural Language Understanding and Generation, ed.byJuliaHirschberg, DianeLitman,KathyMcCoy, & CandySidner, PacificGrove,CA.

WANG, MICHELLE Q.,& JULIA HIRSCHBERG. 1992.Automaticclassificationof intonationalphraseboundaries.Computer Speech and Language 6.175–196.

WEBBER, BONNIE LYNN. 1987.Theinterpretationof tensein discourse.In Proceedings ofthe 25th Annual Meeting of the Association for Computational Linguistics, 147–154.

WILENSKY, ROBERT. 1981. Meta-planning: representingand using knowledgeaboutplanningin problemsolvingandnaturallanguageunderstanding.Cognitive Science5.197–235.

——. 1983a.Planning and understanding. Reading,MA: Addison-Wesley.

——. 1983b. Storygrammarsvs.storypoints.The Behavior and Brain Sciences 6.

——, YIGAL ARENS, & DAVID N. CHIN. 1984. Talking to UNIX in English: An overviewof UC. Communications of the ACM 27.

WILKS, YORICK. 1975.An intelligentanalyzerandunderstanderof English.Communica-tions of the ACM 18.264–274.

WILKS, YORICK A., DAN C. FASS, CHENG MING GUO, JAMES E.MCDONALD , TONY PLATE,& BRIAN M. SLATOR. 1990.Providing machinetractabledictionarytools. Journal ofComputers and Translation 2.

WINOGRAD, TERRY. 1972. Understanding natural language. New York, NY: AcademicPress.

WU, SUN, & UDI MANBER. 1992. Agrep– a fastapproximatepattern-matchingtool. InProceedings of the Winter 1992 USENIX Conference, 153–162,SanFrancisco,CA.

Page 3: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 127

SIBUN, PENELOPE. 1992. Generatingtext without trees. Computational Intelligence:Special Issue on Natural Language Generation 8.102–122.

SIDNER, CANDACE L. 1983. Focusingin the comprehensionof definite anaphora. InComputational models of discourse, ed. by Michael Brady & RobertC. Berwick,267–330.Cambridge,MA: MIT Press.

SKOROCHOD’KO, E.F. 1972. Adaptive methodof automaticabstractingandindexing. InInformation Processing 71: Proceedings of the IFIP Congress 71, ed.byC.V. Freiman,1179–1182.North-HollandPublishingCompany.

SMADJA, FRANK A., & KATHLEEN R. MCKEOWN. 1990. Automaticallyextracting andrepresentingcollocationsfor languagegeneration.In Proceedings of the 28th AnnualMeeting of the Association for Computational Linguistics, 252–259.

SPARCK-JONES, KAREN. 1971.Automatic keyword classification for information retrieval.London:Butterworth& Co.

——. 1986.Synonymy and semantic classification. Edinburgh: EdinburghUniversityPress.

SPOERRI, ANSELM. 1993. InfoCrystal: A visual tool for informationretrieval & manage-ment. In Proceedings of Information Knowledge and Management ’93, Washington,D.C.

STANFILL , CRAIG, & DAVID L. WALTZ. 1992.Statisticalmethods,artificial intelligence,andinformationretrieval. In Text-based intelligent systems: Current research and practicein information extraction and retrieval, ed. by Paul S. Jacobs,215–226.LawrenceErlbaumAssociates.

STARK, HEATHER. 1988.Whatdoparagraphmarkersdo? Discourse Processes 11.275–304.

STODDARD, SALLY. 1991.Text and texture: Patterns of cohesion, volumeXL of Advancesin Discourse Processes. Norwood,NJ:Ablex PublishingCorporation.

STONEBRAKER,MICHAEL, & G.KEMNITZ. 1991.ThePOSTGRESnext-generationdatabasemanagementsystem.Communications of the ACM 34.78–92.

SUNDHEIM, BETH. 1990. Secondmessageunderstandingconference(MUC-II).TechnicalReport1328,Naval OceanSystemsCenter, SanDiego,CA.

SVENONIUS, ELAINE. 1986.Unansweredquestionsin thedesignof controlledvocabularies.Journal of the American Society for Information Science 37.331–340.

TANNEN, DEBORAH. 1984.Conversational style: Analyzing talk among friends. Norwood,NJ:Ablex.

——. 1989.Talking voices: Repetition, dialogue, and imagery in conversational discourse.Studiesin InteractionalSociolinguistics6. CambridgeUniversityPress.

TENOPIR, CAROL, & JUNG SOON RO. 1990. Full text databases. New Directions inInformationManagement.GreenwoodPress.

Page 4: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 126

RUS, DANIELA , & DEVIKA SUBRAMANIAN . 1993. Multi-mediaRISSCinformatics: Re-trieving informationwith simplestructuralcomponents.In Proceedings of the ACMConference on Information and Knowledge Management (CIKM).

SALTON, GERARD (ed.) 1971. The Smart retrieval system – experiments in automaticdocument processing. EnglewoodClif fs, NJ: PrenticeHall.

——. 1972.Experimentsin automaticthesaurusconstructionfor informationretrieval. InInformation Processing 71, 115–123.NorthHollandPublishingCo.

——. 1988. Automatic text processing : the transformation, analysis, and retrieval ofinformation by computer. Reading,MA: Addison-Wesley.

——, JAMES ALLAN , & CHRIS BUCKLEY. 1993.Approachesto passageretrieval in full textinformationsystems.In Proceedings of the 16th Annual International ACM/SIGIRConference, 49–58,Pittsburgh,PA.

——, JAMES ALLAN , & CHRIS BUCKLEY. 1994.Automaticstructuringandretrievalof largetext files. Communications of the ACM 37.97–108.

——, & CHRIS BUCKLEY. 1990. Improving retrieval performanceby relevancefeedback.JASIS 41.288–297.

——, & CHRIS BUCKLEY. 1991. Automatictext structuringandretrieval: Experimentsinautomaticencyclopediasearching.In Proceedings of the 14th Annual InternationalACM/SIGIR Conference, 21–31.

——, & CHRISBUCKLEY. 1992.Automatictext structuringexperiments.In Text-based intel-ligent systems: Current research and practice in information extraction and retrieval,ed.by PaulS.Jacobs,199–209.LawrenceErlbaumAssociates.

SCHANK,ROGER, & R.ABELSON. 1977.Scripts, plans, goals, and understanding. Hillsdale,NJ:Erlbaum.

SCHIFFRIN, DEBORAH. 1987.Discourse markers. Cambridge:CambridgeUniversityPress.

SCHUTZE, HINRICH. 1993a.Part-of-speechinductionfrom scratch.In Proceedings of ACL31, OhioStateUniversity.

——. 1993b. Word space.In Advances in neural information processing systems 5, ed.by StephenJ. Hanson,JackD. Cowan, & C. Lee Giles. SanMateo CA: MorganKaufmann.

SENAY,HIKMET, & EVE IGNATIUS. 1990.Rulesandprinciplesof scientificdatavisualization.TechnicalReportGWU-IIST-90-13,Institutefor InformationScienceandTechnology,TheGeorgeWashingtonUniversity.

SHNEIDERMAN, BEN. 1987. Designing the user interface: strategies for effective human-computer interaction. Reading,MA: Addison-Wesley.

Page 5: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 125

PUSTEJOVSKY, JAMES. 1987. On theacquisitionof lexical entries:Theperceptualoriginof thematicrelations.Proceedings of the 25th Annual Meeting of the Association forComputational Linguistics .

RASKIN, VICTOR, & IRWIN WEISER. 1987. Language and writing: Applications of lin-guistics to rhetoric and composition. Norwood, New Jersey: ABLEX PublishingCorporation.

RESNIK, PHILIP. 1992. WordNetandDistributional Analysis: A Class-basedApproachto Lexical Discovery. In Statistically-based natural language programming tech-niques: Papers from the 1992 workshop, ed.by Carl Weir. Menlo Park, CA: AAAIPress,TechnicalReportW-92-01.

——, 1993. Selection and information: A class-based approach to lexical relationships.Universityof Pennsylvaniadissertation.(Institutefor Researchin Cognitive SciencereportIRCS-93-42).

RILOFF, ELLEN, & WENDY LEHNERT. 1992. Classifyingtexts usingrelevancy signatures.In Proceedings of the Tenth National Conference on Artificial Intelligence. AAAIPress/TheMIT Press.

RO, JUNG SOON. 1988a. An evaluationof the applicability of ranking algorithmstoimprove the effectivenessof full-text retrieval. I. On the effectivenessof full-textretrieval. Journal of the American Society for Information Science 39.73–78.

——. 1988b. An evaluationof the applicability of ranking algorithmsto improve theeffectivenessof full-text retrieval. II. On the effectivenessof rankingalgorithmsonfull-text retrieval. Journal of the American Society for Information Science 39.147–160.

ROBERTSON, GEORGEC., STUART K. CARD, & JOCK D. MACKINLAY. 1993. Informationvisualizationusing3D interactiveanimation.Communications of the ACM 36.56–71.

ROLLING, L. 1981. Indexing consistency, quality, andefficiency. Information Processingand Management 17.69–76.

ROSE, DANIEL E., & RICHARD K. BELEW. 1991. Towarda direct-manipulationinterfacefor conceptualinformationretrieval systems.In Interfaces for information retrievaland online systems, ed.by Martin Dillon, 39–54.New York, NY: GreenwoodPress.

ROTONDO, JOHN A. 1984. Clusteringanalysisof subjective partitionsof text. DiscourseProcesses 7.69–88.

RUGE, GERDA. 1991.Experimentsonlinguisticallybasedtermassociations.In Proceedingsof the RIAO, 528–545.

RUMELHART, DAVID . 1975. Noteson a schemafor stories.In Representation and under-standing, ed.by DanielG. Bobrow & Allan Collins,211–236.New York: AcademicPress.

Page 6: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 124

MITTENDORF, ELKE, & PETERSCHAUBLE. 1994.Passageretrievalbasedonhiddenmarkovmodels. In Proceedings of the 17th Annual International ACM/SIGIR Conference,Dublin, Ireland.To appear.

MOFFAT, ALISTAIR, RONSACKS-DAVIS, ROSSWILKINSON, & JUSTINZOBEL. 1994.Retrievalof partialdocuments.In Proceedings of TREC-2, ed.by DonnaHarman.To appear.

MOONEY, DAVID J.,M. SANDRA CARBERRY, & KATHLEENF. MCCOY. 1990.Thegenerationof high-level structurefor extendedexplanations. In Proceedings of the ThirteenthInternational Conference on Computational Linguistics, volume2,276–281,Helsinki.

MOORE, JOHANNA D., & MARTHA E. POLLACK. 1992. A problemfor RST:Theneedformulti-level discourseanalysis.Computational Linguistics 18.

MORRIS, JANE. 1988.Lexical cohesion,thethesaurus,andthestructureof text. TechnicalReportCSRI-219,ComputerSystemsResearchInstitute,Universityof Toronto.

——, & GRAEME HIRST. 1991. Lexical cohesioncomputedby thesauralrelationsasanindicatorof thestructureof text. Computational Linguistics 17.21–48.

NOREAULT, TERRY, MICHAEL MCGILL , & MATTHEW B. KOLL. 1981.A performanceeval-uationof similarity measures,documenttermweightingschemesandrepresentationsin aBooleanenvironment.In Information retrieval research, ed.by R. N. Oddy, S.E.Robertson,C. J.vanRijsbergen,& P. W. Williams, 57–76.London:Butterworths.

NORVIG, PETER, 1987. A unified theory of inference for text understanding. Universityof California, Berkeley dissertation. (ComputerScienceDivision ReportNumber87/339).

O’CONNOR, J. 1980. Answerpassageretrieval by text searching.Journal of the ASIS32.227–239.

OUSTERHOUT, JOHN. 1991. An X11 toolkit basedon theTcl language.In Proceedings ofthe Winter 1991 USENIX Conference, 105–115,Dallas,TX.

PAICE, CHRIS D. 1990. Constructingliteratureabstractsby computer: Techniquesandprospects.Information Processing and Management 26.171–186.

PASSONNEAU, REBECCA J., & DIANE J. LITMAN . 1993. Intention-basedsegmentation:Humanreliability and correlationwith linguistic cues. In Proceedings of the 31stAnnual Meeting of the Association for Computational Linguistics, 148–155.

PEAT, HELEN J.,& PETER WILLETT. 1991. The limitationsof termco-occurencedataforqueryexpansionin documentretrieval systems.JASIS 42.378–383.

PHILLIPS, MARTIN. 1985.Aspects of text structure: An investigation of the lexical organi-sation of text. Amsterdam:North-Holland.

POLLOCK, J.J.,& A. ZAMORA. 1975.AutomaticabstractingresearchatChemicalAbstractsService.Journal of Chemical Information and Computer Sciences 15.226–233.

Page 7: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 123

MACKINLAY, JOCK, 1986.Automatic design of graphical presentations. StanfordUniversitydissertation.TechnicalReportStan-CS-86-1038.

MANBER, UDI, & SUN WU. 1994.GLIMPSE:a tool to searchthroughentirefile systems.In Proceedings of the Winter 1994 USENIX Conference, 23–31,SanFrancisco,CA.

MANN, WILLIAM C., & SANDRA A. THOMPSON. 1987. Rhetoricalstructuretheory: Atheoryof text organization.TechnicalReportISI/RS87-190,ISI.

MANNING, CHRISTOPHER D. 1993. Automaticacquisitionof a large subcategorizationdictionaryfrom corpora.In Proceedings of the 31st Annual Meeting of the Associationfor Computational Lingusitics, 235–242,Columbus,OH.

MARCHIONINI, GARY, PETER LIEBSCHER, & XIA LIN. 1991. Authoringhyperdocuments:Designingfor interaction.In Interfaces for information retrieval and online systems,ed.by Martin Dillon, 119–131.New York,NY: GreenwoodPress.

MARKEY, KAREN, PAULINE ATHERTON, & CLAUDIA NEWTON. 1982. An analysisofcontrolledvocabulary and free text searchstatementsin online searches. OnlineReview 4.225–236.

MARKOWITZ, JUDITH, THOMAS AHLSWEDE, & MARTHA EVENS. 1986. Semanticallysignificantpatternsin dictionarydefinitions.Proceedings of the 24th Annual Meetingof the Association for Computational Linguistics 112–119.

MARTIN, JAMES H. 1990. A computational model of metaphor interpretation. Boston:AcademicPress.

MASAND, BRIJ, GORDON LINOFF, & DAVID WALTZ. 1992. Classifyingnews storiesusingmemorybasedreasoning.In Proceedings of ACM/SIGIR, 59–65.

MAULDIN, MICHAEL L., 1989. Information retrieval by text skimming. Pittsburg, PA:CarnegieMellon Universitydissertation.

——. 1991. Retrieval performancein ferret. In Proceedings of ACM/SIGIR, 347–355,Chicago,IL.

MCCUNE, B., R. TONG, J.S.DEAN, & D. SHAPIRO. 1985.Rubric: A systemfor rule-basedinformationretrieval. IEEE Transactions on Software Engineering 11.

MICHARD, A. 1982. Graphicalpresentationof Booleanexpressionsin a databasequerylanguage:designnotesand an ergonomicevaluation. Behaviour and InformationTechnology 1.

MILLER, GEORGE A., RICHARD BECKWITH, CHRISTIANE FELLBAUM, DEREK GROSS, &KATHERINE J. MILLER. 1990. Introductionto WordNet: An on-linelexical database.Journal of Lexicography 3.235–244.

MINSKY, MARVIN . 1975. A framework for representingknowledge.In The psychology ofcomputer vision, ed.by PatrickWinston.McGraw-Hill.

Page 8: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 122

KOZIMA, HIDEKI. 1993. Text segmentationbasedon similarity betweenwords. In Pro-ceedings of the 31th Annual Meeting of the Association for Computational Linguistics,286–288,Columbus,OH.

KUNO, SUSUMO. 1972. Functionalsentenceperspective: A casestudyfrom JapaneseandEnglish.Linguistic Inquiry 3.269–320.

KUPIEC, JULIAN . 1993.MURAX: A robustlinguisticapproachfor questionansweringusinganon-lineencyclopedia.In Proceedings of the 16th Annual International ACM/SIGIRConference, 181–190,Pittsburgh,PA.

LAKOFF, GEORGEP. 1972.Structuralcomplexity in fairy tales.The Study of Man 1.128–150.

LAMBERT, LYNN, & SANDRA CARBERRY. 1991.A tripartiteplan-basedmodelof dialogue.In Proceedings of the 29th Annual Meeting of the Association for ComputationalLinguistics, 47–54.

LANCASTER, F. 1986.Vocabulary control for information retrieval, second edition. Arling-ton,VA: InformationResources.

LARSON, RAY R. 1991. Classificationclustering,probabilisticinformationretrieval, andtheonlinecatalog.The Library Quarterly 61.133–173.

——. 1992. Experimentsin automaticlibrary of congressclassification.Journal of theAmerican Society for Information Science 43.130–148.

LEWIS, DAVID D. 1992. An evaluationof phrasalandclusteredrepresentationson a textcategorizationtask. In Proceedings of the 15th Annual International ACM/SIGIRConference, 37–50,Copenhagen.

LIDDY, ELIZABETH. 1991. The discourselevel structureof empiricalabstracts– an ex-ploratorystudy. Information Processing and Management 27.55–81.

LIDDY, ELIZABETH D., & S. MYAENG. 1993. DR-LINK’ s linguistic-conceptualapproachto documentdetection.In The first text retrieval conference (TREC-1), ed.by DonnaHarman,113–129.NIST SpecialPublication500-207.

——, & WOOJINPAIK . 1992.Statistically-guidedwordsensedisambiguation.In Proceed-ings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language.

LONGACRE, R. E. 1979. Theparagraphasa grammaticalunit. In Syntax and semantics:Discourse and syntax, ed.by Talmy Givon,volume12,115–134.AcademicPress.

LUPERFOY, SUSANN. 1992. The representationof multimodal user interfacedialoguesusing discoursepegs. In Proceedings of the 30th Meeting of the Association forComputational Linguistics, 22–31.

LYNCH, CLIFFORD. 1992.Thenext generationof publicaccessinformationretrievalsystemsfor researchlibraries – lessonsfrom 10 yearsof the melvyl system. InformationTechnology and Libraries 11.405–415.

Page 9: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 121

HINDS, JOHN. 1979. Organizationalpatternsin discourse. In Syntax and semantics:Discourse and syntax, ed.by Talmy Givon,volume12,135–158.AcademicPress.

HIRSCHBERG, JULIA , & DIANE LITMAN . 1993.Empiricalstudieson thedisambiguationofcuephrases.Computational Linguistics 19.501–530.

HOBBS, JERRY. 1978.Resolvingpronounreferences.Lingua 44.311–338.

HOVY, ED. 1990. Parsimoniousandprofligateapproachesto the questionof discoursestructurerelations.In 5th ACL Workshop on Natural Language Generation, Dawson,Pennsylvania.

HWANG, CHUNG HEE, & LENHART K. SCHUBERT. 1992.Tensetreesasthe’fine structure’of discourse.In Proceedings of the 30th Meeting of the Association for ComputationalLinguistics, 232–240.

JACOBS, PAUL. 1993.Usingstatisticalmethodsto improveknowledge-basednewscatego-rization. IEEE Expert 8.13–23.

——, & LISA RAU. 1990.SCISOR:Extractinginformationfrom On-LineNews. Commu-nications of the ACM 33.88–97.

JURAFSKY, DANIEL, 1992. An on-line computational model of human sentence interpre-tation: A theory of the representation and use of linguistic knowledge. Universityof Californiaat Berkeley dissertation.(ComputerScienceDivision ReportNumber92/676).

JUSTESON, J.S.,& S.M. KATZ. 1991.Co-occurrencesof antonymousadjectivesandtheircontexts. Computational Lingustics 17.1–19.

KAHLE, BREWSTER, & ART MEDLAR. 1991. An informationsystemfor corporateusers:Wide areainformationservers. TechnicalReportTMC199,Thinking MachinesCor-poration.

KEEN, E. MICHAEL. 1991. Theuseof termpositiondevicesin rankedoutputexperiment.Journal of Documentation 47.1–22.

——. 1992. Term positionranking: somenew testresults. In Proceedings of the 15thAnnual International ACM/SIGIR Conference, 66–76,Copenhagen,Denmark.

KOLODNER, JANET L. 1983. Maintainingorganizationin a dynamiclong-termmemory.Cognitive Science 7.243–280.

KORFHAGE, ROBERT R. 1991. To seeor not to see– is thatthequery? In Proceedings ofthe 14th Annual International ACM/SIGIR Conference, 134–141,Chicago.

KOSSLYN, S., S. PINKER, W. SIMCOX, & L. PARKIN . 1983. Understanding charts andgraphs: A project in applied cognitive science. NationalInstituteof Education.ED1.310/2:238687.

Page 10: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 120

GREFENSTETTE, G. 1992. A new knowledge-poortechniquefor knowledgeextractionfrom large corpora. In Proceedings of the 15th Annual International ACM/SIGIRConference, Copenhagen,Denmark.ACM.

GRIFFITHS, ALAN , H. CLAIRE LUCKHURST, & PETERWILLETT. 1986.Usinginterdocumentsimilarity informationin documentretrievalsystems.Journal of the American Societyfor Information Science 37.3–11.

GRIMES, J. 1975.The thread of discourse. TheHague:Mouton.

GROSZ, BARBARA J. 1986.Therepresentationanduseof focusin asystemfor understand-ing dialogs. In Readings in natural language processing, ed. by BarbaraJ. Grosz,KarenSparckJones,& BonnieLynn Webber, 353–362.MorganKaufmann.

——, & CANDACE L. SIDNER. 1986. Attention, intention,andthestructureof discourse.Computational Linguistics 12.172–204.

HAHN, UDO. 1990.Topicparsing:Accountingfor text macrostructuresin full-text analysis.Information Processing and Management 26.135–170.

HALLID AY, M. A. K., & R. HASAN. 1976.Cohesion in English. London:Longman.

HARDT, DANIEL. 1992. An algorithmfor VP ellipsis. In Proceedings of the 30th Meetingof the Association for Computational Linguistics, 9–14.

HARMAN , DONNA. 1993.Overview of thefirst Text REtrieval Conference.In Proceedingsof the 16th Annual International ACM/SIGIR Conference, 36–48,Pittsburgh,PA.

HAYES, PHILLIP J. 1992. Intelligenthigh-volumetext processingusingshallow, domain-specifictechniques.In Text-based intelligent systems: Current research and practicein information extraction and retrieval, ed. by Paul S. Jacobs,227–242.LawrenceErlbaumAssociates.

HEARST, MARTI A. 1992.Automaticacquisitionof hyponymsfrom largetext corpora.InProceedings of the Fourteenth International Conference on Computational Linguistics,539–545,Nantes,France.

——. 1993. TextTiling: A quantitative approachto discoursesegmentation. Techni-cal ReportSequoia93/24,ComputerScienceDepartment,University of California,Berkeley.

——, & CHRISTIAN PLAUNT. 1993. Subtopicstructuringfor full-length documentaccess.In Proceedings of the 16th Annual International ACM/SIGIR Conference, 59–68,Pittsburgh,PA.

——, & HINRICH SCHUTZE. 1993. Customizinga lexicon to bettersuit a computationaltask. In Proceedings of the SIGLEX Workshop on Acquisition of Lexical Knowledgefrom Text, 55–69,Columbus,OH.

HENZLER, ROLF G. 1978. Freeor controlledvocabularies: Somestatisticaluser-orientedevaluationsof biomedicalinformationsystems.International Classification 5.21–26.

Page 11: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 119

FARLEY, LAINE. 1989. Dissectingslow searches.University of California Division ofLibrary Automation Bulletin (DLA) 9.

FILLMORE, CHARLES J. 1981. Pragmaticsandthe descriptionof discourse.In Radicalpragmatics, ed.by PeterCole.New York: AcademicPressInc.

FISHER, DAVID , 1994. Topic characterizationof full lengthtexts usingdirectandindirecttermevidence.MastersReport,Universityof California,Berkeley, to appear.

FOWLER, RICHARD H., WENDY A. L. FOWLER, & BRADLEY A. WILSON. 1991.Integratingquery, thesaurus,anddocumentsthroughacommonvisualrepresentation.In Proceed-ings of the 14th Annual International ACM/SIGIR Conference, 142–151,Chicago.

FOX, EDWARD A., & MATTHEWB.KOLL. 1988.PracticalenhancedBooleanretrieval: Expe-rienceswith theSMART andSIREsystems.Information Processing and Management24.

FRUCHTERMANN, T., & E.RHEINGOLD. 1990.Graphdrawing by force-directedplacement.TechnicalReportUIUCDCS-R-90-1609,Departmentof ComputerScience,Universityof Illinois, Urbana-Champagne,Ill.

FUHR, NORBERT, & CHRIS BUCKLEY. 1993. Optimizing documentindexing andsearchterm weightingbasedon probabilisticmodels. In The first text retrieval conference(TREC-1), ed.by DonnaHarman,89–100.NIST SpecialPublication500-207.

FULLER, MICHAEL, ERIC MACKIE, RON SACKS-DAVIS, & ROSSWILKINSON. 1993.Coherentanswersfor a largestructureddocumentcollection.In Proceedings of the 16th AnnualInternational ACM/SIGIR Conference, 204–213,Pittsburgh,PA.

FUNG, ROBERT M., STUART L. CRAWFORD, LEE A. APPELBAUM, & RICHARD M. TONG.1990. An architecturefor probabilisticconcept-basedinformationretrieval. In Pro-ceedings of the 13th International ACM/SIGIR Conference, 455–467.

GALE, WILLIAM A., KENNETH W. CHURCH, & DAVID YAROWSKY. 1992a. Estimatingupperandlowerboundson theperformanceof word-sensedisambiguationprograms.In Proceedings of the 30th Meeting of the Association for Computational Linguistics,249–256.

——, KENNETH W. CHURCH, & DAVID YAROWSKY. 1992b. A methodfor disambiguatingwordsensesin a largecorpus.Computers and the Humanities 5-6.415–439.

——, KENNETH W. CHURCH, & DAVID YAROWSKY. 1992c. Onesenseper discourse.InProceedings of the DARPA Speech and Natural Language Workshop.

GIRILL , T. R. 1991.Informationchunkingasaninterfacedesignissuefor full-text databases.In Interfaces for information retrieval and online systems, ed.by Martin Dillon, 149–158.New York, NY: GreenwoodPress.

Page 12: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 118

COOPER, WILLIAM S. 1969.Is interindexerconsistency ahobgoblin?American Documen-tation 20.268–278.

——, FREDRIC C. GEY, & AITOA CHEN. 1994. Probabilisticretrieval in the TIPSTERcollections:An applicationof stagedlogistic regression.In Proceedings of TREC-2,ed.by DonnaHarman.

CROFT, W. BRUCE, ROBERT KROVETZ, & H. TURTLE. 1990.Interactiveretrievalof complexdocuments.Information Processing and Management 26.593–616.

——, & HOWARD R. TURTLE. 1992.Text retrieval andinference.In Text-based intelligentsystems: Current research and practice in information extraction and retrieval, ed.byPaulS.Jacobs,127–156.LawrenceErlbaumAssociates.

CROUCH, C. J. 1990. An approachto the automaticconstructionof global thesauri.Information Processing and Management 26.629–640.

CUTTING, DOUGLAS R., DAVID KARGER, & JAN PEDERSEN. 1993. Constantinteraction-time Scatter/Gatherbrowsingof very largedocumentcollections. In Proceedings ofthe 16th Annual International ACM/SIGIR Conference, 126–135,Pittsburgh,PA.

——, JAN O. PEDERSEN, DAVID KARGER, & JOHN W. TUKEY. 1992. Scatter/Gather:Acluster-basedapproachto browsinglargedocumentcollections.In Proceedings of the15th Annual International ACM/SIGIR Conference, 318–329,Copenhagen,Denmark.

CUTTING, DOUGLASS R., JAN O. PEDERSEN, PER-KRISTIAN HALVORSEN, & MEG WITH-GOTT. 1990.Informationtheaterversusinformationrefinery. In AAAI Spring Sympo-sium on Text-based Intelligent Systems, ed.by Paul S.Jacobs.

DAGAN, IDO, SHAUL MARCUS, & SHAUL MARKOVITCH. 1993.Contextualwordsimilarityandestimationfrom sparsedata. In Proceedings of the 31th Annual Meeting of theAssociation for Computational Linguistics, 164–171.

DALRYMPLE, MARY, STUART M. SHEIBER, & F PEREIRA. 1991. Ellipsis andhigher-orderunification.Linguistics and Philosophy 14.399–452.

DAVIS, JAMES R. 1994.A server for adistributeddigital technicalreportlibrary. TechnicalReportComputerScienceReportNumber94-1418,CornellUniversity.

DE TOCQUEVILLE, ALEXIS. 1835. Democracy in America, Volume I. London: SaundersandOtley.

DEERWESTER, SCOTT, SUSAN T. DUMAIS, GEORGEW. FURNAS, THOMAS K. LANDAUER,& RICHARD HARSHMAN. 1990. Indexing by latentsemanticanalysis.Journal of theAmerican Society for Information Science 41.391–407.

DEJONG, GERALD F. 1982. An overview of the FRUMP system. In Strategies for nat-ural language processing, ed. by WendyG. Lehnert& Martin H. Ringle, 149–176.Hillsdale: Erlbaum.

Page 13: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

BIBLIOGRAPHY 117

BRENT, MICHAEL R. 1991. Automaticacquisitionof subcategorizationframesfrom un-tagged,free-text corpora.In Proceedings of the 29th Annual Meeting of the Associationfor Computational Linguistics.

BROWN, GILLIAN , & GEORGEYULE. 1983. Discourse analysis. CambridgeTextbooksinLinguisticsSeries.CambridgeUniversityPress.

BUCKLEY, CHRIS, JAMES ALLAN , & GERARDSALTON. 1994.Automaticroutingandad-hocretrieval usingSMART: TREC2. In Proceedings of TREC-2, ed.by DonnaHarman.To appear.

CALZOLARI , NICOLETTA, & REMO BINDI. 1990.Acquisitionof lexical informationfrom alargetextual italiancorpus.In Proceedings of the Thirteenth International Conferenceon Computational Linguistics, Helsinki.

CARDIE, CLAIRE. 1992. Corpus-basedacquisitionof relative pronoundisambiguationheurisitics.In Proceedings of the 30th Meeting of the Association for ComputationalLinguistics, 216–223.

CHAFE, WALLA CE L. 1979. The flow of thoughtand the flow of language. In Syntaxand semantics: Discourse and syntax, ed. by Talmy Givon, volume 12, 159–182.AcademicPress.

CHALMERS, MATTHEW, & PAUL CHITSON. 1992. Bead: Explorationin informationvisu-alization. In Proceedings of the 15th Annual International ACM/SIGIR Conference,330–337,Copenhagen,Denmark.

CHARNIAK , EUGENE. 1983.Passingmarkers:a theoryof contextual influencein languagecomprehension.Cognitive Science 7.171–190.

CHEN, FRANCINER.,& MARGARETWITHGOTT. 1992.Theuseof emphasistoautomaticallysummarizeaspokendiscourse.In Proceedings of ICASSP.

CHURCH, KENNETH W., & PATRICK HANKS. 1989. Word associationnorms, mutualinformation,and lexicography. In Proceedings of the 27th Annual Meeting of theAssociation for Computational Linguistics, 76–83.

——, & PATRICK HANKS. 1990.Wordassociationnorms,mutualinformation,andlexicog-raphy. American Journal of Computational Linguistics 16.22–29.

——, & MARK Y. LIBERMAN. 1991.A statusreporton theACL/DCI. In The Proceedingsof the 7th Annual Conference of the UW Centre for the New OED and Text Research:Using Corpora, 84–91,Oxford.

——, & ROBERT L. MERCER. 1993. Introductionto the specialissueon computationallinguisticsusinglargecorpora.Computational Linguistics 19.1–24.

COCHRAN, W. G. 1950. Thecomparisonof percentagesin matchedsamples.Biometrika37.256–266.

Page 14: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

116

Bibliography

ABOUD, M., C. CHRISMENT, R. RAZOUK, & F. SEDES. 1993. Querying a hypertextinformationretrieval systemby theuseof classification.Information Processing andManagement 29.387–396.

AL-HAWAMDEH, S.,R. DEVERE, G. SMITH, & P. WILLETT. 1991. Usingnearest-neighborsearchingtechniquesto accessfull-text documents.Online Review 15.173–191.

ALSHAWI, HIYAN. 1987.Processingdictionarydefinitionswith phrasalpatternhierarchies.American Journal of Computational Linguistics 13.195–202.

ALTERMAN, RICHARD, & LARRY A. BOOKMAN. 1990.Somecomputationalexperimentsinsummarization.Discourse Processes 13.143–174.

AMIR, ELAN , 1993. Carta: A network topologypresentationtool. ProjectReport,UCBerkeley.

ARENTS, H. C., & W. F. L. BOGAERTS. 1993. Concept-basedretrieval of hypermediainformation– from termindexing to semantichyperindexing. Information Processingand Management 29.373–386.

BACHENKO, JOAN, EILEEN FITZPATRICK, & C.E.WRIGHT. 1986.Thecontributionof parsingto prosodicphrasingin anexperimentaltext-to-speechsystem.In Proceedings of the24th Annual Meeting of the Association for Computational Linguistics, 145–155.

BAREISS, RAY. 1989. Exemplar-based knowledge acquisition. Perspectivesin ArtificialIntelligence.AcademicPress,Inc.

BATALI , JOHN, 1991. Automatic acquisition and use of some of the knowledge in physicstexts. MassachusettsInstituteof Technology, Artificial IntelligenceLaboratorydisser-tation.

BATES, MARCIA J. 1986.Subjectaccessin onlinecatalogsa designmodel. Journal of theAmerican Society for Information Science 37.

BELL, JOHN E., & LAWRENCE A. ROWE. 1990. Humanfactorsevaluationof a textual,graphical,and natural languagequery interfaces. TechnicalReport M90/12, UCBerkeley ERL.

BERTIN, JACQUES. 1983.Semiology of graphics. Madison,WI: TheUniversityof WisconsinPress.Translatedby William J.Berg.

Page 15: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

APPENDIXA. TOCQUEVILLE,CHAPTER1 115

observablein savagelife: the Indians,althoughtheyareignorantandpoor, areequalandfree.

24 WhenEuropeansfirst cameamongthem,thenativesof NorthAmericawere ignorantof the valueof riches,and indifferent tothe enjoymentsthat civilized man procuresfor himself by theirmeans.Neverthelesstherewasnothingcoarsein their demeanor;theypracticedhabitualreserveandakindof aristocraticpoliteness.

25 Mild andhospitablewhenat peace,thoughmercilessin warbeyondany known degreeof humanferocity, the Indian wouldexposehimself to die of hungerin order to succorthe strangerwho askedadmittanceby night at thedoorof hishut; yethecouldtearin pieceswith this handsthe still quiveringlimbs of his pris-oner. The famousrepublicsof antiquity never gave examplesofmoreunshakencourage,morehaughtyspirit, or moreintractableloveof independencethanwerehiddenin formertimesamongthewild forestsof theNew World. TheEuropeansproducedno greatimpressionwhenthey landeduponthe shoresof North America;their presenceengenderedneitherenvy nor fear. What influencecouldtheypossessoversuchmenasI havedescribed?TheIndiancould live without wants,suffer without complaint,andpour outhisdeath-songatthestake.Like all theothermembersof thegreathumanfamily, thesesavagesbelieved in the existenceof a betterworld, and adored,underdifferent names,God, the Creator, oftheuniverse.Their notionson thegreatintellectualtruthswereingeneralsimpleandphilosophical.

DOC SEGMENT1026 Although we have heretracedthe characterof a primitive

people,yetit cannotbedoubtedthatanotherpeople,morecivilizedand more advancedin all respects,had precededit in the sameregions.

27 An obscuretraditionwhichprevailedamongtheIndiansonthebordersof the Atlantic informs us that thesevery tribesformerlydwelt on thewestsideof theMississippi.Along the banksof theOhio,andthroughoutthecentralvalley, therearefrequentlyfound,at thisday, tumuli raisedby thehandsof men.On exploringtheseheapsof earthto theircenter, it is usualto meetwith humanbones,strangeinstruments,armsandutensilsof all kinds,madeof metal,anddestinedfor purposesunknownto thepresentrace.TheIndiansof ourtimeareunabletogiveanyinformationrelativeto thehistoryof thisunknownpeople.Neitherdid thosewholivedthreehundredyearsago,whenAmericawasfirst discovered,leave anyaccountsfrom which evena hypothesiscouldbeformed. Traditions,thoseperishableyet ever recurrentmonumentsof the primitive world,do not provideanylight. There,however, thousandsof our fellowmenhave lived; onecannotdoubt that. Whendid theygo there,whatwastheir origin, their destiny, their history? Whenandhowdid theydisappear?No onecanpossiblytell.

28 How strangeit appearsthatnationshaveexistedandafterwardssocompletelydisappearedfrom theearththatthememoryevenoftheir namesis effaced! Their languagesare lost; their glory isvanishedlike a soundwithout an echo; thoughperhapsthereisnot onewhich hasnot left behindit sometomb in memoryof itspassage.Thusthemostdurablemonumentof humanlaboris thatwhichrecallsthewretchednessandnothingnessof man.

DOC SEGMENT1129 Although the vastcountry that I have beendescribingwas

inhabitedby manyindigenoustribes,it may justly be said,at thetimeof itsdiscoverybyEuropeans,tohaveformedonegreatdesert.The Indiansoccupiedwithout possessingit. It is by agriculturallabor that manappropriatesthe soil, and the early inhabitantsofNorthAmericalivedby theproduceof thechase.Their implacableprejudices,their uncontrolledpassions,their vices,andstill more,perhaps,theirsavagevirtues,consignedthemto inevitabledestruc-

tion. Theruin of thesetribesbeganfrom thedaywhenEuropeanslandedon their shores;it hasproceededever since,and we arenow witnessingits completion.Theyseemto havebeenplacedbyProvidenceamid therichesof the New World only to enjoy themfor aseason;theyweretheremerelytowait till otherscame.Thosecoasts,so admirablyadaptedfor commerceand industry; thosewide anddeeprivers;that inexhaustiblevalleyof theMississippi;thewholecontinent,in short,seemedpreparedto betheabodeofagreatnationyetunborn.

30 In that landthe greatexperimentof theattemptto constructsocietyupon a new basiswasto be madeby civilized man; andit wasthere,for the first time, that theorieshithertounknown, ordeemedimpracticable,wereto exhibit a spectaclefor which theworld hadnot beenpreparedby thehistoryof thepast.

Page 16: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

APPENDIXA. TOCQUEVILLE,CHAPTER1 114

andbruisedagainstthe neighboringcliffs, wereleft scatteredlikewrecksat their feet.

11 The valley of the Mississippi is, on the whole, the mostmagnificentdwelling-placepreparedby Godfor man’sabode;andyet it maybesaidthatatpresentit is but a mightydesert.

12 On the easternsideof the Alleghenies,betweenthebaseofthesemountainsandtheAtlantic Ocean,lies a long ridgeof rocksandsand,which the seaappearsto have left behindas it retired.Theaveragebreadthof this territory doesnot exceed48 leagues;but it is about300 leaguesin length. This part of the Americancontinenthasa soil thatoffersevery obstacleto thehusbandman,andits vegetationis scantyandunvaried.

13 Uponthis inhospitablecoastthefirst unitedefforts of humanindustryweremade. This tongueof arid land wasthe cradleofthoseEnglishcolonieswhichweredestinedonedayto becometheUnitedStatesof America. Thecenterof powerstill remainshere;while to thewestof it thetrueelementsof thegreatpeopletowhomthe future control of the continentbelongsaregatheringtogetheralmostin secrecy.

DOC SEGMENT514 WhenEuropeansfirst landedontheshoresof theWestIndies,

andafterwardson thecoastof SouthAmerica,theythoughtthem-selvestransportedinto thosefabulousregionsof which poetshadsung.Theseasparkledwith phosphoriclight, andtheextraodinarytransparencyof its watersdisclosedto the view of the navigatorall thedepthsof theocean.Hereandthereappearedlittle islandsperfumedwith odoriferousplants,andresemblingbasketsof flow-ersfloatingon thetranquilsurfaceof theocean.Everyobjectthatmetthesight in this enchantingregionseemedpreparedto satisfythe wantsor contribute to the pleasuresof man. Almost all thetreeswere loadedwith nourishingfruits, and thosewhich wereuselessasfood delightedthe eyeby the brillianceandvariety oftheircolors.In grovesof fragrantlemontrees,wild figs,floweringmyrtles,acacias,andoleanders,whichwerehungwith festoonsofvariousclimbingplants,coveredwith flowers,amultitudeof birdsunknownin Europedisplayedtheirbrightplumange,glitteringwithpurpleandazure,andmingledtheirwarblingwith theharmonyofaworld teemingwith life andmotion.

15 Underneaththis brilliant exteriordeathwasconcealed.Butthis fact was not then known, and the air of theseclimateshadan indefinableenervatinginfluence,which mademancling to thepresent,heedlessof thefuture.

16 NorthAmericaappearedunderavery differentaspect:thereeverythingwasgrave,serious,andsolemn;it seemedcreatedto bethedomainof intelligence,astheSouthwasthatof sensualdelight.A turbulentandfoggy oceanwashedits shores.It wasgirt roundby a belt of granitic rocksor by wide tractsof sand. The foliageof its woodswasdarkandgloomy, for theywerecomposedof firs,larches,evergreenoaks,wild olive trees,andlaurels.

DOC SEGMENT617 Beyondthisouterbeltlaythethickshadesof thecentralforests,

wherethelargesttreeswhichareproducedin thetwo hemispheresgrow sideby side.Theplane,thecatalpa,thesugarmaple,andtheVirginianpoplarmingledtheirbrancheswith thoseof theoak,thebeech,andthelime.

18 In these,asin the forestsof theOld World, destructionwasperpetuallygoingon. The ruins of vegetationwereheapedupononeanother;but therewasno laboringhandto removethem,andtheir decaywasnot rapid enoughto makeroomfor the continualwork of reproduction.Climbing plants,grasses,andotherherbsforcedtheirwaythroughtthemassof dying trees;theycreptalongtheirbendingtrunks,foundnourishmentin theirdustycavities,andapassagebeneaththelifelessbark. Thusdecaygaveits assistance

to life, andtheirrespectiveproductionsweremingledtogether. Thedepthsof theseforestsweregloomyandobscure,anda thousandrivulets,undirectedin theircourseby humanindustry, preservedinthema constantmoisture. It wasrareto meetwith flowers,wildfruits, or birdsbeneaththeir shades.Thefall of a treeoverthrownby age,therushingtorrentof a cataract,thelowing of thebuffalo,andthe howling of the wind werethe only soundsthat brokethesilenceof nature.

19 To the eastof thegreatriver the woodsalmostdisappeared;in their steadwere seenprairies of immenseextent. WhetherNaturein herinfinite varietyhaddeniedthegermsof treesto thesefertile plains,or whethertheyhadoncebeencoveredwith forests,subsequentlydestroyedby the handof man, is a questionwhichneithertraditionnor scientificresearchhasbeenableto answer.

DOCSEGMENT720 Theseimmensedesertswerenot,however,wholly untenanted

by men.somewanderingtribeshasbeenfor agesscatteredamongtheforestshadesor on thegreenpasturesof theprairie. Fromthemouthof theSt. Lawrenceto theDeltaof theMississippi,andfromtheAtlantic to the PacificOcean,thesesavagespossessedcertainpointsof resemblancethat borewitnessto their commonorigin;but at the sametime they differedfrom all otherknown racesofmen;theywereneitherwhite like the Europeans,nor yellow likemostof the Asiatics,nor black like the Negroes.Their skin wasreddishbrown, their hair long andshining,the lips thin, andtheircheekbonesvery prominent. The languagesspokenby the NorthAmericantribeshaddifferentvocabularies,butall obeyedthesamerulesof grammar. Theserulesdifferedin severalpointsfrom suchashadbeenobservedto governtheorigin of language.Theidiomof the Americansseemedto betheproductof new combinations,andbespokeaneffort of theunderstandingof whichtheIndiansofourdayswouldbeincapable.

DOCSEGMENT821 Thesocialstateof thesetribesdifferedalsoin manyrespects

from all thatwasseenin theOld World. Theyseemto havemulti-pliedfreely in themidstof theirdeserts,withoutcomingin contactwith otherracesmorecivilized thantheir own. Accordingly, theyexhibitednoneof thoseindistinct, incoherentnotionsof right andwrong,noneof thatdeepcorruptionof manners,which is usuallyjoinedwith ignoranceandrudenessamongnationswho,afterad-vancingtocivilization,haverelapsedintoastateof barbarism.TheIndianwasindebtedto no onebut himself; his virtues,his vices,andhisprejudiceswerehisownwork; hehadgrown upin thewildindependenceof hisnature.

DOCSEGMENT922 If in polishedcountriesthelowestof thepeoplearerudeand

uncivil, it is notmerelybecausetheyarepoorandignorant,but be-cause,beingso,theyarein daily contatwith rich andenlightenedmen. The sightof their own hardlot andtheir weakness,whichis daily contrastedwith thehappinessandpowerof someof theirfellow creatures,excitesin their heartsat thesametime thesenti-mentsof angerandof fear: the consciousnessof their inferiorityandtheir dependenceirritateswhile it humiliatesthem. This stateof mind displaysitself in their mannersandlanguage;theyareatonceinsolentand servile. The truth of this is easily provedbyobservation: thepeoplearemorerudein aristocraticcountiesthanelsewhere;in opulentcitiesthanin rural districts. In thoseplaceswheretherich andpowerful areassembledtogether, theweakandtheindigentfeel themselvesoppressedby their inferior condition.Unableto perceiveasinglechanceof regainingtheirequality, theygiveupto despairandallow themselvesto fall below thedignity ofhumannature.

23 This unfortunateeffect of the disparityof conditionsis not

Page 17: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

113

Appendix A

Tocqueville, Chapter 1

Thetext of Democracy in America, byAlexisdeTocqueville, 1835,Volume1 Chapter1.

DOC SEGMENT11 North America presentsin its externalform certaingeneral

featureswhich it is easyto distinguishat thefirst glance.2 A sortof methodicalorderseemsto haveregulatedthesepara-

tion of landandwater, mountainsandvalleys.A simplebut grandarrangementis discoverableamidtheconfusionof objectsandtheprodigiousvarietyof scenes.

3 Thiscontinentis almostequallydividedinto two vastregions.One is boundedon the north by the Arctic Pole,andon the eastandwestby the two greatoceans.It stretchestowardsthe south,forming a triangle,whoseirregularsidesmeetat lengthabovethegreatlakesof Canada.Thesecondregionbeginswheretheotherterminates,andincludesall the remainderof the continent. TheoneslopesgentlytowardsthePole,theothertowardstheEquator.

4 Theterritory includedin thefirst regiondescendstowardsthenorth with a slopeso imperceptiblethat it may almostbe saidtoform a plain. Within the boundsof this immenselevel tract thereare neitherhigh mountainsnor deepvalleys. Streamsmeanderthroughit irregularly; greatrivers intertwine,separate,andmeetagain,spreadinto vastmarshes,losing all traceof their channelsin the labyrinthof waterstheyhave themselvescreated,andthusat length,afterinnumerablewindings,fall into thePolarseas.Thegreatlakeswhichboundthisfirst regionarenotwalledin, like mostof thosein the Old World, betweenhills androcks. Their banksareflat andrisebut a few feetabovethelevel of theirwaters,eachthusforming a vastbowl filled to the brim. Theslightestchangein thestructureof theglobewouldcausetheirwatersto rusheithertowardsthePoleor to thetropicalseas.

DOC SEGMENT25 The secondregionhasa more brokensurfaceand is better

suitedfor the habitationof man. Two long chainsof mountainsdivide it, from oneto theother:one,namedtheAllegheny, followsthedirectionof theshoreof theAtlanticOcean;theotheris parallelwith thePacific.

6 The spacethat lies betweenthesetwo chainsof mountainscontains228,843squareleagues.Its surfaceis thereforeaboutsixtimesasgreatasthatof France.

7 This vast territory, however, forms a singlevalley, onesideof which descendsfrom theroundedsummitsof the Alleghenies,while the otherrisesin anuninterruptedcourseto the topsof theRockyMountains.At thebottomof the valleyflows an immenseriver, intowhichyoucansee,flowing fromall directions,thewatersthat comedown from the mountains. In memoryof their native

land, the Frenchformerly called this river the St. Louis. TheIndians,in their pompouslanguage,have namedit the FatherofWaters,or theMississippi.

DOCSEGMENT38 The Mississippitakesits sourceat the boundaryof the two

great regionsof which I have spoken,not far from the highestpoint of theplateauthatseparatesthem. Nearthesamespotrisesanotherriver, which emptiesinto thePolarseas.Thecourseof theMississippiis at first uncertain:it windsseveraltimestowardsthenorth,whenceit rose,andonly at length,afterhaving beendelayedin lakesandmarshes,doesit assumeits definitedirectionandflowslowly onwardto thesouth.

9 Sometimesquietly gliding along the chalky bed that naturehasassignedto it, sometimesswollenby freshets,theMississippiwatersover 1,032leaguesin its course. At the distanceof 600leaguesfromitsmouththisriverattainsanaveragedepthof 15feet;andit is navigatedby vesselsof 300tonsfor acourseof nearly200leagues.Onecounts,amongthetributariesof theMississippi,oneriver of 1,300leagues,oneof 900,oneof 600,oneof 500,four of200,not to speakof a countlessmultitide of small tramsthatrushfrom all directionsto minglein its flow.

DOCSEGMENT410 Thevalleywhich is wateredby theMississippiseemsto have

beencreatedfor it alone,andthere,like agodof antiquity, theriverdispensesbothgoodandevil. Nearthestreamnaturedisplaysaninexhaustiblefertility; thefartheryougetfrom its banks,themoresparsethevegetation,thepoorerthesoil, andeverythingweakensor dies.Nowherehavethegreatconvulsionsof theglobeleft moreevident tracesthan in the valley of the Mississippi. The wholeaspectof the country shows the powerful effects of water, bothby its fertility andby its barrenness.Thewatersof the primevaloceanaccumulatedenormousbedsof vegetablemold in thevalley,whichtheyleveledastheyretired.Upontheright bankof theriverare found immenseplains, as smoothas if the tiller hadpassedover them with his roller. As you approachthe mountains,thesoil becomesmoreandmoreunequalandsterile; the groundis,asit were,piercedin a thousandplacesby primitive rocks,whichappearlike thebonesof askeletonwhosefleshhasbeenconsumedby time. Thesurfaceof the earthis coveredwith a granitic sandand irregular massesof stone,amongwhich a few plants forcetheirgrowth andgive theappearanceof a greenfield coveredwiththe ruins of a vastedifice. Thesestonesand this sanddisclose,on examination,a perfectanalogywith thosethat composethearid andbrokensummitsof the RockyMountains. The flood ofwaterswhichwashedthesoil to thebottomof thevalleyafterwardscarriedawayportionsof the rocksthemselves;andthese,dashed

Page 18: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER6. CONCLUSIONSAND FUTUREWORK 112

Theexamplesanddiscussionin this dissertationsuggestthatsucha collectionshouldbecognizantof issuesrelatingto termdistribution: relevancejudgmentsshouldindicatewhattopicalor distributionalrole thequerytermsareto playwithin theretrieveddocuments.

Page 19: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER6. CONCLUSIONSAND FUTUREWORK 111

vary dependingon the query. This supportsmy conjecturethat the criteria uponwhichrankingis basedshouldbe shown explicitly in sucha way that userscanmakeinformeddecisionsaboutwhich documentsto view. It alsolendspartial supportto my hypothesesaboutthemeaningsof variouspatternsof termdistribution.

I havealsodescribedanew interface,calledCougar, thatallowsusersto view retrieveddocumentsaccordingto multiplemaintopicassignments.Cougarhasanappealinginterac-tivecomponentthatallowsusersto seethemaintopiccontext in whichretrieveddocumentsareused,providing themwith informationthatcanbemissingin theTileBardisplay. Bothdisplaytoolsneedto beintegratedintooneinterface,alongwith othertext analysisfacilitiesto allow fastassimilationof thecontentsof retrievedinformation. Together, therepresen-tationof maintopicandsubtopicstructuralinformationprovidesapowerful new paradigmfor interpretingtheresultsof queriesagainstfull-text collections.

TheTextTiling work introducesa new granularityof analysisfor thetext segmentationtask. Evidencefor the usefulnessof multi-paragraphsegmentsis increasing,despitethefact that this is a nontraditionaldiscourseunit. Aside from theusein informationaccessdescribedhere, other potential applicationsof multi-paragraphsegmentsare multiple-window text displaysand text window identificationfor corpus-basednatural languageprocessingalgorithms.

This work shouldbeextendedin severaldirections.First, I planto moreformally testsomeof the embellishmentsto the TextTiling algorithm. I am particularly interestedintrying differentways to integratethesauraltermsinto the algorithm(perhapssimply byusinga goodonlinethesaurus,shouldonebecomegenerallyavailable). I would alsoliketo improve theresultsin thecasesin which theevidencefor oneparagraphboundaryoveranotheris weak(e.g.,whena valley in the similarity scoreplot falls within a paragraph,or whentwo shortparagraphsareadjacentto oneanotherneara valley). Oneapproachisto try simplediscoursecues;anotheris to makea morelocalizedanalysisof termoverlapwhentheboundarychoiceis unclear. Still anotheralternative is to find a way to expresstile overlap,especiallywhentransitionstakesplacemid-paragraph.

I would alsolike to formally compareTextTiles to paragraphsin sometask. If tilesperformbetter, or for that matter, no worsethanparagraphsin informationaccesstasks,then tiles are preferablefor the simple reasonthat they are lessexpensive to storeandprocesssimply becausetherearefewer tiles thanparagraphsper document(if positionalinformationwithin tiles or paragraphsis not important).

I wouldalsolike toformallyevaluateTileBarsin termsof theirusein relevancefeedbackandwith respectto how usersinterpretthemeaningof thetermdistributions.Theanalysiscouldcompareuser’s expectationsaboutthemeaningof thetermdistributionsagainsttheanalysisshown in thechartof Chapter3. It maybeusefulto determinein whatsituationsthe users’expectationsarenot met, in hopesof identifying what additionalinformationshouldbeaddedin orderto preventmisconceptions.

Both displaymechanismsdescribedhereshouldbe extendedto work with texts thatalreadydohave somehierarchicalstructurebuilt in.

Theinformationaccesscommunityneedsto developapassageretrieval testcollection.

Page 20: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

110

Chapter 6

Conclusions and Future Work

In this dissertationI have introducednew waysto view andanalyzethe structureoffull-text documents.I haveinvestigatedtheroleof contextualinformationin theautomatedretrieval anddisplayof full-text documents,usingcomputationallinguisticsalgorithmstoautomaticallydetectstructurein andassigntopic labelsto texts. I have shown how, forthepurposesof informationaccess,full texts arequalitatively differentfrom abstracts,andhave suggestedthat as a consequence,full text requiresnew approachesto informationaccess.As a first step,I have suggestedthe examinationof patternsof term distributionin long texts, andhave shown how thesepatternsareusefulboth for recognizingsubtopicstructureandfor describingthe resultsof a query. I have alsoargued,following Cuttinget al. (1990),that themechanismsfor queryinganddisplayingdocumentsshouldreceiveasmuchattentionastheretrieval algorithm,andthatall threecomponentsshouldmutuallyreinforceoneanother.

I have describedanalgorithm,calledTextTiling, thatuseslexical frequency anddistri-bution to identify thesubtopicstructureof expositorytexts. Thecurrentlymostsuccessfulversionof this algorithmrequiresonly a shortstoplistanda morphologicalanalyzerandanalyzesabout20megabytesof text anhour(includingtokenization).I havealsodescribedan algorithm that assignsmultiple main topic categories to long texts. This algorithmrequiresapre-definedcategorysetanda trainingrunbut doesnotrequirepre-labeledtexts,which are much harderto comeby than category sets. Both algorithmsare comparedagainstreaderjudgmentsandarefoundto performwell, althoughnot flawlessly, on theseapproximatetasks.

I have also presenteda framework for the interpretationof query term distributionpatternswithin full text documents.Thisanalysisleadsto anew interfaceparadigm,calledTileBars,thatprovidesa compactandinformativeiconic representationof thedocuments’contentswith respectto thequeryterms.TileBarsallow usersto makeinformeddecisionsaboutnotonlywhichdocumentstoview, butalsowhichpassagesof thosedocuments,basedonthedistributionalbehavior of thequerytermsin thedocuments.I havedemonstratedtheuseof TileBarsin ananalysisof someof theTRECqueries(Harman1993). In thecourseof this analysisI showedthatthepatternsof termdistributionsfor relevantdocumentscan

Page 21: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 109

5.4 Conclusions

A full-fledgedinformationaccesssystemshouldconsistof several differenttools forqueryformulation,documentindexing, datasetselection,andcharacterizationof retrievalresults.I havedescribedaninformationaccesssituation– passageretrieval from full-lengthtexts– in whichuserscanbenefitfrom aninterfacethatdisplaysinformationaboutthemaintopiccontexts of theretrieveddocuments.

Usersrequestingpassagesfrom long texts do not know the contexts from which thepassageswereextracted. Existingapproacheseither(i) show how similar documentsareto oneanotheror thequery, or (ii) requireusersto specifytermsor attributesto organizethe resultingdocumentsaround. I have describedproblemswith both approachesandsuggestedthat retrieval resultsbe displayedin termsof multiple independentattributesthatcharacterizethemaintopicsof the texts, andthat thesystemvolunteerdisplayof therelevantattributes,ratherthanrequiretheuserto guessthem.

Theattributesor categoriescanvarydependingonwhatkind of informationisavailableand/orappropriatefor thecorpus.I havesuggestedassigningcategoriesthatcharacterizethemaintopicsof longtexts,andhavedescribedanalgorithmthatcandosowith somedegreeof successwithout requiring pre-labeledtexts. I anticipateimprovementin automatedcategoryassignmentalgorithmsin future.

A consequenceof allowing multiplecategoriesto beassignedto documentsis thattheymakethedisplayproblemamulti-dimensionalone.To handlethis, I suggestamechanismthatgivestheusersomecontrolover which categoriesareat the focusof attentionat anygiventime,andasimplewayto seehow theretrieveddocumentsarerelatedto oneanotherwith respectto thesecategories.

I have implementeda prototypeof this displayparadigm;it illustratesthemainpointsbehindthe ideaspresentedherealthoughuserevaluationstudiesremainto be done. Infuture I plan to incorporatethesemechanismsinto an interfacefor for queryingagainstsubtopicstructure,andfor allowing queriesto specifysubtopictermswith respectto maintopiccategories,like thatdescribedin Section3.

AlthoughI have not formally evaluatedtheCougardisplay, anecdotaluserreactionispositive. Usersfind appealingtheability to switchamongthecategoryassignmentsandseethe resultingtopic intersections.Whenthe topic assignmentsareincorrect,however, thetool is probablyworsethanno tool at all. Furthermore,thehighest-rankedcategoriesfor adocumentarethosethataresimilar in meaning.Thisdetractsfrom thegoalof showing theinteractionof thedisparatemaintopics.Thusanimprovedcategory assignmentalgorithmshouldimprove theappealof thetool.

Page 22: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 108

advancea set of terms that characterizedocumentsfrom a collection of bibliographicrecords. Whenthe userissuesa query, the systemretrievesdocumentsthat containthetermsof the query (restrictingthe numberof documentsthat are displayedat any onetime). Additional termsthatarestronglyassociatedwith theretrieveddocumentsarealsoretrieved.Thesystemdisplaysthreerowsof nodescorrespondingtotheassociatedterms,thedocuments,andtheauthorsof thedocuments,respectively. Thetermnodesareconnectedto thedocumentnodesvia edgelinks, sotheusercanseewhich documentsareassociatedwith eachimportantterm. Only thosetermsrelevantto theretrieveddocumentsareshown,althoughthedocumentsretrievedareinfluencedto someextentby which associatedtermsareretrieved. Figure5.6is asketchof theinterface’soutputwhenpresentedwith thequery((:TERM “ASSOCIATIVE”)(:A UTH “ANDERSON,J.A.”)).

PARALL

EL

MEM

ORY

ASS

OCIA

TI

MODEL

INFO

RM

ATI

NET

WORK

PROCES

SIN

SELE

CTI

V

NEU

RON

ANDER

SON

HIN

TON

MOZER

BAR99 WIC69 WIL81 FIN79 AND81 HIN84 AND81 BAR81 KOH81

CATE

GORIZ

Figure5.6: A sketchof theAIR systeminterface(Rose& Belew 1991).

The AIR interfacediffers from that suggestedhere in that it is not gearedtowarddisplayingsubsetsof interactingattributes. For this reason,it appearsthat if therearea large numberof links betweenassociatedtermsanddocuments,or if the links arenotneatlyorganized,the relationshipswill be difficult to discern. Furthermore,categorizinginformationisnotgearedtowardcharacterizingfull-text documents.However, theapproachpresentedheremight benefit by incorporatingan option to display the categories anddocumentsin asimilarmanner.

Similarly, ratherthanusingaVenndiagramdisplay, thefour-attributeInfoCrystal(Spo-erri 1993)mightbeausefulalternative,appliedassuggestedhereto displaysubsetsof therelevantcategories.

Page 23: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 107

many differentcontexts.Document42,at theintersectionof governmentandweapons,discussesa government

proposaltocleanupanuclearweaponsproductioncomplex. Documents8,26,and44,attheintersectionof physicsandweapons,discussthereopeningof aplutoniumprocessingplant,obstaclesto the developmentof orbiting nuclearreactors,andmodernizationof nuclearreactors.Document13describesanuclearwasteleak,document38 therisksof thelaunchof asatellitecontainingplutonium,anddocument30discussestheReaganadministration’srecordin treatingtheozonelayer.

Onearticle labeledwith ships, bodies of water, andnature describestheeffectsof anoil spill on birdlife. Articles labeledwith the food category includetwo aboutanincidentof cyanidepoisoningin yogurt. Notethat if a userwereinterestedin documentsthat talkaboutcontaminationin food,in orderto discover thisarticleusingkeywordsalone,theuserwould have hadto specifyall food termsof interest.However, with appropriatecategoryinformationthis is notnecessary.

Categories to Determine Relevance of Keywords

In the next example,only eight of the top fifty retrieved documentsin responseto aqueryonthewordcattlearelabeledwith thehigher-levelcategorythatcorrespondsto cattle(herd animals). Most of thosethatarenot labeledwith herd animals areaboutfinancialmattersrelating to cropsand foods (e.g., crop futures). Two of thosethat are labeledwith herd animals, whenintersectedwith meat describecattlein therole of livestock,thethird describesacattledrive,andthefourth,whoseothercategory labelsarecountries andbodies of water, hasonly apassingreferenceto cattleandreallydescribesamurderrelatedto landownershipof tropicalrainforests.

By contrast,retrieving on the word cow resultsin articlesabout land disputeswithNative Americans(at theintersectionof government, herd animals, andlegal system) andgrazingfees. Onedocumentthatis not labeledwith herd animals but insteadwith crime,weapons, anddefense, hasonly apassingreferenceto cowsandis abouta robbery.

Thusthecategoriescanbeusedto show whetheror not a searchtermis actuallywell-representedin atext. If thetext isnotassignedthecategorythatthesearchtimeisamemberof, thenthis is astrongindicatorthatthetermis only discussedin passing.

The setof 106 generalcategoriesusedto characterizethe AP datawasderived fromWordNet(Miller et al. 1990)asdescribedin Chapter4. Thealgorithmhasalsobeentrainedon a collectionof computersciencetechnicalreportsusinga setof 11 categoriesderivedfrom a looseinterpretationof theACM ComputingReviewsclassifications.

5.3.4 Discussion

The AIR/SCALIR system(Rose& Belew 1991) hasan interfacethat most closelyincorporatesthe goalsset forth here. The systemallows for very simple queries,andprovidesa kind of contextualizing information. A connectionistnetworkdeterminesin

Page 24: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 106

Figure5.4: TheCougarinterface.

uncoloredandthedisplayeddocumentIDs to disappear. Alternatively, theusercanchooseanadditionalcategory, causinganadditionalring to bepaintedandfilled in with documentIDs. If any of theretrieveddocumentshave beenassignedbothof theselectedcategories,their ID numbersaredisplayedin theappropriateintersectionregion. Onceall threeringshave beenassignedcategories,theusermustunselectonecategory beforeselectinga newone.In this wayuserscaneasilyvarywhich subsetof thecategorysetsis active.

Keywords in Context

Figure 5.4 shows a configurationin which all threecategories have beenselected.Bearing in mind that the documentsretrieved are onesin which the term contaminantappears,wecanexaminethekindof context providedbythecategoryinformation.Themostfrequentlyassignedcategoriesincludefinance, government, meat, legal system, commerce,weapons, food, andvehicles. As thecategoriesimply, discussionsof contaminantsoccurin

Page 25: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 105

share. Note that differentdocumentscanbe groupedtogetherasbeingsimilar basedonwhich categoriesarebeinglookedat. E.g., if onedocumentis aboutthecostof removingcontaminantsfrom foodandanotherthecostof removing contaminantsfrom anecologicaldisaster, whenviewedaccordingto thefinancecategory they haveanintersection,whereasif thefinancecategoryis notselected,thetwo documentsdonotappearto havesimilarities.

This particularcut on how to displayinformationbeginswith a fixedsetof categories,membershipin whichisdesignedtocorrespondtousers’intuitions.Of coursethisapproachis flawed,bothbecausenoonesetof categorychoicesisgoingto fit everydocumentsetandbecauseuserswill have to guesswhatcategorizationaccordingto thetopic really means.Nevertheless,I positthatthisapproachisbetterthanrequiringtheusertoguesswhyagroupof longdocumentshavebeenlabeledasbeingsimilarto oneanother, andbetterthansimplylookingata list of titles rankedby vector-spacebasedsimilarity to thequery. Furthermore,sinceusersdonot have to specifyin advancewhichcategoriesareof interest,they arelesslikely to missinterestingdocumentsjust becausetheir understandingof theclassificationprocedureis inaccurate.

In Cougar, documentsareassignedafixednumberof categoriesfrom apre-determinedset using the automaticcategorizationalgorithmdescribedin Chapter4. In the currentsystemeachdocumentis assignedits threetop-scoringcategories.Thedocumentsarethenindexedonthecategory informationaswell asonall (non-stopword)lexical itemsfrom thetitle andthebody. Indexing andretrieval is currentlydoneusingCornell’s Smartsystem(Salton1971),althoughthis will soonchangeto theindexing structureusedin Chapter3(Section3.4.3).TheinterfacewascreatedusingTcl/Tk (Ousterhout1991).

Two datasetshave beenassignedcategoriesand indexed. The first is a subsetof acollectionof AP news articlestakenfrom monthof 1989from the TIPSTERcollection(Harman1993)andis indexedwith thegeneralcategory setdescribedin Chapter4. Thesecondisacollectionof computersciencetechnicalreports,partof theCNRICS-TRprojectcollection,andis indexedwith computer-relatedcategories.

Usersissuequeriesby enteringwordsor selectingcategoriesfrom anavailablelist. Asmentionedabove,typically theuseronlyentersterminformation.After theuserinitiatesthesearcha list of titles of thetop-scoringdocumentsappears.Thenumberof titlesdisplayedis a parameterthat is set in Smart;currently50 documentsareretrieved at a time. Thetopthreecategoriesfor eachdocumentarealsoretrievedandthemostfrequentlyoccurringof thesearedisplayedin a bankof color-codedbuttonsabove a Venndiagramskeleton.The userselectsup to threeof the categoriesandseeshow the documentsintersectwithrespectto thosecategories.Onecategory canbeunselectedin orderto allow theselectionof another;thedisplayof documentsin theVenndiagramchangesaccordingly.

Morespecifically, theuserselectsoneof thecategoriesby mouse-clickingonacategorybox. Thesystempaintsoneof theVenn-diagramringswith the correspondingcolor andplacesdocumentID numbersthathave beenassignedthiscategory into thepartof theringthat indicatesno intersectionwith othercategories. Clicking on anID numbercausesthecorrespondingtitle to behighlighted,anddouble-clickingbringsup a window containingthedocumentitself. Theusercannow unselectthis category, causingthe ring to become

Page 26: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 104

5.3.2 Displaying Main Topic Categories

Insteadof or alongwith frequentterminformation,category informationcanindicatethecontext in which retrievedpassagesreside.Assigningmultiple independentcategoriesallows for recognizingdifferentinteractionsamongdocuments:two topic categoriesthatarenotusuallyconsideredsemanticallysimilarcanneverthelessbeassociatedwith thesametext if it happensto beaboutbothtopics.

If multiple main topic categoriesareassociatedwith eachtext, userscanbrowsetheresultsof initial querieswith respectto these.Of course,thecategorysetsshouldbetailoredto the text collectionsthey areassignedto. For example,a userinterestedin local areanetworksmighttapintoageneral-interesttestcollection.In thiscase,whentheuserquerieson theword “LAN”, thesystemreturnsgeneralcategories,i.e. technology, finance,legal,etc. If theuseris interestedin, say, the impactof LAN technologyon thebusinessscene,thenthis datasetmaybeuseful.

If ontheotherhandtheuserwantstechnicalinformation,thecontextualizinginformationmakesit clearthatthesearchshouldbetakento anotherdataset.If thesamequeryonanewdatasetreturnscategorieslike file servers,networks,CAD, etc,thentheusercanconcludethata technicaldatasethasbeenfound,andcanmakesubsequentqueriesmoretechnicalinnature.

Library catalogsystemshave long providedcategorizationinformationin the form ofsubjectheadings.Researchershave reportedthat thesekindsof headingsoftenmismatchuserexpectations(Svenonius1986),(Lancaster1986). Noreaultet al. (1981)reporton anexperimentin whichverylittle overlapoccuredin searchresultsusingcontrolledvocabularyversusfreeterms,even thoughthesearchesweredoneby thesameprofessionalsearcher,in responseto the samequeriesissuedagainstthe samedataset.However, thereis alsoevidencethat whensuchsubjectheadinginformationis combinedwith free text search,resultsareimproved(Markey et al. 1982),(Henzler1978),(Lancaster1986). HereI amsuggestingthecombinationof category informationwith termsearchcapabilities.

5.3.3 A Browsing Interface

Becauseseveral categoriescanbeassociatedwith eachretrieveddocument,a methodfor browsingthismulti-dimensionalspaceis needed.Oneapproachto thedisplayof multi-dimensionalinformationis to providetheuserwith asimplewaytocontrolwhichattributesare seenat eachpoint in time. The interfacedescribedhereallows usersto view theresultsof thequerygraphically, accordingto theintersectionof assignedcategories,usinga Venndiagramparadigm.3 Theinterface,calledCougar, combineskeywordandcategoryinformation– userscansearchoneitherkind of informationor both(seeFigure5.4). Thisallowsusersto getafeelingfor documentsimilaritybasedonthemaintopiccategoriesthey

3Michard (1982)usesa Venndiagramin a studyaboutits effectivenessin helpingnovice userscreateBooleanqueries,usingthegraphicalnotionof intersectionto indicateconjunctionof terms.Thediagramisnotusedfor displayof resultsor for conjoiningmorethanthreeterms.

Page 27: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 103

corrosion type

material

environment

Figure5.3: TheCubeof Contents(Arents& Bogaerts1993).

by theuserwhengeneratinginitial queries.It alsoallowsfor anelementof serendipity, bothin termsof whichcategoriesaredisplayedandwhatkindsof interactionsamongcategoriesmayoccur. Thisalsopreventsclutterresultingfrom displayof attributesthatarenotpresentin any retrieveddocuments2.

5.3.1 Displaying Frequent Terms

An alternative to assigningdocumentspre-definedlabelsis to simply show the doc-uments’most frequentterms. Although top-frequency termsareoften very descriptive,problemswith usingtermfrequenciesarisewhenthecontentsof many differentdocuments(or their passages)aredisplayedsimultaneously. Oneproblemis that becausetherearemany differentwordsthatcontributeto theexpressionof oneconcept,it will oftenbethecasethattwo documentsthatdiscusssomeof thesamemaintopicswill have little overlapin the termsthey useto do so. This meansthat the display will not be able to revealoverlappingthemes.

Thesecondproblemis thatwithin thedisplayof themostfrequenttermsfor adocument,severaldifferenttermswill contributeto onetheme.For example,in a chapterof deToc-queville (1835),amongthemostfrequenttermsare: judicial, judge,constitution,political,case,court,justice,magistrateaswell as: American,authority, nation,state. Thusthereisconsiderableredundancy with respectto whatkind of informationis beingconveyedby thedisplayof themostfrequentterms.

2Althoughin domain-specificsituationsit maybeusefulto show theuserwhichattributesaremissing.

Page 28: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 102

A B

C D

1

2 12

222

84 90

4442295 12818

2942612

6

127

8490

234

2846 13623

862 3424

Figure5.2: TheInfoCrystal(Spoerri1993).

(3c) Theusermight selectattributesthat do not correspondto the retrieveddocuments,thusundercuttingthegoalof supplyinginformationaboutthedocumentsreturnedinresponseto ageneralquery.

Theseproblemscanbe easilyremedied;the point hereis that the standardgoal of suchsystemsis to facilitatequeryconstructionwith attributeinformation,ratherthanenhancingdisplayof retrieval results.Note,however, thatnoneof thesedisplayparadigmscanimpartthetermdistribution informationthatTileBarsdo (seeChapter3).

To summarizethis section,previous approachesto displayingretrieval resultseitherdisplaydocumentsin termsof theiroverall similarity to oneanother, in termsof similarityto clustersformedfrom thecorpusor theretrieval set,or in termsof attributespreselectedby theuser. I have discussedproblemswith eachof theseapproaches.Thenext sectionpresentsanalternative in which thesedrawbacksareeliminated.

5.3 Multiple Main Topic Display

As mentionedin theSection5.1,I proposeanapproachin whichmultiple independentcategoriesareassignedto the“main topics”of eachdocument1. I emphasizetheimportanceof displayingall andonly theattributesthatareactuallyassignedto retrieveddocuments,rather than requiring the userto specify in advancewhich topics are of interest. Thiscircumventsproblemsarisingfromerroneousguessesandreducesthementaleffort required

1In thisdiscussion,thetermsattribute, topic, andcategory areinterchangeable.

Page 29: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 101

andcontainaswell asequenceof subtopicaldiscussions,we have anew basisonwhich todeterminein whatwayslong documentsaresimilar to oneanother. This chapterfocusesonly on accountingfor main topic information; the recognitionof subtopicstructureforinformationretrieval is a problemuntoitself andis discussedin Chapter3.

5.2.2 User-specified Attributes

Many systemsshow therelationof thecontentsof thetexts to user-selectedattributes;theseincludeVIBE (Korfhage1991),theInfoCrystal(Spoerri1993),theCubeof Contents(Arents& Bogaerts1993),andthesystemof Aboudet al. (1993).

Thesesystemsrequirethe usersto selectwhich the classificationsthe displayshouldbeorganizedaround.Thegoalof VIBE (Korfhage1991)is to displaythecontentsof theentiredocumentcollection in a meaningfulway. The userdefinesN “referencepoints”(whichcanbeweightedtermsor termweights)whichareplacedin variouspositionsin thedisplay, anda documenticon is drawn in a locationthatindicatesthedistancebetweenthedocumentandall therelevantreferencepoints.

Two interestinggraphicalapproachesare the InfoCrystaland the Cubeof Contents.TheInfoCrystal(Spoerri1993)is asophisticatedinterfacewhichallowsvisualizationof allpossiblerelationsamongN attributes.Theuserspecifieswhich N conceptsareof interest(actuallyBooleankeywordsin the implementation,but presumablyany kind of labelinginformationwould beappropriate)andtheInfoCrystaldisplays,in aningeniousextensionof theVenn-diagramparadigm,thenumberof documentsretrievedthathaveeachpossiblesubsetof theN concepts.Whenthequeryinvolvesmorethanfourtermsthecrystalsbecomerathercomplicated,althoughthereis a provision to build up querieshierarchically. Figure5.2 shows a sketchof what theInfoCrystalmight displayastheresultsof a queryagainstfour keywordsor Booleanphrases,labeledA, B, C, andD. The diamondin the centerindicatesthatonedocumentwasdiscoveredthatcontainsall four keywords. Thetrianglemarkedwith “12” indicatesthat twelve documentswerefoundcontainingattributesA, B,andD, andsoon.

TheCubeof Contentsof (Arents& Bogaerts1993)isusedtohelpauserbuild aquerybyselectingvaluesfor upto threemutuallyexclusiveattributes(seeFigure5.3). Thisassumesa text pre-labeledwith relevant informationandan understandingof domain-dependentstructuralinformationfor the documentset. Note that this is usedto specify the queryalthoughit could be usedto characterizeretrieval resultsas well. Note that only oneintersectionof two or threeattributesis viewableatany time.

The systemof Aboudet al. (1993),allows the userto specifymultiple classcriteria,wheretheclassesarespecifiedin a hierarchy, to helpnarrow or expandthesearchset.

Thelimitationswith theseapproachesare:

(3a) The attributesin questionaresimply the keywordsthe userspecifiedin the query,andsodonotaddinformationaboutthecontentsof thetexts retrieved,and/or

(3b) Theusermustexpendeffort to choosetheattributesto bedisplayed,and/or

Page 30: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 100

(Deerwesteret al. 1990), (Chalmers& Chitson1992),all work by comparingthe entirecontentof a documentagainsttheentirecontentsof otherdocumentsor queries.

Thesemodesof comparisonare appropriateon abstractsbecausemost of the (non-stopword)termsin anabstractaresalientfor retrieval purposes,becausethey actasplace-holdersfor multipleoccurrencesof thosetermsin theoriginal text, andbecausethesetermstendto pertainto themostimportanttopicsin thetext. Whenshortdocumentsarecomparedvia thevector-spacemodelor clustering,they arepositionedin a multi-dimensionalspacewheretheclosertwo documentsareto oneanother, themoretopicsthey arepresumedtohave in common.This is oftenreasonablebecausewhencomparingabstracts,thegoal isto discover which pairsof documentsaremostalike. For example,a queryagainsta setof medicalabstractswhich containstermsfor the nameof a disease,its symptoms,andpossibletreatmentsis bestmatchedagainstan abstractwith as similar a constitutionaspossible.

A problemwith applying standardinformation retrieval methodsto full-length textdocumentsis that the structureof full-length documentsis quite different from that ofabstracts.Onewayto view anexpositorytext, asmentionedin Chapter3, is asasequenceof subtopicssetagainsta “backdrop” of oneor moremain topics. The main topicsof atext arediscussedin the document’s abstract,if oneexists, but subtopicsusuallyarenotmentioned.

Most long texts discussseveral main topicssimultaneously;thus,two texts with onesharedmaintopicwill oftendiffer in theirothermaintopics.Sometopicco-occurrencesaremorecommonthanothers;e.g.,terrorismis oftendiscussedin thecontext of U.S.foreignpolicy with theMiddle East,andthesetwo themesmightevenbegroupedtogetherin somedomain-specificontologies.However, textsoftendiscussthemesthatwouldnotusuallybeconsideredto bein thesamesemanticframe;for example,Morris (1988)includesanarticlethatdescribesterroristincidentsat Bolshoiballetperformances.Therefore,I hypothesizethatalgorithmsthatsuccessfullygroupshorttextsaccordingto theiroverallsimilarity (e.g.,clusteringalgorithms,vectorspacesimilarity, andLSI),will producelessmeaningfulresultswhenappliedto full-lengthtexts.

This hypothesisis supportedby the fact that recentlyresearchersexperimentingwithretrievalagainstdatasetsconsistingof longtextshavebeenbreakingthetexts into subparts,usuallyparagraphs,andcomparingqueriesagainsttheseisolatedpieces(e.g.,Saltonet al.(1993),Salton& Buckley (1992),Al-hawamdehet al. (1991)). Thesestudiesfind thatmatchingaqueryagainsttheentiretyof along text is lesssuccessfulthanmatchingagainstindividual pieces.As furtherevidence,Voorhees(1985)performedexperiments(on stan-dardshort-text collections)which foundthat theclusterhypothesisdid not hold; that is, itwasnot thecasethattheassociationsbetweenclustereddocumentsconveyedinformationabouttherelevanceof documentsto requests.

In summary, I claim thatwhenlongdocumentsaredisplayedaccordingto how similarthey arethroughout,it canbe difficult to discernwhy they weregroupedtogetherif thisgroupingis afunctionof someintermediatepositionin multi-dimensionalspace.If insteadwe recognizethat long texts canbe classifiedaccordingto several differentmain topics,

Page 31: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 99

The simplestapproachto displayingretrieval resultsis, of course,to list the titles orfirst linesof theretrieveddocuments.Onealternative, theTileBar display, is describedinChapter3. Othersystemsthatdomorethanthiscanbecharacterizedasperformingoneoftwo functions:

(1) Displaying the retrieved documentsaccordingto their overall similarity to otherretrieveddocuments,and/or

(2) Displayingthe retrieveddocumentsin termsof keywordsor attributespre-selectedby theuser.

Both of theseapproaches,andtheir drawbacks,arediscussedin the subsectionsthatfollow.

5.2.1 Overall Similarity Comparison

Severalsystemsdisplaydocumentsin whatcanbedescribedasasimilarity network.Afocusdocument,usuallyonethattheuserhasexpressedinterestin, is shown asanodein thecenterof thedisplay, anddocumentsthataresimilartothefocusdocumentarerepresentedasnodeslinkedby edgessurroundingthefocusdocumentnode.Heresimilarity is measuredin termsof the vectorspacemodelor a probabilisticmodel’s measureof probability ofrelevance.

Systemsof this kind include the Bead system(Chalmers& Chitson1992), whichdisplaysdocumentsaccordingto their similarity in a two-dimensionalrenditionof multi-dimensionaldocumentspace,I3R(Thompson& Croft1989)andthesystemof Fowleret al.(1991),whichdisplayretrieveddocumentsin networksbasedon interdocumentsimilarity.

A differentway to displaydocumentsaccordingto their inter-similarity is to clusterthe resultsof the retrieval andmakevisible the clustercentroidsandthe distanceof thedocumentsfrom eachcentroid.Scatter-Gather(Cuttinget al. 1992),(Cuttinget al. 1993)isaninnovative,query-freebrowsingtechniquethatallowsusersto becomefamiliar with thecontentsof acorpusby interactively clusteringsubpartsof thecollectionto createtable-of-contents-likedescriptions.This techniqueis very effective on shortertexts but, asarguedbelow, will probablybe lesseffective on collectionsof longertexts. Additionally, Scat-ter/Gatheremphasizesquery-freebrowsing,althoughit couldbeaugmentedwith Booleanandsimilarity search.

Drawbacks of Comparing Full-Length Texts

Most (non-Boolean)information retrieval systemsuse inter-documentsimilarity tocomparedocumentstoaqueryanddeterminetheirrelevance.Forexample,thevectorspacemodelof similarity search(Salton1988),clustering,e.g.,(Cuttinget al. 1992),(Griffithset al. 1986),andlatentsemanticindexing for determininginter-documentsimilarity, e.g.,

Page 32: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER5. MULTIPLE MAIN TOPICSIN INFORMATION ACCESS 98

� � � � � � � � � � � � � �� � � � � � � � � � �

� � � � � � � � � � �� � � � � � � �� � � � � � � �

� � � � � � � �� � � � � � � �

� � � � � � � �� � � � � � � �

Figure5.1: Retrieval of passagesfrom full-length text: thecontexts in which thelocalizeddiscussionstakeplacemaybeentirelydifferentfrom oneanother.

� The documents’contentsare representedby multiple independentattributes thatcharacterizethemaintopicsof thetext.� Thesystemdisplaysall andonly theattributesor topicsthatareassignedasa resultof thequery, asopposedto displayingdocumentsthatmeetpre-selectedattributes.� Thesystemallowsdisplayof interactionsamongtheattributes.

Thenext sectionexpandson thediscussionof relatedwork andexplainsthedrawbacksof the two mostcommonretrieval displayoptionswith respectto passageretrieval anddatasetfamiliarization. Section5.3 presentsan alternative approachwhich makesuseofcategory informationin orderto indicatethemain topic discussionsof texts. Section5.4summarizesthechapter.

5.2 Current Approaches

Textual information doesnot conform to the expectationsof sophisticateddisplayparadigms,suchasthoseseenin theInformationVisualizer(Robertsonet al. 1993).Thesetechniqueseitherrequiretheinput to bestructured(e.g.,hierarchical,for theConeTree)orscalaralongat leastonedimension(e.g.,for thePerspective Wall). However, theaspectsof adocumentthatsatisfythesecriteria(e.g.,a timelineof documentcreationdates)donotilluminatetheactualcontentof thedocuments.

Page 33: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

97

Chapter 5

Multiple Main Topics in InformationAccess

5.1 Introduction

In thisChapterI addresssomeissuesrelatingto displayof resultsof retrieval from full-text collections.I claim thatdisplayingqueryresultsin termsof inter-documentsimilarityis inappropriatewith long texts, andsuggestinsteadassigningcategoriesthatcorrespondto documents’main topics. I arguethat main topicsof long texts shouldbe representedby multiplecategories,sincein many casesonecategorycannotadequatelyclassifya text.Thedisplaymakesuseof theautomaticcategorizationalgorithmdescribedin Chapter4.I introduceCougar, a browsinginterfacethatpresentsa simplemechanismfor displayingmultiplecategory information.

An increasinglyimportantconcernto informationaccessis that of passageretrievalfrom full-text documentcollections.Full-lengthexpositorytextscanbethoughtof in termsof asequenceof subtopicaldiscussionstiedtogetherby oneor moremaintopicdiscussions(seeChapter3). Two differentpassages,both of which sharetermswith a query, mayoriginatein documentswith entirelydifferentmaintopicdiscussions.For example,Figure5.1 shows a sketchin which threedifferentpassage-level discussionsof volcanicactivitytakeplacein threedifferentmaintopiccontexts (explorationof Venus,Romanhistory, andtheeruptionof Mt. St. Helens).Usersshouldreceivesomeindicationof thecontexts fromwhich a setof retrievedpassagesoriginatedin orderto decidewhich passagesareworthfurtherscrutiny.

In the text retrieval scenarioof retrieval of passagesfrom long texts it is importanttosupply the userwith information that placesthe resultsin a meaningfulcontext. Mostexisting approachesto displayof retrieval resultscanbecharacterizedin two ways: all ofthereturneddocumentsaredisplayedeither(i) accordingto their overall similarity to oneanother, or (ii) in termsof user-selectedkeywordsor attributesthey areassociatedwith. Isuggestanalternativeviewpointwith thefollowing characteristics:

Page 34: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 96

United States ConstitutionOriginalCategories Super-Categories

0 assembly(court,legislature) legal system1 dueprocessof law government2 legal documentlegal instrument politics3 administrative unit conflict4 body(legislative) crime5 charge(taxes) finance6 administratordecisionmaker socialstanding7 documentwritten document honesty8 approval (sanction,pass) communication

GenesisOriginalCategories Super-Categories

0 deitydivinity god religion1 relative relation(mother, aunt) breads2 worship mythology3 manadult male people4 professional socialoutcasts5 happinessgladnessfelicity socialgroup6 womanadult female psychologicalstate7 evildoing transgression personality8 literary composition literature

Figure4.7: Comparisonof originalandsupercategoriesfor two well-known texts.

Page 35: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 95

RaisaGorbachev articleOriginalCategories Super-Categories

0 womanadult female socialstanding1 statussocial state education2 manadult male politics3 political orientationideology legal system4 forcepersonnel people5 charge psychologicalstate6 relationship socializing7 fear socialgroup8 attitude personalrelationship9 educatorpedagogue government

MagellanspaceprobearticleOriginalCategories Super-Categories

0 celestialbodyheavenly body outerspace1 mollusk genus light andenergy2 electromagneticradiation atmosphere3 layer(surface) landterrafirma4 atmosphericphenomenon physics5 physicalphenomenon arrangement6 goddess shapes7 naturaldepressiondepression waterand liquids8 rockstone properties9 space(hole) amounts

Figure4.6: Comparisonof originalandsupercategories.

Page 36: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 94

d w e l l i n g

r e a s o n i n g

g e o l o g i c a l _ t i m e

t i m e

h o n o r a b l e n e s s

d i s h o n o r a b l e n e s s

m a n n e r

c h r o m a t i c _ c o l o r

s o u n d _ p r o p e r t y

t a s t e _ p r o p e r t y

m e d i a

t r a n s m i s s i o n

a r t i f i c i a l _ l a n g u a g e

S i n o - T i b e t a n

G e r m a n i c

I ta l i cI n d o - I r a n i a n

I n d o - E u r o p e a n

U r a l - A l t a i c

D r a v i d i a n

A f r o - A s i a t i c

m u s i c a l _ n o t a t i o n

w r i t i n g

l i t e r a r y _ c o m p o s i t i o n

t e x t

s a c r e d _ t e x t

l is t

r e c o r d

c o m m e r c i a l _ d o c u m e n t

l e g a l _ d o c u m e n t

d o c u m e n t

d r a m a t i c _ c o m p o s i t i o n

w r i t t e n _ c o m m u n i c a t i o n

a c k n o w l e d g m e n t

d a t a b a s e

a p p r o v a l

f a l s e h o o d

m e s s a g e

e x p r e s s i v e _ s t y l e

u t t e r a n c e

s p e e c h

c o m m u n i c a t i o n

l i n g u i s t i c _ r e l a t i o n

m a g n i t u d e _ r e l a t i o n

c o m p a s s _ p o i n t

d i r e c t i o n

s a m e n e s s

e l e c t r o m a g n e t i c _ u n i t

m o n e t a r y _ u n i t

r e a l _ n u m b e r

c o n t a i n e r f u l

h o l i d a y

d a y

c a l e n d a r _ m o n t h

t i m e _ p e r i o d

t i m e _ u n i t

s c o m b r o i d

s c o r p a e n o i d

s p i n y - f i n n e d _ f i s h

a c c o u n t

a c i d

a c q u i s i t i o n

k i l l i ng

t r a v e l

a t h l e t i c _ g a m e

s p o r t

s o c i a l _ d a n c i n g

d a n c i n g

m u s i c

s t r o k e

m a n e u v e r

c a r e

c r i m e

d i s h o n e s t y

w r o n g d o i n g

c o n s u m p t i o n

e d u c a t i o n

c r e a t i o n

c e r e m o n y

w o r s h i p

a t t a c k

o p e r a t i o n

m i l i t a r y _ a c t i o n

c o m m e r c i a l _ e n t e r p r i s e

c o m m e r c e

m a n a g e m e n t d u e _ p r o c e s s

E u r o p e a n _ c o u n t r y

A f r i c a n _ c o u n t r y

A s i a n _ c o u n t r y

c o u n t r y

A m e r i c a n _ s t a t e

s t a t e

a d m i n i s t r a t i v e _ u n i t

a d m i n i s t r a t o r

j e w e l r y

e d u c a t o r

m e d i c a l _ p r a c t i t i o n e r

p r o f e s s i o n a l

w o m a n

m a n

a n o m a l y

c r i m i n a l

m e r c h a n t

b u s i n e s s p e r s o n

w r i t e r

a t h l e t e

c o n t e s t a n t

a r t i s t

l a w m a n

c o m b a t a n t

d i s p u t a n t

m u s i c i a n

p e r f o r m e r

s c h o l a r

s c i e n t i s t

i n te l l ec tua l

s p i r i t u a l _ l e a d e r

a r i s t o c r a t

c i t i z e n

p e e r

r e l a t i v e

r e l i g i o n i s t

u n p l e a s a n t _ p e r s o n

e m p l o y e e

c r a f t s m a n

m i l i t a r y _ o f f i c e r

s e r v i c e m a n

s k i l l e d _ w o r k e r

w o r k e r

d e i t y

a g e n t

w i n d

a i r c r a f t

w i n e

l i quo r

a l c o h o l

a l g a

c o u r s e

s o u pd i s h

c a n d yd a i n t y

p a s t r y

c o o k i e

c a k e

q u i c k _ b r e a d

b r e a d

cu t

m e a t

s e a _ f i s h

s h e l l f i s h

s e a f o o d

c h e e s ed a i r y _ p r o d u c t

a l i m e n t

s a l a m a n d e r

d i c o t

f i n c h

w a r b l e r

p a s s e r i n e

b i r d _ o f _ p r e y

g a l l i n a c e o u s _ b i r d

a n s e r i f o r m _ b i r d

c i c o n i i f o r m _ w a d i n g _ b i r d

g r u i f o r m _ w a d i n g _ b i r d

l i m i c o l i n e _ b i r d

a r c h o s a u r i a n

c o l u b r i d _ s n a k e

s n a k e

a q u a t i c _ m a m m a l

h o u n d

t e r r i e r

s p o r t i n g _ d o g

w o r k i n g _ d o g

d o m e s t i c _ d o g

c a n i n e

f e l i n e

c a r n i v o r e

b a t

r o d e n t

h o r s e

r u m i n a n t

m o n k e y

p r i m a t e

e l a s m o b r a n c h

c y p r i n i f o r m _ f i s h

s o f t - f i n n e d _ f i s h

t e l e o s t _ f i s h

f i sh

v e r t e b r a t e

a r a c h n i d

c r u s t a c e a n

h y m e n o p t e r o u s _ i n s e c th o m o p t e r o u s _ i n s e c t

m o t h

b i v a l v e m o l l u s k

i n v e r t e b r a t e

a n i m a l

a n i m a l _ d i s e a s e

a n i m a l _ p r o d u c t

a n i m a l _ s k i n

m u s c l e

n e r v o u s _ t i s s u e

a p p a r a t u s

o v e r g a r m e n t

g a r m e n t

h a t

h e a d d r e s s

c l o t h i n g

e x t r e m i t y

a p p l i a n c e

a p p r a i s a l

t r a c t

n a t i o n a l _ c a p i t a l

s t a t e _ c a p i t a l

c a p i t a l

c i t y

g u n

w e a p o n r y

a r r a n g e m e n t

s c u l p t u r e

m u s i c a l _ c o m p o s i t i o n

m u s i c

a r t

p i e c e _ o f _ c l o t h

f ab r i c

n e e d l e w o r k

r e m e d y

b a g

b o x

c a s e

bo t t l e

v e s s e l

c o n t a i n e r

m o t o r _ v e h i c l e

b o a t

w a r s h i p

w h e e l e d _ v e h i c l e

v e h i c l e

c o n v e y a n c e

c o a t i n g

c o v e r i n g

c o n t r o l

p e r c u s s i o n _ i n s t r u m e n t

s t r i n g e d _ i n s t r u m e n t

w i n d _ i n s t r u m e n t

t i m e p i e c e

s c i e n t i f i c _ i n s t r u m e n t

c o n d u c t o r

e l e c t r o n i c _ d e v i c e

l i gh t

m o t o r m a c h i n e

m e m o r y _ d e v i c e

o p t i c a l _ d e v i c e

f a s t e n e r

r e s t r a i n t

e l e c t r o n i c _ e q u i p m e n t

s p o r t s _ e q u i p m e n t

s e a t

f u rn i t u re

c o o k i n g _ u t e n s i l

c r o c k e r y

t a b l e w a r e

u tens i l

b a r

s p o r t s _ i m p l e m e n t

b r i d g e

s h e l t e r

h o u s i n g

bu i l d i ng

p l a c e _ o f _ b u s i n e s s

b a r r i e r

b e a m

to i l e t r y

d e c o r a t i o n

t ube

c o n d u i t

r o a d

w a y

p l a y t h i n g

s i d e

s t r i p

b u i l d i n g _ m a t e r i a l

A s c o m y c e t e

a s s e m b l y

s h a r e

m a t e r i a l _ r e s o u r c e

C o m p o s i t a e

a t m o s p h e r i c _ p h e n o m e n o n

p o l i t i c a l _ o r i e n t a t i o n

o r i e n t a t i o n

a t t i t u d e

p e r c e p t i o n

T r i c h o l o m a t a c e a e

B a s o m y c e t e

F a g a c e a e

s p e r m a t o p h y t e

P o l y p o d i a c e a e

p t e r i d o p h y t e

h e r bp o i s o n o u s _ p l a n t

v i n e

f r u i t _ t r e e

t r e e

s h r u b

w o o d y _ p l a n t

p l a n tm o n e r a n

i m a g i n a r y _ b e i n g

d o c t r i n e

be l i e f

b e v e r a g e

p h y l u m

c l a s s

C h e n o p o d i a c e a e

C r u c i f e r a e

G r a m i n e a e

L i l i a c e a e

P i n a c e a e

f a m i l y

f i s h _ g e n u s

r e p t i l e _ g e n u s

m a m m a l _ g e n u s

m o l l u s k _ g e n u s

p l a n t _ g e n u s

g e n u s

t a x o n o m i c _ g r o u p

b l i gh t

b o d y _ c o v e r i n g

l i q u i d _ b o d y _ s u b s t a n c e

r i v e r

m a i n

b o d y _ o f _ w a t e r

e x t e r n a l _ b o d y _ p a r t

t ube

b o n e b o d y _ p a r t

o v u l e

l e a f

p l a n t _ o r g a n

b u s i n e s s

c a t e g o r y

c e l e s t i a l _ b o d y

ce l l

c h a r g ec h a r g e

p r o t e i n

o r g a n i c _ c o m p o u n d

s a l t

c o m p o u n d

C h r i s t i a n _ c h u r c h

r e l i g i o n

c l i m a t e

s y m p t o m

w o r d

c o n c e p t

p l a n

m a t h e m a t i c s

m e d i c i n e

p h y s i c s

s c i e n c e

h u m a n i s t i c _ d i s c i p l i n e

i n f e c t i o u s _ d i s e a s e

g a m e

r a c e

c o n t e s t

s a u c e

c o n d i m e n t

d i s e a s e

i l l _ h e a l t h

g e m

p r o c e s s i n g

d o c u m e n t

i s l a n d

s c h o o l

n a t u r a l _ e l e v a t i o n

h a p p i n e s s

s a d n e s s

f e e l i n g

e n t e r p r i s e

s e c r e t o r y _ o r g a n

r o c k

g e o l o g i c a l _ f o r m a t i o n

f ru i t

l e g u m e

g r e e n sv e g e t a b l e

h e r b

f l a v o r e r

f o o d s t u f f

f o o d

r o c k

f i be r

n a t u r a l _ r e s i n

r e s i n

w o o d

i ns t i t u t i on

w a t e r

l i qu id

m i l i t a r y_un i t

f o r c e

m u s i c a l _ o r g a n i z a t i o n

o r g a n i z a t i o n

s o c i a l _ g r o u p

l i ne

r o u n d _ s h a p e

g a i n

g r o u p

l i ne s u r f a c e

p r o c e s s

p a y m e n t

p l a s t i c

v i s u a l _ c o m m u n i c a t i o n

p e r c o i d _ f i s h

w o r k m a ng o d d e s s

r a t e

d i p t e r o u s _ i n s e c t

b e e t l e

d i v i s i o n

u n s k i l l e d _ p e r s o n

o r d e r

g r a p h i c _ a r t

s a n i t a r y _ c o n d i t i o n

g e o g r a p h i c a l _ a r e a

d i s t r i c t

Figure4.5: A pieceof the category network. The groupingalgorithmfinds relatednessbetweencategoriesthatarenearoneanotherin WordNet(e.g.,the food terms)aswell ascategoriesthatarefar apart(e.g.,“sportsequipment”with “athlete”).

Page 37: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 93

simultaneously. This is anareafor futurework.

Page 38: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 92

power, discussestheir role asworking women,anddescribesthe benefitsof college life.The secondarticle is a 77-sentencepopularsciencemagazinearticleaboutthe Magellanspaceprobeexploring Venus.Whenusingthesuper-categories,thelabeleravoidsgrosslyinappropriatelabelssuchas“mollusk genus”and“goddess”in the Magellanarticle,andcombinescategoriessuchas“layer”, “natural depression”,and“rock stone”into theonesuper-category “land terrafirma”.

Lookingagainatthelongertextsof theUnited States Constitution andGenesis weseeinFigure4.7thatthesuper-categoriesaremoregeneralandlessredundantthanthecategoriesshown in Table4.4. (Althoughthehigh scoresfor the“breads”category seemsincorrect,even thoughthe term“bread” occurs25 timesin Genesis.) In somecasestheusermightdesiremorespecificcategories;thisexperimentsuggeststhatthelabelercangeneratetopiclabelsatmultiple levelsof granularity.

Section4.4evaluatestheresultsof assigningtopicsbasedon thesupercategories;how-everwe havenot rigorouslycomparedthesupercategoriesagainsttheoriginal categories.

4.6 Conclusions

Thischapterhaspresentedanalgorithmthatautomaticallyassignsmultiplemaintopiccategories to texts, basedon computingthe posteriorprobability of the topic given itssurroundingwords,withoutrequiringpre-labeledtrainingdataor heuristicrules.Thealgo-rithmsignificantlyoutperformsabaselinemeasureandapproachesthelevelsof inter-indexerconsistency displayedby nonprofessionalhumanindexers. Thechapteralsodescribestheconstructionof a generalcategory set from a hand-built lexical hierarchy. The structureof the WordNethyponym hierarchyis large and uneven; the bracketingalgorithm pro-videsa simpleandeffective way to automaticallysubdivide it. The algorithmthat usesWordSpaceto combinedistantpartsof the hierarchyis partially effective, but requiresamanualpostprocessingpass.

Thecategorizationalgorithmis effectiveontextsthathavestrongthematicdiscussions,but many kinds of improvementsandalternativesremainto be explored. If a documentcontainstermswhicharemembersof smallcategories,orcategorieswhosetermsoccuronlyrarely, thenthealgorithmerroneouslyassignstoo muchweight to theserarersenses.Ananalysisof thetermswhoseweightsaremoststronglyassociatedwith eachcategorywouldbeusefulfor analyzinghow to fix this problem.Finally, becausethegoalof thealgorithmis to allow assignmentof multiple categoriesto documents,in the casesin which severalcategorieshave significantoverlapin meaning,e.g.,reptilesandbirds, thealgorithmtendsto assignbothcategoriesto thedocument,eventhoughahumanindexerusuallywouldnot.

Fisher(1994)hasperformeda seriesof experimentsthat comparevariationsof thisalgorithm. Preliminaryresultsindicatethat usingdirect countsof category membershipcanimprove theresults.

It would be interestingto try the training loop ideain which the outputof TextTilingis usedasinput to thecategory trainingalgorithm,andsoon, improving bothalgorithms

Page 39: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 91

thatthey mutuallyrankhighly. To determinethatcloseassociationis mutualbetweentwocategories,we checkfor mutualhigh ranking. Thuscategory

and ! aregroupedtogether

if andonly if

ranks! highly and ! ranks

highly (where“highly” wasdeterminedby acutoff value–

and! hadto beranked" or abovewith respectto eachother, for a threshold" ).

Theresultsof thisalgorithmarebestinterpretedviaagraphicallayout.Figure4.5showsa pieceof a networkcreatedusinga presentationtool (Amir 1993)basedon theoreticalworkbyFruchtermann& Rheingold(1990).Theunderlyingalgorithmusesaforce-directedplacementmodelto layoutcomplex networks(edgesaremodeledassprings;nodeslinkedby edgesareattractedto eachother, but all other pairsof nodesare repelledfrom oneanother).

In thesenetworksonly connectivity hasmeaning;distancebetweennodesdoesnotconnotesemanticdistance. The connectivity of the network is interestingalsobecauseit indicatesthe interconnectivity betweencategories. From Figure4.5, we seethat cate-goriesassociatedwith thenotionsports, suchasathleticgame,race,sportsequipment,andsportsimplement,have beengroupedtogether. Athleticsis linked to vehicleandcompeti-tion categories;thesein turn link to military vehiclesandweaponrycategories,whichthenleadin to legalcategories.

The networkalso shows that categoriesthat are specifiedto be nearone anotherinWordNet,suchasthecategoriesrelatedto bread, arefoundto becloselyinterrelated.Thisis usefulin casewewould like to begin with smallercategories,in orderto eliminatesomeof thelarge,broadcategoriesthatwearecurrentlyworkingwith.

Most of theconnectivity informationsuggestedby thenetworkwasusedto createthenew categories.However, many of thedesirablerelationshipsdonotappearin thenetwork,perhapsbecauseof therequirementfor highly mutualco-ranking.If we wereto relaxthisassumptionwe mayfind bettercoverage,but perhapsat thecostof moremisleadinglinks.The remainingassociationsweredeterminedby hand,so that the original 726 categorieswerecombinedinto 106new super-categories.

4.5.4 Revised Topic Assignments

Thesuper-categoriesareintendedto grouptogetherrelatedcategoriesin orderto elim-inatetopicalredundancy in thelabelerandto helpeliminateinappropriatelabels(sincethecategoriesarelargerandsohavemorelexical itemsservingasevidence).Thusthetopfouror fivesuper-categoriesshouldsuffice to indicatethemaintopicsof documents.

Figure4.6comparestheresultsof the labelerusingtheoriginal categoriesagainstthesuper-categories. Thenumbersbesidethecategory namesarethescoresassignedby thealgorithm;thescoresin bothcasesareroughlysimilar. It is importantto realizethatonly thetop four or five labelsareto beusedfrom thesuper-categories;sinceeachsuper-categorysubsumesmany categories,only a few super-categoriesshouldbeexpectedto containthemost relevant information. The first article is a 31-sentencemagazinearticle, publishedin 1987, takenfrom Morris (1988). It describeshow Soviet womenhave little political

Page 40: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 90

aretoofine-grainedto indicatemaintopic information.In anearlierimplementationof thisalgorithm,thecategorieswerein generallargerbut

lesscoherentthanin thecurrentset.Thelargercategoriesresultedin better-trainedclassi-fications,but theclassesoftenconflatedquitedisparateterms.Thecurrentimplementationproducessmaller, morecoherentcategories.Theadvantageis thatamoredistinctmeaningcan be associatedwith a particularlabel, but the disadvantageis that in many casessofew of thewordsin thecategory appearin the trainingdatathata weakmodelis formed.Then the categorieswith little distinguishingtraining datadominatethe labeling scoresinappropriately.

In thecategory-derivationalgorithmdescribedabove, in orderto increasethesizeof agivencategory, termsmustbetakenfromnodesadjacentin thehierarchy(eitherdescendantsor siblings).However, adjacenttermsarenot necessarilycloselyrelatedsemantically, andsoafterapoint,expandingthecategoryvia adjacenttermsintroducesnoise.To remedythisproblem,WordSpaceis usedto determinewhichcategoriesaresemanticallyrelatedto oneanother, despitethefact thatthey comefrom quitedifferentpartsof thehierarchy, sotheycanbecombinedto form schema-likeassociations.

4.5.3 Combining Distant Categories

To find whichcategoriesshouldbeconsideredclosestto oneanother, wefirst determinehow closethey arein WordSpace(Schutze1993b)andthengroupcategoriestogetherthatmutually rankedoneanotherhighly.6 WordSpaceis a corpus-basedmethodfor inducingsemanticrepresentationsfor a largenumberof wordsfrom lexical coocurrencestatistics.Themediumof representationis amulti-dimensional,real-valuedvectorspace.Thecosineof the anglebetweentwo vectorsin the spaceis a continuousmeasureof their semanticrelatedness.

First-degreeclosenessof two categories#%$ and #%& is definedas:

')( #%$+*,#-&/.10 12

12 # $ 232 # & 25467/8:9<; 46=>8:93?A@ (CBD * BE .where@ is:

@ (CBD * BE .F0 4 $ ( D $HG E $ . 2The primary rank of category

for category ! indicateshow closelyrelated

is to !

accordingto first-degreecloseness.For instancerank1 meansthat

is theclosestcategoryto ! , andrank3 meansthereareonly two closercategoriesto ! than

.

We definesecond-degreeclosenessfrom the primary ranks. Secondaryranking isneededbecausesomecategoriesareespecially“popular,” attractingmany othercategoriesto them;thesecondaryrankenablesthepopularcategoriesto retainonly thosecategories

6All work involving theWordSpacealgorithmwasdonein collaborationwith Hinrich Schutze. We aregratefulto RobertWilensky for suggestingcollaborationon thisproblem.

Page 41: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 89

United States Constitution Genesis0 assembly(court,legislature) deitydivinity god1 dueprocessof law relative relation(mother, aunt)2 legal documentlegal instrument worship3 administrative unit manadult male4 body(legislative) professional5 charge(taxes) happinessgladnessfelicity6 administratordecisionmaker womanadult female7 documentwritten document evildoing transgression8 approval (sanction,pass) literary composition9 powerpowerfulness religionistreligiousperson

Figure4.4: Outputusingoriginalcategoryseton two well-known texts.

then the procedureis called recursively on the child. Otherwise,the child is too smallandis left alone. After all of N’s childrenhave beenprocessed,the category that N willparticipatein hasbeenmadeassmallasthealgorithmwill allow. Thereis a chancethatN andits unmarkeddescendantswill now makea category that is too small,andif this isthe case,N is left alone,anda higher-up nodewill eventuallysubsumeit (unlessN hasnoparentsremaining).Otherwise,N andits remainingunmarkeddescendantsarebundledinto a category.

If N hasmore thanoneparent,N canendup assignedto the category of any of itsparents(or none),dependingon which parentwasaccessedfirst andhow many unmarkedchildrenit hadat any time,but eachsynsetis assignedto only onecategory.

The function “mark” placesthe synsetandall its descendentsthat have not yet beenenteredinto a category into a new category. Notethat#descendentsis recalculatedin thethird-to-lastline in caseany of thechildrenof N have beenenteredinto categories.

In the end theremay be isolatedsmall piecesof hierarchythat aren’t storedin anycategory, but thiscanbefixedby acleanuppass,if desired.

4.5.2 Assigning Topics using the Original Category Set

Using the 726 categoriesderived from WordNet, the category assignmentalgorithmproducestheoutputshown in Figure4.4 for two well-known texts (madeavailableonlineby ProjectGutenberg). The first column indicatesthe rank of the category, the secondcolumnindicatesthescorefor comparisonpurposes,andthethird columnshowsthewordsin thesynsetat thetop-mostnodeof thecategory(thesearenotalwaysentirelydescriptive,sosomeglossesareprovidedin parentheses).

Notethatalthoughmostof thecategoriesareappropriate(with theglaringexceptionof“professional”in Genesis), thereis someredundancy amongthem,andin somecasesthey

Page 42: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 88

for each synset N in the noun hierarchya_cat(N)

a_cat(N):if N has not been entered in a category

T <- #descendents(N)

if ((T >= LOWER_BRACKET)&& (T <= UPPER_BRACKET))

mark(N,NewCatNumber)

else if (T > UPPER_BRACKET)

for each (direct) child C of NCT <- #descendents(C)if ((CT >= LOWER_BRACKET)

&& (CT <= UPPER_BRACKET))mark(C,NewCatNumber)

else if (CT > UPPER_BRACKET)a_cat(C)

T <- #descendents(N)if (T >= LOWER_BRACKET)

mark(N,NewCatNumber)

Figure4.3: Algorithm for creatingcategoriesfrom WordNet’snounhierarchy.

Page 43: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 87

4.5.1 Creating Categories from WordNet

An algorithmis neededto decomposetheWordNetnounhierarchyinto asetof disjointcategories,eachconsistingof a relatively largenumberof synsets,creatingcategoriesof aparticularaveragesizewith assmalla varianceaspossible,whereeachcategory consistsof a relatively largenumberof synsets(this is necessaryfor thetext-labelingtask,becauseeachtopic mustbe representedby many differentterms). Thereis somelimit asto howsmallthisvariancecanbebecausethereareseveralsynsetsthathaveaverylargenumberofchildren(therearesixteennodes(synsets)with a branchingfactorgreaterthan100). Thisprimarily occurswith synsetsof a taxonymic flavor, i.e.,mushroomspeciesandlanguagesof theworld. Therearetwo otherreasonswhy it is not straightforwardto find uniformlysized,meaningfulcategories:

(i) Thereis no explicit measureof semanticdistanceamongthechildrenof a synset.

(ii) The hierarchyis not balanced,i.e., the depthfrom root to leaf variesdramaticallythroughoutthehierarchy, asdoesthebranchingfactor. (Thehierarchyhasten rootnodes;on averagetheirmaximumdepthis 10.5andtheirminimumdepthis 2.)

Reason(ii) rulesouta strategy of travelingdown auniformdepthfrom therootor upauniformheightfrom theleavesin orderto achieveuniformcategorysizes.

For thepurposesof thedescriptionof thisalgorithm,asynsetis anodein thehierarchy.A descendantof synsetN is any synsetreachablevia a hyponym link from N or any ofN’s descendants(recursively). This meansthat intermediate,or non-leafsynsets,arealsoclassifiedasdescendants.Theterm“child” refersto animmediatedescendant,i.e.,asynsetdirectly linked to N via a hyponym link, and “descendant”to indicatelinkage throughtransitiveclosure.

Thealgorithmusedhereis controlledby two parameters:upperandlower boundsonthe category size(seeFigure4.3). For example,the resultof settingthe lower boundto25 andthe upperboundto 60 yieldscategorieswith anaveragesizeof 58 members.AnarbitrarynodeN in thehierarchyis chosen,andif it hasnot yetbeenmarkedasa memberof acategory, thealgorithmchecksto seehow many unmarkeddescendantsit has.In everycase,if the numberof descendantsis too small, the assignmentto a category is deferreduntil a nodehigher in the hierarchyis examined(unlessthe nodehasno parents). Thishelpsavoid extremelysmallcategories,whichareespeciallyundesirable.

If thenumberof descendantsof N fallswithin theboundaries,thenodeanditsunmarkeddescendantsarebundledinto anew category, marked,andassignedalabelwhichis derivedfrom the synsetat N. Thus, if N andits unmarkeddescendantscreatea category with kmembers,thenumberof unmarkeddescendantsof theparentof N decreasesby k.

If N hastoo many descendants,that is, thecountof its unmarkeddescendantsexceedsthe upperbound,theneachof its immediatechildren is checkedin turn: if the child’sdescendantcount falls betweenthe boundaries,then the child and its descendantsarebundledinto acategory. If thechild andits unmarkeddescendantsexceedtheupperbound,

Page 44: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 86

betweendifferentsensesof homographs.Associatedwith eachsynsetis a list of relationsthatthesynsetparticipatesin. Oneof these,in thenoundataset,is thehyponymy relation(andits inverse,hypernymy),roughlyglossedasthe“ISA” relation.Thisrelationimposesahierarchicalstructureon thesynsets,indicatinghow to generalizefrom asubordinatetermto a superordinateone,andvice versa.5 This is a veryusefulkind of informationfor manytasks,suchasreasoningwith generalizationsandassigningprobabilitiesto grammaticalrelations(Resnik1992).

This lexicon mustbe adjustedin two waysin orderto facilitate the label assignmenttask. Thefirst is to collapsethefine-grainedhierarchicalstructureinto a setof coarsebutsemantically-relatedcategories.Thesecategorieswill provide thelexical evidencefor thetopic labels. (After the label is assigned,the hierarchicalstructurecanbe reintroduced.)Oncethe hierarchyhasbeenconvertedinto categories,the categoriescanbe augmentedwith new lexical itemsculledfrom freetext corpora,in orderto furtherimprovethelabelingtask.

Thesecondwaythelexiconmustbeadjustedis to combinecategoriesfrom distantpartsof thehierarchy. Of particularinterestaregroupingsof termsthatcontributeto a frameorschema-likerepresentation(Minsky 1975); this canbe achieved by finding associationallexical relationsamongthe existing taxonymic relations. For example,WordNethasthefollowing synsets: “athletic game” (hyponyms: baseball,tennis), “sports implement”(hyponyms: bat,racquet),and“tract, pieceof land” (hyponyms: baseballdiamond,court),noneof which areclosely relatedin the hierarchy. We would like to automaticallyfindrelationsamongcategoriesheadedby synsetslike these. (In Version1.3, the WordNetencodershave placedsomeassociationallinks amongthesecategories,but still only someof thedesiredconnectionsappear.)

In otherwords,links shouldbederivedamongschematicallyrelatedpartsof thehierar-chy, wheretheselinks reflectthetext genreonwhichtext processingis to bedone.Schutze(1993b)describesa method,calledWordSpace,that representslexical itemsaccordingtohow semanticallyclosethey areto oneanother, basedon evidencefrom a large text cor-pus. To createstructuredassociationalinformation,the term-similarityinformationfromWordSpaceis combinedwith the category informationderived from WordNet to createschema-likesuper-categories.

The next subsectiondescribesthe algorithm for converting the WordNet hierarchyinto a setof categories.This is followed,in subsection4.5.2by a discussionof how thesecategoriesaretobeusedandwhythey needtobeimproved.Subsection4.5.3describeshowWordSpaceanbeusedto bring disparatecategoriestogetherto form schematicgroupingswhile retainingthegivenhierarchicalstructure.

5Actually, thehyponomyrelationis a directedacyclic graph,in thata minority of thenodesarechildrenof morethanoneparent.I will at timesreferto it asahierarchynonetheless.

Page 45: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 85

surprisingsinceanimaltermswill tendto occurin similar contexts in training,andsincethecategorieswerenot trainedto bemutuallyexclusive.

Anotherway to measuretheresultsis to determinehow oftentheprogramassignsthemostimportantcategories.Overall, theprogram’sperformancewasstrongwith respecttochoosinghighest-rankedcategories.

If amajority( I 3)of thejudgesagreeonthetop-rankedcategory, thiscategoryiscalledthemajority top choice.A category thatis rankedhighestby at leastonejudgeis referredto asa minority top choice. Eight out of thetendocumentshadmajority top choices.Ofthese,in four cases,theprogram’s top choicewasthe majority top choice. In two cases,the program’s top choicematcheda minority top choice. In oneof the remainingcases,theprogramwasoff-by-one,rankingthemajority top choicesecond,andin theother, themajority topchoicewastheprogram’seighthchoice.In thetwo remainingcasesin whichnomajority existed,theprogram’s topchoicematcheda minority topchoice.

4.5 Creating Thesaural Categories

Recently, much effort hasbeenappliedto the creationof lexicons and the acquisi-tion of semanticand syntacticattributes of the lexical items that comprisethem, e.g,Alshawi (1987),Calzolari& Bindi (1990),Grefenstette(1992),Hearst(1992),Markowitzet al. (1986),Pustejovsky (1987),Schutze(1993a),Wilks et al. (1990).However, a lexiconasgivenmaynotsuit therequirementsof aparticularcomputationaltask.Lexiconsareex-pensiveto build; therefore,it is preferableto adjustanexistingoneto meetanapplication’sneedsover creatinga new onefrom scratch.This sectiondescribesa way to addassoci-ationalinformationto a hierarchicallystructuredlexicon in orderto createthesaurus-likecategoriesusefulfor thetopicassignmenttask.3

Oneway to labeltexts, whenworkingwithin a limited domainof discourse,is to startwith a pre-definedsetof topicsandspecifythe word contexts that indicatethe topicsofinterest,asin Jacobs& Rau(1990). Anotherway, assumingthata largecollectionof pre-labeledtexts exists, is to usestatisticsto automaticallyinfer which lexical itemsindicatewhich labels,as in Masandet al. (1992). In contrast,the goal hereis to assignlabelsto general,domain-independenttext, without benefitof pre-classifiedtexts. In all threecases,a lexicon thatspecifieswhich lexical itemscorrespondto which topicsis required.Thetopic labelingmethodof this chapteris statisticalandthusrequiresa largenumberofrepresentative lexical itemsfor eachcategory.

Becausea good,large,onlinepublic-domainthesaurusis not currentlyavailable,thissectiondescribesawayto deriveonefrom ahierarchicallexicon. Thestartingpoint for thethesaurusis WordNet(Miller et al. 1990),which is readilyavailableonlineandprovidesa large repositoryof Englishlexical items. WordNet4 is composedof synsets, structurescontainingsetsof termswith synonymousmeanings,thusallowing adistinctionto bemade

3Muchof thework in thissectionappearedin a similar form in Hearst& Schutze(1993).4All work describedherepertainsto Version1.3of WordNet.

Page 46: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 84

assignmentsis amorestringenttestof its accuracy.)Thesetablesalsolist two rowsof scoresfor theprogram.As above,Algorithm-5shows

theinter-indexerconsistency whenonly thetop five categorieschosenby theprogramareusedin the comparison,thusmakinga fair comparisonagainstthe scoresof the judges.Algorithm-7 shows thepercentageagreementwhenthetop sevencategoriesgeneratedbytheprogramareusedin thecomparisonto the judge’s top five categories,thusimprovingrecall at the expenseof precision. Although this numbercannotbe directly comparedagainstthe scoresof the humanjudges,it doesshow that if the algorithmis allowed toincludea few extracategories,it will indeedbring in morerelevantcategories.

judge A.08 C.10 F.32 J.11 J.21A 0.60 0.55 0.55 0.65 0.60B 0.40 0.30 0.60 0.60 0.45C 0.40 0.55 0.55 0.60 0.35D 0.45 0.50 0.60 0.60 0.60E 0.55 0.40 0.50 0.55 0.40

Average 0.48 0.46 0.66 0.60 0.48

Algorithm-5 0.44 0.48 0.36 0.20 0.48Algorithm-7 0.64 0.56 0.48 0.20 0.60

Table4.3: Inter-indexerconsistency scoresfor eachjudgeoneachdocumentin groupA.

judge B.04 E.25 J.10 J.15 J.35F 0.50 0.65 0.45 0.65 0.60G 0.45 0.45 0.45 0.60 0.45H 0.60 0.55 0.55 0.60 0.65I 0.50 0.70 0.50 0.60 0.55J 0.35 0.55 0.45 0.55 0.35

Average 0.48 0.60 0.48 0.60 0.52

Algorithm-5 0.44 0.40 0.28 0.60 0.24Algorithm-7 0.44 0.48 0.48 0.72 0.52

Table4.4: Inter-indexerconsistency scoresfor eachjudgeoneachdocumentin groupB.

Looking more carefully at the tables,we seethe algorithm performedmost poorlyon documentsJ.10,J.11,andJ.35whenrestrictedto the top five categories. A similarproblemoccurredwith bothJ.10andJ.11. Thetop-rankedcategory for J.10for both theprogramandthe indexersis bugs/insects. Similarly, the top-rankedcategory for J.11forboththealgorithmandfour of theindexersis reptiles/amphibians. In bothcases,thejudgesmarkedaslessimportantothercategoriessuchasmeasureandscience. By contrast,in bothcasesthealgorithmlists only otheranimalcategories.Uponreflectionthis behavior is not

Page 47: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 83

Thealgorithmis requiredto choose" categories,andits choicesarecomparedagainstthoseof onejudge,who is assumedto becorrectin all " choices.Thenumberof waystochoose

categoriescorrectlyoutof " choices,fromasetof J categorieswithoutreplacement

(if orderdoesnotmatter),is KML $ON KOPFQRLLSQ $TN .

When

categoriesareidentifiedcorrectly, thepercentagecorrectis %U " ; thereforethe

expectedpercentagecorrectwhen comparingagainstone judge for a randomcategoryassignmentis VXW;ZY

1

$L KTL $[N K[PFQ1LLSQ $[N K[P L N .

In this experiment, "\0 5 and J]0 106. Substitutingin thesevalueswe determinethat the expectedpercentcorrectfor a randomchoosingprocessis 5%. The varianceis^_(a` #cb:dfe 2 . G (g^h(g` #/bidfeC.%. 2 0 0 j 01 G j 0025 0kj 0075.

Sincewehavefiveindependentcomparisons,theaveragescore(themeanof theaverage)is thesumof thefive meansdividedby five, or themeanof any one(0 j 05). Thevarianceof theaverageis thesumof thevariancesdividedby " 2, 0.0015.Thuswe have aGaussianrandomvariablewith mean0 j 05andvariance0 j 0015.Thismeansascoregreaterthan13%(two standarddeviationsgreaterthanthe mean)happenslessthan5% of the time if thecategoriesarechosenat random.

Table4.2presentssummarydatafor the judgesandtwo waysof scoringtheoutputofthealgorithm.Thejudges’averageconsistency scoreis 54%;thealgorithmwhenrestrictedto its top 5 choiceshasa consistency scoreof 39% andwhenallowed to presentits top7 choiceshasan averagescoreof 52%. Thusour resultsperformmuchbetterthanthebaseline,sinceonaveragethealgorithmmatches39%(1 j 96

U5) of thejudges’choices.

Averagefor Averagefor AverageforJudges Algorithm-5 Algorithm-7

GroupA 0.54 0.39 0.50GroupB 0.54 0.39 0.53Average 0.54 0.39 0.52

Table4.2: Overall inter-indexer consistency scores.Algorithm-5 indicatesthe scoreforthe algorithm’s five top-rankedcategories,while Algorithm-7 indicatesthe scorefor thealgorithm’stop sevencategories.

Tables4.3and4.4presenttheseresultsin moredetail. Table4.3shows thepercentageof inter-indexeragreementfor eachdocumentandfor eachjudgein groupA, aswell astheaverageconsistency overall judgesfor eachdocument.Table4.4showsthecorrespondinginformationfor groupB.

Thesetablesindicatethepercentageagreementbetweentheprogram’sscoresandthoseof the judges. Note that whencomparinga judgeagainstthe other judges,comparisonsaremadeagainstfour otherrankings,but whencomparingtheprogramagainstthejudges,comparisonsare madeagainstall five judges. (Including the program’s scoreswhencomparingjudgesagainstjudgeswould bias the comparisonto favor the programbygiving its assignmentsequalweight to the judges’assignments.Excludingtheprogram’s

Page 48: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 82

4.4.3 Analysis of Results

Theinter-indexeragreementwascomputedfor eachjudgeandeachdocument;that is,the averagenumberof category assignmentsthe judgemadein commonwith the otherjudgesfor a particulardocument.More formally, if thereare l judgesmaking " choicesfor eachdocument,

` #/b:dfe ( !:mn.10poq 41 r $tsu & rwv # ( !:mx* mc.zy{ Uw( l G 1.>|}"

where ! m is the list of five categoriesassignedby judge ! to document@ , and # ( ! m * m . isthe numberof categoriesassignedto document @ by both judge ! and judge

. (This

is equivalentto taking theaverageof thepairwisescores.)Note that for this calculation,relative rankingof categoriesis not takeninto account.

Table4.1 shows thecategorieschosenby the judgesandthealgorithmfor two of thetestdocuments.The labelingof documentA.08 hadhigh inter-indexer consistency bothamongjudgesandthealgorithm.For documentE.25,thealgorithmdid not rankmedicine,thejudges’highestcategory, in its top five (rather, it wasrankedeighth),althoughthereisstrongagreementamongthe otherterms;this wasan exceptionalcase(seebelow). Thedocumentin questiondiscussesresearchadvanceson technologyto beusedin a medicalcontext.

judgeA judgeB judgeC judgeD judgeE Algorithm33 government 34 politics 102actions 34 politics 33 government 33government32 legalsystem 33 government 104happening 33 government 34 politics 36finance34 politics 37 work 34politics 104happening 36 finance 32 legal system36 finance 102actions 37work 06 cities 32 legal system 35commerce37 work 36 finance 59 information 29 conflict 29 conflict 29conflict

judgeF judgeG judgeH judgeI judgeJ Algorithm27 medicine 27medicine 27 medicine 27medicine 25body process 87 light02 measure 44technology 44 technology 02measure 27medicine 44technology45 electronics 45electronics 02 measure 44 technology 44technology 45electronics44 technology 52science 87 light 45electronics 45electronic 53physics99 defense 71cell biology 26 body parts 26body parts 100stuff 66machines

Table4.1: Category assignmentsto two documents(A.08 andE.25)by humanjudgesandby thealgorithm.

Oneway to evaluatetheresultsof analgorithmis to compareits performanceagainsta baseline. In this case,we computethe expectedinter-indexer consistency scoreof analgorithm that choosesfrom the category set at random. This baselineis computedasfollows, if we arenot concernedwith relative ranking of categories. The model is oneof chosing " categorieswithout replacementfrom a set of J uniquecategories,whereeachchoiceof category is independentfrom the previous andsubsequentchoices. Theunderlyingdistribution is assumedto behypergeometric(samplingwithout replacement).

Page 49: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 81

� lengthy, but short enoughfor humanjudgesto skimseveralof themin areasonableamountof time,� generalin subjectmatterin orderto matchthethetestcategoryset,� variedin termsof maintopic subjectmatter, so thatavarietyof categorieswill beassigned,and� publically accessibleto facilitate comparisonagainstothercategorizationmethods.

TheBrown Corpussatisfiesthesecriteria; this experimentusesthefirst 300articles.1

Eachdocumentis approximately190 lineslong (or 2030words,on average)andin mostcasesconsistsof anunbrokenstreamof text.2 Thetexts of thedocumentsarecut off afterthefirst 190lines,soin mostcasesreadersdonotseetheentiretext.

4.4.2 The Experiment

Out of these300articles,10 werechosenat random.The10 texts wereseparatedintotwo groups(labeledA andB) of 5 texts each,in orderto reducethe readingloadon thejudges.Eachjudgewasgiventhelist of 106categoriesandthefive texts from eithergroupA or groupB, andthefollowing instructions:

I’d like you to look over the categories briefly, and then read quickly or skimeach text. Each time after you read a text, look at the category list again andchoose the five best categories to describe the text’s main topic(s). List thecategories in ranked order, with best first. Use the category number, and pleaseinclude at least the beginning of the category name so I know you didn’t put thewrong number by accident. The text name occurs at the beginning of each file.

Thejudgesdid notknow thattheir rankingswouldbecomparedagainstthosegeneratedbya computerprogram.

Tensetsof judgmentswerecollected;fivefor eachgroupof texts. Thereisdisagreementin theliteratureabouthow to computeinter-indexerconsistency (e.g.,Rolling (1981),Hen-zler(1978)),however, in mostcasesthis is donein apairwisemanner. Weareinterestedinhow closelytheprogrammatchesthehumanjudgmentsonaverage.

1DocumentnumbersA.01-F.48,H.01-H.30,J.01-J.80.2A few documentsconsistedof severaldistinctarticlescombined,thefirst blendingdirectly in to thenext.

Page 50: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 80

example,Masandet al. (1992), Jacobs(1993), and Hayes(1992) usehuman-assignedlabelsbothfor trainingandjudgingtheirclassificationsystems.

Whenthereexistsalargetrainingbaseof examples,it is saferto assumethatcomparingagainstonejudgmentperdocumentis accurate.However, becauseno largetrainingsetisavailablefor this task,andbecauseinter-indexerconsistency tendsto below (Bates1986),a betterevaluationmetric is to comparethe inter-indexer consistency of the algorithmwith that of humanjudges. (Inter-indexer consistency is the averagenumberof categoryassignmentsa judgemakesin commonwith the otherjudgesfor a particulardocument.)Furthermore,Cooper(1969)showsthatin arestrictedcaseat least,increasedinter-indexerconsistency leadsto increasedexpectedsearcheffectiveness,andRolling (1981)providesmoresupportingevidenceto thiseffect.

FollowingGaleet al. (1992a),theperformanceof thealgorithmisevaluatedagainstbotha lowerboundandanupperbound.Thelowerboundrepresentstheminimalperformancethatany algorithmoughtto beableto surpass.Often this boundaryis whatwould resultif an algorithm alwaysmadethe most likely choice, e.g., for a part-of-speechtagger,a lower boundmight be the percentagecorrectobtainedby alwaysassigningthe mostlikely part-of-speechcategory for eachword. Useful lower boundsarenot alwayseasilyformulated;sometimesan algorithm’s resultsshould just be comparedagainstwhat analgorithmmaking randomassignmentswould produce(surprisingly, it is not infrequentthat proposedalgorithmsdo not performmuchbetterthanchance). Sinceno priors oncategory assignmentsareavailablefor evaluationof the category setdescribedhere,thelower boundor baselinein this evaluationis the performanceof an algorithm makingrandomchoices.Often in computationallinguisticsalgorithmstheupperboundis thatofhumanperformance;thealgorithmshouldnotbeexpectedto dobetterthanahumanwouldona taskwith agoalof matchinghumanintuitions. In thisevaluation,humaninter-indexerindexing, asdescribedabove, is theupperboundfor evaluation.

4.4.1 The Test Set

Thetexts usedin theevaluationexperimentswerechosento satisfyseveraldesiderata.They are:

Page 51: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 79

which termsoutsideof thosealreadyspecifiedaregoodindicatorsof thecategory. Riloff& Lehnert(1992)usea pre-labeledtrainingsetto extractdefiningfeaturesfrom a frame-like representationfor eachdocument,in a framework in which a single phrasecan beenoughto indicatethepresenceof a category. Jacobs(1993),in a framework combiningknowledge-basedandstatisticalinformation,exploresseveraldifferentwaysto determinegood indicators,including weighting termsaccordingto a mutual informationstatistic,usingexceptionlists, andfinding termsthat tendto surroundcategory termsbut arenota part of the category themselves. Fung et al. (1990)usinga probabilisticnetwork forcategorization,requiresusersto selectwhich featuresfrom a setindicatethe relevanceoftrainingdocumentsandthenautomaticallydeterminestheweightstoplaceonthelinks. Thedisadvantageto all theseapproachesis thatthey requirepre-labeledtextsor userjudgmentsfor thetrainingstep.

Many algorithmshave beendevelopedthat useco-occurrenceinformationfor deter-mining category membershipor to build thesaurusclasses.Crouch(1990),Grefenstette(1992),Salton(1972),Sparck-Jones(1986),andRuge(1991)all useco-occurrenceinfor-mationderivedfrom corporato determinehow to expandquerieswith relatedterms,andshow that this informationcan improve retrieval. (But seePeat& Willett (1991) for acriticism of this kind of approach.) Deerwesteret al. (1990) useco-occurrencetermsamongdocuments(compressedwith multivariantdecomposition)to determinesemanticrelatednessamongdocuments.Co-occurrenceinformationhasbeenfoundto beusefulforavarietyof tasksin computationallinguisticsaswell (e.g.,Church& Hanks(1989),Smadja& McKeown (1990),Justeson& Katz(1991)).

An advantageof the Yarowsky weightingschemeis that it usesco-occurrenceinfor-mation to classify termsinto pre-defined,intuitively understandableclasses,asopposedto classesderived from the data. Although categoriesor classesderived from dataareusefulfor many kindsof applications,intuitive categoriesmaybemoreappropriatewheninterfacingbetweenthesystemandtheuser. This suppositionis visited in moredetail inChapter5.

Anotheradvantageof thealgorithmis thatit canaccommodatemultiple category sets.Categorizationalgorithmsbasedonclusteringcanonly presentoneview onthedata,basedon the resultsof theclusteringalgorithm,but asshown above, documentsmaybesimilaron only oneout of several main topic dimensions.Algorithms that train on pre-labeledtexts canalsorepresentmultiple simultaneouscategories,but areconfinedto usingonlythecategorysetsthathavebeenpre-assigned(sincein mostcasesthousandsof pre-labeleddocumentsarenecessaryto train thesealgorithms).

4.4 Evaluation of the Categorization Algorithm

A commonway to evaluatea categorizationalgorithmis to compareits labelingswiththoseassignedby humancategorizers. For sometestcollectionsa “correct” setof labelsalreadyexists, and the program’s resultscan be measureddirectly againstthese. For

Page 52: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 78

Tiling

Training forCategorization

coherent textunits

word counts

categorizedwords

Figure4.2: A proposedtrainingloop: TextTiling, usingonly termrepetitioninformation,providescontext window informationfor thecategorizationalgorithmwhich thensuppliestermcategoryinformationtoimprovetheresultsof TextTiling, whichaidsin thereestimationof thepriorsfor thecategorizationalgorithm,andsoon.

Paik (1992)useSubjectCodeassignmentsfrom theLDOCE dictionary, creatingin effecta setof generalcategories.Thealgorithmpresentedheremakesaprobabilisticestimateofthe likelihood of a category given the termsthat occur. In contrast,the systemof Liddy& Paik (1992)usesheuristicsto determineword sensesbasedon how many wordsthatcanbe assigneda particularcodeoccurin a sentence,aswell ashow likely it is for thecandidatecodesin thesentenceto co-occur. Thusit alsodoesnot requirepre-labeledtextsbut it doesrequirea largenumberof wordsto havebeenassignedto categoriesin advance.The categorizationalgorithmdescribedherealso requiressometermsto be assignedtoeachcategory in advance,but it automaticallychoosesadditionaltermsfrom the corpusto act as strongindicatorsfor eachcategory. Thus it shouldbe moreadaptableto newcategorysets,thatis, category setsthatcharacterizespecializeddomainssuchasacademiccomputerscience.It wouldbeusefulto runanexperimentcomparingtheresultsof thetwoalgorithms.

Othercategorizationalgorithmsalsodealwith the issueof choosingsalientfeatures.Lewis (1992) definesfeature selection as the processof choosing,for eachcategory, asubsetof termsfrom an indexing languageto be usedin predictingoccurrencesof thatcategory. Heusesamutualinformationstatisticwithin aprobabilisticframework,choosingthe highestscoringtermsfor eachcategory to act as indicatorsfor the presenceof thatcategory. Thisapproachto termweightingis themostsimilar to thatdescribedhere.

Many knowledge-basedclassificationsystemsalso recognizethe needto determine

Page 53: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 77

alsoweightsthetermsin this manner),thusrequiringtwo passesthroughthetrainingdata.Insteadof smoothingestimatesfor infrequentterms,this implementationsimply excludesinfrequentterms(termswhosefrequency is lessthana threshold)from contributing ev-idence. This is donebecauseit is unlikely that a very infrequentterm will be able toprovidereliableevidencefor thecategory, andthusit is excludedfrom makingany kind ofcontributionatall.

In thecurrentimplementationof thealgorithmatermwindowsurroundinganinstanceofawordthatisamemberof acategoryiscountedequallyasevidencefor all of thecategoriesof which theword is a member. However, it would be interestingto incorporatethis anddeterminewhetheror not re-estimationcausedresultsto improve. Anotherpossibilityis tousethere-estimationstepto adjustthebiasof thealgorithmto a corpusdifferentthantheoneinitially trainedon.

Anotherissueis thatof context window size. Galeet al. (1992b)find thatsometimeswordseven10,000positionsaway from the target termareusefulfor training; Yarowsky(1992)usesa fixedwindow of 100words. Thatwindow sizewasalsousedin thecurrentimplementationof thealgorithm;however, a moremeaningfulway to specifythecontextwindow would be to usecoherentmulti-paragraphunits asdiscoveredby the TextTilingalgorithmof Chapter2. The re-estimationalgorithmcanalsoplay a role in determininganappropriatewindow size,asfollows (seeFigure4.2). Thetiling algorithm,usingonlyterm repetition,determinesthe windows to be usedasinput to the training phaseof thecategorizationalgorithm.Thecategorizationalgorithmis thenusedtoassigndisambiguatedlabelstomany of thetermswhicharethenre-inputtothetiling algorithm,whichpresumablycannow generatemoreaccuratetile information,andsoon. Thetrainingloop ideahasnotyetbeenimplemented.

In thetrainingphaseof thecurrentimplementation,if awordisamemberof acategory,thenthatwordis notallowedto countasevidencefor thecategory. Thismakesmoresensefor thedisambiguationalgorithmthanfor thetopic labelingalgorithm,but in bothcasestheword shouldbeableto countasevidencefor thecategory it is a memberof, accordingtosomeprior probabilityof its tendency to representthatcategory versusany othercategoryof which it maybea member.

4.3.5 Related Work and Advantages of the Algorithm

Thereexist other systemsin which multiple categoriesare assignedto documents,e.g., Masandet al. (1992), Jacobs& Rau (1990), Hayes(1992). However, unlike themethodsuggestedhere,thesesystemsrequirelargevolumesof pre-labeledtexts in orderto performtheseclassifications.Larson(1992),Larson(1991)presentsanalgorithmthatautomaticallyassignsLibrary of CongressClassificationnumbersto bibliographicrecordsconsistingof titles and subjectheadingsafter forming clustersbasedon training fromexisting records. The methodworks well but requirespre-definedsubjectheadingsasattributesfor classification.

Theapproachof Liddy & Paik (1992)is mostsimilar to thatpresentedhere.Liddy &

Page 54: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 76

membersof ~��[�f� . Henotes,however, thatthisestimateis unreliablewhen E is infrequentin thecorpus,andcorrectsfor this by employinga smoothingalgorithmdescribedin Galeet al. (1992b).

Onceall of thecomputationsfor � ( E 2 ~��[�f�,. havebeencomputed,disambiguationcantakeplace.Yarowsky combinestheevidencesuppliedby thewordssurroundinganinstanceof theword thatis beingdisambiguatedasfollows:

Argmax�}���n� 4= $ P 9<� P����t�,� log� ( E 2 ~��[�f�,.g� ( ~[�[�f��.� ( E .

wherea context of 50 wordsis allottedon eithersideof thetargetword (additionis usedsincethe formula takesthe logs of the evidenceweights). Yarowsky assumesa uniformdistributionfor � ( ~[�[�f��. , andnotesthat � ( E . canbeomittedaswell sinceit will noteffecttheresultsof themaximization.

Thisalgorithmdoesnotenforcemutualexclusivity onevidencefor differentcategories,althoughweightassignedto onecategory doesdetractfrom weightthatcanbeassignedtoany othercategory(sincethefrequency of co-occurrenceof awordwith acategorymemberis dividedby theoverall frequency of theword). Thelack of mutual-exclusivity is usefulin thatit allowsoneword to providepartialevidencefor multiplecategories.

Training proceedsby first collectingglobal frequency countsover a corpus. For thecurrentimplementation,trainingwasdoneonGrolier’s American Academic Encyclopedia( � 8 j 7M words).In thecurrentimplementationof thealgorithm,termsarecheckedagainstWordNet (Miller et al. 1990) in order to placethem in a “canonicalized”form. Wordsthatarelistedon a 454-word“stoplist,” (i.e., a list of closed-classwordsandotherhighlyfrequentwords)arenotusedin thecalculationof evidencefor categorymembership.

If a pair of adjacentwordsmatchesa compoundcontainedin WordNet,thenthatpairis considereda term, insteadof as the individual words that compriseit. If the worddoesnotparticipatein a two-membercompound,thenits membershipin WordNetaloneisinvestigated.If this checkfails, thenthe term’s inflectionsareremovedusinga modifiedversionof WordNet’s morphologicalanalyzer, and the stemmedversionis lookedup inWordNet. In caseof failure, the next two modificationsare conversionof the term’scharactersto lowercaseandreapplicationof morphologicalstemming.If all elsefails, theword is recordedin its original form.

As mentionedabove, Yarowsky’s algorithmis designedto performword disambigua-tion. After training hasbeencompleted,the term weightscanbe usedto classifya newinstanceof atermthatisamemberof oneormorecategoriesinto thecategorywith themostcontextualevidence.However, categoryassignmentis computedsomewhatdifferently.

Evidencefor categorymembershipis determinedby evaluatingco-occurrenceinforma-tion within a fixed-lengthwindow of termssurroundingeachinstanceof a targetterm. Topreventtheevidencesuppliedby frequenttermsfrom dominatingtheevidencesuppliedbyinfrequentterms,theevidencecontributedby a particularmemberof acategory is normal-izedby thenumberof timesthattermoccursin thecorpusasa whole. (Yarowsky (1992)

Page 55: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 75

4.3.3 Lexically-Based Categories

For thepurposesof this algorithm,a category is definedby thesetof lexical itemsthatcompriseit. The implementationusesa category setderivedfrom WordNet(Miller et al.1990),a large,hand-built online repositoryof Englishlexical itemsorganizedaccordingto several lexico-semanticrelations. The implementationdoesnot useRoget’s categoriesbecauseat the time of writing they areno availableto thepublic in electronicform. Thealgorithmusedto derivetheWordNet-basedcategoriesis describedin Section4.5,with thegoal of achieving wide coverageusinggeneralcategories. A moderatesizecategory setwasusedin orderto facilitatecomparisonsagainstjudgementsmadeby humansubjects(whowouldbeoverwhelmedby too largea categoryset).

The algorithmworks by automaticallydetermining,for eachcategory, which lexicalitemstendto indicatethepresenceof thatcategory. Theevidencefor presenceof acategoryis determinedby not only by thepresenceof the lexical itemsthatmakeup thecategory,but alsoby termsthathave beenfound to co-occurin a salientmannerwith thecategoryterms(describedin detail below). For example, the “vehicles” category, consistingofnamesof kindsof vehicles,couldbeindicatedby termsindicatingwherevehiclesareused,e.g.,“road”, “ocean”,etc. Ideally, thecategory itself might containtermsthatindicatethesemanticframein whichthecategoryisused.Forexample,the“athletics”categorycontainstermsaboutathletes,playingfields,andsportsimplements,aswell asnamesof sports.It isdifficult to determinewhereto draw a line betweencategory-specifictermsandtermsthatoccurmoregenerally. However, thealgorithmhelpsdecidethis by indicatingwhich termsoutsidethecategoryneverthelessco-occurwith it significantlyandto theexclusionof othercategories. Thus,in somecasesthealgorithmdiscoverstermsthatsupporttheframe-likemeaningof thecategories.

Sparck-Jones(1971)discussesat lengththedifferencebetweensynonymsandseman-tically relatedtermsin a category definition. For example,termsgroupedwith “desire” inRoget’s Thesaurus include“wish”, “fancy”, and“want”, whichcanbecalledsynonyms. Incontrast,termsgroupedwith “navigation” include“boating”, “oar”, and“voyage”; thesearenot synonyms but aresemantically, associationallyrelatedto the navigation schema.Sheconcludesin Sparck-Jones(1986)that thesemantic-basedclassesaremoreeffectivefor informationretrieval, althoughdoesnotclaimto verify this rigorously.

4.3.4 Determining Salient Terms

Yarowsky 1992definesasalientwordas“onewhichappearssignificantlymoreofteninthecontext of acategory thanatotherpointsin thecorpus”(p 455). For example,theterm“lift” canbesalientfor themachinesenseof “crane”butnotfor thebirdsense.Heformalizesthiswith amutual-information-likeestimate:� ( E 2 ~����A�,. U � ( E . , theprobabilityof awordE occurringin the context of theRogetcategory ~��[�f� dividedby the probabilityof thetermoccurringin thecorpusasawhole.Yarowsky notesthat � ( E 2 ~��[�f�,. canbecomputedby determiningthe numberof times E occursin the context surroundingtermsthat are

Page 56: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 74

to determinewhichsenseof thetargettermis in use(usingavariationof Yarowsky’s(1992)algorithm,seebelow). AlthoughI abandonedthis approach,theactualcurrentimplemen-tation processeswhole windows of wordsat a time, in effect periodically“probing” thedocument.Dependingon thewindow sizeandthefrequency of theprobes,this canresultin countingeachterma constantnumberof times,ratherthanonetime only, but this doesnotchangetheresultingrankingof thecategories.

4.3.2 Yarowsky’s Disambiguation Algorithm

The topic assignmentalgorithmdescribedhereis a modificationof a disambiguationalgorithmdescribedin Yarowsky (1992). Yarowsky definesword sensesasthecategorieslistedfor aword in Roget’s Thesaurus (FourthEdition),whereacategory is somethinglikeTOOLS/MACHINERY. For eachcategory, thealgorithm

1. Collectscontexts thatarerepresentativeof theRogetcategory2. Identifiessalientwords in the collective contexts and deter-

minesweightsfor eachword,and3. Usestheresultingweightsto predicttheappropriatecategory

for a polysemousword occurringin a novel text. (Yarowsky1992)

In otherwords,thedisambiguationalgorithmassumeseachmajorsenseof ahomographis representedby a different thesaurus-likecategory. Therefore,an algorithm that candeterminewhich category an instanceof a termbelongsto canin effect disambiguatetheterm. The disambiguationis accomplishedby comparingthe termsthat fall into a widewindow surroundingthe target term to contexts that have beenseen,in a training phase,to characterizeeachof thecategoriesin which thetargettermis a potentialmember. Thetrainingphasedetermineswhich termsshouldbeweightedhighly for eachcategory, usinga mutual-information-like statistic. Thetrainingdoesnot requirepre-labeledtexts; ratherit relieson the tendency for instancesof differentcategoriesto occurin differentlexicalcontexts to separatethesenses.After thetrainingis completeda word is assigneda senseby combiningtheweightsof all thetermssurroundingthetargetwordandseeingwhichofthepossiblesensesthatwordcantakeonhasthehighestweight.

I extendthisalgorithmto thetext categorizationproblemasfollows. Insteadof choosingfrom thesetof categoriesthatcanbeassignedto a particulartargetword,this new versionof thealgorithmmeasureshow muchevidenceis presentfor all categories,independentlyof whatwordoccursin thecenterof thecontext beingmeasured.After theentiredocumenthasbeenprocessed,thecategorieswith themostevidenceareidentifiedasthemaintopiccategoriesof thetext. This algorithmis basedon theassumption,discussedin Chapter3,thatmaintopicsof a text arediscussedthroughoutthelengthof thetext. Thealgorithmisdescribedin moredetail in thenext two subsections.

Page 57: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 73

4.3 Automatic Assignment of Multiple Main Topics

This chapterdescribesa mechanismfor assigningmultiple main topics to lengthyexpositorytexts. Thealgorithmis a modificationof a disambiguationalgorithmdescribedin Yarowsky (1992). It requiresa training phasethat determineswhich termsshouldbeweightedhighly asevidencefor eachcategory. The trainingdoesnot requirepre-labeledtexts;ratherit reliesonthetendency for instancesof differentcategoriestooccurin differentlexical contexts to separatethe evidence. Whenassigningtopicsto a text, the algorithmmeasureshow muchevidenceis presentfor every category; the categorieswith the mostevidenceareconsideredto bethemaintopiccategoriesof thetext.

Thecategorizationalgorithmis statisticalin natureandis basedon theassumptionthatmaintopicsof atext arediscussedthroughoutthelengthof thetext. Thusalthoughit looksat the evidencesuppliedby individual lexical items,it doesnot takethe structureof thetext into account,e.g.,how thelexical itemsarerelatedto oneanothersyntacticallyor bydiscoursestructure. The algorithmis successfulat identifying schema-likecategories;itidentifiestermsassociatedwith thecategoriesthatarenotnecessarilyoriginallyspecifiedasmembersof thecategories.However, becausethecategoriesarepre-defined,thealgorithmcannotrecognizeor producenovel labels.For this reason,theresultsof thecategorizationalgorithmshouldbe usedin conjunctionwith termsthat occurfrequentlythroughoutthetext whencharacterizingthetexts’ content.Fixedcategoriesshouldplayonly apartialrolein thecharacterizationof thecontentsof thetext.

This chapteralsodiscussesan approachto creatingthesaurus-likecategoriesfrom anexistinghand-built lexicon,WordNet(Miller et al. 1990).Thefirst stepis analgorithmforbreakinguptheWordNetnounhierarchyinto smallgroupsof relatedterms,andthesecondstepdetermineswhich groupsto combinetogetherin an attemptto createschema-likecategories.Thisstepuseslexical associationinformationfrom a largecorpusto determinewhichgroupsaremostsimilar to oneanother. Thisprocedureyieldsasetof categoriesthatcanthenbeusedasgeneralcategory labelsfor lengthyexpositorytexts.

4.3.1 Overview

Thecategoryassignmentalgorithmworksasfollows. A measureof associationbetweenwordsandcategoriesis foundby trainingon a largetext collection;thetrainingalgorithmis describedin thefollowing sections.This measureof associationis usedto characterizethewordsof thedocumentto which categoriesareto beassigned.Thealgorithmlooksuphow stronglyassociatedeachword in thetext is with all of thecategoriesin thecategoryset.Thescoresfor eachcategoryareaddedtogether, andthetopscoringcategories,subjectto a user-specifiedcutoff, arereportedaftertheentiredocumenthasbeenprocessed.Theassociationmeasureis a normalizationof � ( � $ 2 E & . asshown below.

Earlier I experimentedwith algorithmsthat tried to determinewhich sense(category)of awordwasbeingusedbeforeallowing thatwordto contributeto evidencefor theoverallcategorizationof thealgorithm.Thisrequiresusingawindow of wordssurroundingaword

Page 58: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER4. MAIN TOPICCATEGORIES 72

vegetablesbiologyfoodmedicine medicine

biologytroublemilitary

foodvegetablesplantsalcohol

financecommercelawgovernment

A

B

C

D

E

vehiclesliquidstechnologychemistry

Figure4.1: Main topic categoriesassignedby the algorithmdescribedin this chaptertotextsthatcontaintheterm“contaminant”at leasttwicein asmallcorpusof newspapertext.

(Church& Liberman1991)of articlesfrom the Wall Street Journal in which the string“contaminant”occursat leasttwice. Glancingat thesewe cangeta feelingfor the “gist”of eacharticle. For example,documentsA andB areassignedcategoriesrelatingto food,while documentC is assignedtwo very different categories– medicineand military –becausethearticlediscussestheaccidentalreleaseof anagentfor biologicalwarfareandthesubsequentmedicaldamagecontrolefforts. DocumentD discussescontaminantsin atechnicalcontext while documentE discussescontaminantsin afinancialcontext; in otherwords,ratherthanfocusingon themedicalor environmentalaspectsof a contamination,itfocusesonassociatedbusinessandlegalcosts.

Note that this example,especiallydocumentC, highlightsanotherpoint: texts, espe-cially long texts, arenot alwaysbestrepresentedas one topic from one semanticclass.Ratherthey areoftenabouttwo or morethemesandsomerelationshipamongthese.Thusclassifyingdocumentsstrictly within a topichierarchicallycanbemisleading,becausethemultiple themesthatco-exist arenot necessarilyonesthatarecommonlyconsideredto bein the samesemanticframe. Theseandrelatedissuesarediscussedin greaterdetail inChapter5.

Thischapteris structuredasfollows: Section4.3describesthecategorizationalgorithm,Section4.4presentsanevaluationof thealgorithm,andSection4.5describestheway thegeneralthesaurus-likecategory setwasacquired.

Page 59: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

71

Chapter 4

Main Topic Categories

4.1 Introduction

Thischapterpresentsanalgorithmthatautomaticallyassignsmultiplemaintopiccate-goriesto texts,basedoncomputingtheposteriorprobabilityof thetopicgivenits surround-ing words,without requiringpre-labeledtraining dataor heuristicrules. The algorithmsignificantlyoutperformsa baselinemeasureand approachesthe levels of inter-indexerconsistency displayedby nonprofessionalhumanindexers. Thechapteralsodescribestheconstructionof ageneralcategory setfrom anexisting hand-built lexical hierarchy.

Theapproachtocategorizationdescribedhereisonein whichonly thesimplestassump-tionsaremadeaboutwhat it meansto categorizethe contentsof a text. This is doneforthepurposesof robustness,scalability, andgenretransferability. More reasonableresultscouldbeobtainedfrom morestructuredanddomain-specificanalysesof thetext, but at thecostof notallowing for wideapplicability.

4.2 Preview: How to use Multiple Main Topic Categories

Thecapabilityto automaticallyassignmain topic labels(in this andthenext chapter,the terms“categories”, “main topics”, and“labels” areusedinterchangeably)leadsto anew paradigmfor browsing the contentsof full-length texts: the labelscan be usedtohelpcontextualize theresultsof a query;i.e.,show theuserthetopicsthatcharacterizethedocumentsassociatedwith the resultsof a query. In Chapter5, I explore the hypothesisthatusersneedmorecontextual informationwhendealingwith full-length texts thanwithabstractsandshorttext, in partbecausesimilarity informationis lessusefulwhencomparinglengthydocuments.HereI presentoneexampleof this idea.

If the resultsof a user’s query are situatedwith respectto the main topics of thedocuments,a userwith only a vaguenotionof whatcontext thetermshouldappearin canbrowsetheoutputof thecategorizertofindappropriatetexts. Forexample,Figure4.1showsautomaticallyassignedmain topic categoriesfor five texts from the ACL/DCI collection

Page 60: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 70

the representationdifficult to interpret. Another extensionto be madeto the existingimplementationof TileBarsis improvementof thesimplepatternsortingheuristic.Studiesshouldto be doneto determinewhat kinds of patternsortingsaremost informative. InthefuturetheTilebarsshouldalsobeevaluatedin termsof theirusein relevancefeedbackandwith respectto how usersinterpretthemeaningof thetermdistributions.Theanalysisshouldcompareusers’expectationsaboutthemeaningof thetermdistributionsagainsttheanalysisshown in thedistribution chart. It maybeusefulto determinein whatsituationstheusers’expectationsarenotmet,in hopesof identifyingwhatadditionalinformationwillhelppreventmisconceptions.

Informationaccessmechanismsshouldnotbethoughtof asretrieval in isolation.Cuttinget al. (1990)advocatea text accessparadigmthat“weavestogetherinterface,presentationandsearchin a mutuallyreinforcingfashion”; this viewpoint is adoptedhereaswell. Forexample,theusermight sendthecontentsof thePassingReferenceswindow of a TileBarsessionto a Scatter/Gathersession(Cutting et al. 1993), which would then clusterthedocuments,thusindicatingwhatmaintopicsthepassingreferencesoccurredin. Theusercould selecta subsetof the clustersto be sentbackto the TileBar session.This kind ofintegrationwill beattemptedin futurework.

Page 61: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 69

alsocombinethescoresfor adjacent30-wordsegmentsin casethey breakthedocumentinaninopportuneposition,andthenreportthebest J combinedscores.Theusercanchooseto seeeitherthe bestsectionsor the headsof the bestdocuments.This simplemethod,performedon texts from theDow Jonesnewswireservice,consistingof about1 Gigabyteof newswires,magazines,newspapers,amongothers,achievesgoodresultsafterextensivetesting.Theauthorscite a precision-recallproductof 0.65on their taskbut do not furtherelaborateon this claim (it would be a challengeto accuratelydeterminerecall on suchacollectionunlesssomekind of sampling-basedestimationis used).

Hahn(1990)haseloquentlyaddressedthe needfor imposingstructureon full-lengthdocumentsin orderto improve informationretrieval, but proposesa knowledge-intensive,stronglydomaindependentapproach,which is difficult to scaleto sizabletext collections.Croftet al. (1990)describeasystemthatallowsusersdirectaccesstostructuredinformation.Rus& Subramanian(1993)makeuseof certainkindsof structuralinformation,e.g.,tablelayout,for informationextraction.

Ro (1988a)hasperformedexperimentsaddressingtheissueof retrieval from full textsin contrastto usingcontrolledvocabulary, abstracts,andparagraphsalone. PerformingBooleanretrieval for asetof ninequeriesagainstbusinessmanagementjournalarticles,Rofoundthatretrieving againstfull text producedthehighestrecallbut thelowestprecisionofall themethods.In subsequentexperiments,Ro (1988b)tried variousweightingschemesin an attemptto show that retrieving againstfull text would performbetterthanagainstparagraphsalone,but did notachieve significantresultsto this effect.

3.6 Conclusions

This chapterhas discussedretrieval from full-text documents. I have shown howrelative term distribution can be useful information for understandingthe relationshipbetweena queryandretrieveddocuments.I have generalizedthecontrastbetweenmaintopics and subtopicsto an analysisof all the possiblecombinationsof term frequencyand distribution betweentwo term setsand hypothesizedabout the usefulnessof eachdistributionalrelationship.

I have also introduceda new display device, called TileBars, that demonstratestheusefulnessof explicit termdistributioninformation.Therepresentationsimultaneouslyandcompactlyindicatesrelativedocumentlength,querytermfrequency, andquerytermdistri-bution. Thepatternsin acolumnof TileBarscanbequickly scannedanddeciphered,aidingusersin makingfast judgmentsaboutthe potentialrelevanceof the retrieveddocuments.TileBarscanbesortedaccordingto theirdistributionpatternsandtermfrequencies,aidingtheusers’evaluationtaskstill more.Two queriesfrom theTRECcollectionwereanalyzedusingTileBarsandit wasshown thattherelevantdocumentsfor eachquerydemonstratedradicallydifferentpatternsof distributionof thechosenqueryterms.

Currentlyonly two term setsarecontrastedat a time; this canbe easilyextendedtothreeor four. It is most likely the casethat any more than four term setswill make

Page 62: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 68

into motivatedsegments,retrievethetop-scoring200segmentsthatmostcloselymatchthequery(accordingto thevectorspacemodel),andthensumthescoresfor all segmentsthatarefrom thesamedocument.Thiscausesthepartsof thedocumentsthataremostsimilartothequeriestocontributeto thefinalscorefor thedocument.Thisexperimentwasperformedon a smallsubsetof theTRECZIFF collection(274documentsof at least1500wordsoftext each). Similarity searchon segmenteddocumentswasfound to performbetterthanfull documents,andtheapproachof combiningthescoresfor thetop200segmentsworkedsignificantlybetterthaneitherfull textsorsegmentsalone.To beexploredis thequestionofwhatportionsof thedocumentscontributetothesum–arethereseveraldifferentdiscussionsaboutthesamesubtopic,or differentpassagesof thetext correspondingto differentpartsof thequery?Perhaps,asseenin theexamplesin Section3.5.1,differentexplanationsholdfor differentqueries. An examinationusinga modifiedversionof TileBarsshouldhelpelucidatetheseissues.

Moffat et al.

Moffat et al. (1994)andFulleret al. (1993)arealsoconcernedwith structuredretrievalfrom long texts, aswell asefficiency considerationsrequiredfor indexing documentsub-parts.Moffat et al. (1994)performedaseriesof experimentsvaryingthetypeof documentsubpartthatwascomparedandthewaythesubparts’wereusedin theranking.

Interestingly, Moffat et al. (1994)foundthatmanuallysuppliedsectioninginformationmayleadto poorerretrieval resultsthantechniquesthatautomaticallydividethetext. Theycomparedtwo methodsof dividing up long texts. The first consistedof the premarkedsectioninginformationbasedon theinternalmarkupsupplied(presumablyby theauthor)with the texts. The seconduseda heuristicin which small numbersof paragraphsweregroupedtogetheruntil they exceededa size threshold. The resultswere that the small,artificial multi-paragraphgroupingsseemedto perform better than the author-suppliedsectioninginformation. More experimentsarenecessaryin this vein to firmly establishthis result,but it doeslend supportto the conjecturethat multi-paragraphsubtopic-sizedsegments,suchasthoseproducedbyTextTiling, areusefulfor similarity-basedcomparisons.

3.5.3 Other Approaches

Anotherrecentpieceof workonpassageretrieval(Mittendorf& Schauble1994)createsa HiddenMarkov Model representationof thetext andthequery. In orderto evaluatetheresultsthe authorsconcatenatea sequenceof abstracts(from the MEDLAR collection,which consistsof 1003 abstractsand 30 queries)and try to both recognizethe originalboundariesof thedocumentsaswell asfind thedocumentsthatarerelevantto thequery.

Otherresearchershaveapproximatedlocalstructurein longdocumentsby breakingthedocumentsintoeven-sizepieces,withoutregardfor any boundaries.Stanfill& Waltz(1992)reporton sucha technique,usingthe efficiency of a massively parallelcomputer. Theydividethedocumentsinto30-wordsegmentsandcomparethequeriestoeachsegment.They

Page 63: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 67

Salton, Buckley, and Allan

Oneway to get an approximationto subtopicstructureis to breakthe documentintoparagraphs,or for very long documents,sections. In both casesthis entailsusing theorthographicmarkingsuppliedby theauthorto determinetopicboundaries.

Salton,Buckley, andAllan (1991,1993,1994)have examinedissuespertainingto theinterlinking of segmentsof full-text documents.In the applicationsthey have described,Saltonet al. focusonfindingsubpartsof alargedocumentthateitherhavepointersto otherdocuments(asin “SeeAlso” referencesin theencyclopedia,or repliesto previouslypostedemail messages),or are very similar in content. Theselinks areusedfor the purposesautomaticpassagelinking for hypertext. They focusmoreonhow to find similarity amongblocksof text of greatlydiffering length,andnot so muchon the role of the text blockin the documentthat it is a part of. They find that a goodway to ensurethat two largersegments,suchastwo sections,aresimilar to oneanotheris to makesurethey aresimilarbothgloballyandlocally.

Their algorithmsensurethat a documentis similar to a query at several levels ofgranularity: over theentiretext, at theparagraphlevel, andat thesentencelevel. (In thiswork, whenappliedto encyclopediatext, queriesusuallyconsistof encyclopediaarticlesthemselves.) For two sectionsto besimilar, they mustbesimilar overall, at theparagraphlevel, andat the sentencelevel. To accommodatefor the fact thatmostparagraphsdifferin length,they normalizethetermfrequency componentfor thecomparisons.Their resultsshow that this procedureis more effective than using full-text information alone. Thisstrategy, especiallythe sentence-level comparison,serves as a form of disambiguation,sinceit forcestermsthat have more than one senseto be usedtogetherin their sharedsenses.Saltonet al. have found this approachto work quite well for the encyclopediadata,usingthepre-existing See-Alsolinks astheevaluationmeasure.(They point out theproblemswith this asanevaluationmeasure:sincetheencyclopediais parsimoniouswithits referencelinks, many links thatcouldreasonablybepresentaredeliberatelyleft out toavoid clutter.)

However, whenthey appliedthesametechniqueto theTRECcollection,they foundtheresultswerenot improvedby theglobal/localstrategy (Buckley et al. 1994).They attributethis to the lack of needfor disambiguationamongthe TREC queries,sincethe datasetsinvolvedaremorehomogenousthanthoseof theencyclopedia.

Otherreasonsmightbethatthestructureof theTRECqueriesdonotreflectthestructureof thedataset,asis thecasewith theencyclopediatext, andthattheTRECdatasetis muchmorevariedandirregularthanis theencyclopediatext.

Hearst and Plaunt

An alternative approachis presentedin Hearst& Plaunt(1993), which presentsanexperimentthatdemonstratestheutility of treatingfull-length documentsascomposedofa sequenceof locally concentrateddiscussions.The strategy is to divide the documents

Page 64: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 66

Figure3.10: TileBarsfoundin responseto a simplifiedversionof TRECtopic description034. TermSet1 = isdnandTermSet2 = applicationstrategy.

Page 65: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 65

<narr>Narrative:

Toberelevant,adocumentmustidentifyacompany’sstrategy for usingIntegratedSer-vicesDigital Networks(ISDN)or buildingor usingapplicationswhichtakeadvantageof ISDN.

<con>Concept(s):

1. ISDN

2. Strategy, Applications,Products,

3. Networks

<fac>Factor(s):

<def>Definition(s):

IntegratedServicesDigital Networks(ISDN) - An internationaltelecommunicationsstandardfor transmittingvoice,videoanddataover adigital communicationsline.

Thereare49 documentsjudgedrelevantto this topicdescription.By convertingit to asimpleTileBarquery, of theform: TermSet1: ISDN andTermSet2: applicationstrategy,we find TileBar descriptionslike thoseshown in Figure3.10. In this caseit is usefultousethesortedTileBar representation.Interestingly, all of thedocumentsin the“Both TermSets”window, andall thedocumentsin the“TermSet1” window arejudgedto berelevant.Only document525in the“TermSet2” window is relevant,andonly two documentsin the“PassingReferences”window arerelevant.

Theseexamplesgraphically illustrate how differencesin term distribution can havedifferent effects on relevancejudgments. In topic description034, it is importantthatthe term ISDN be frequentand well-distributed throughoutthe text, whereasin topicdescription005,bothtermsetsneededto occurin anoverlappingconfiguration,but in mostcasesin only oneor two passagesof thedocument.

Theseexamplesalsoshow how powerful certainselectedtermscanbe in finding thedocumentsthathave beenmarkedasrelevant.Thevectorspacemodelandothersimilaritycomparisonmodelsaredesignedto determinewhich termsareimportanttermsautomat-ically, usually using via inversedocumentfrequency. In future work I plan to usetheTileBar representationon vector spacescoresto help determinewhich partsof the longtexts contributeto theoverall vectorspacerankings.

3.5.2 Similarity-based Passage Retrieval Experiments

Sofar thischapterhasfocussedontheuseof termdistributionin passage-basedretrieval.Therehasbeenasmallamountof work onapplicationof similarity-basedmeasuresto fulltexts; thiswork is discussedin thissection.

Page 66: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 64

This last is perhapsquestionablesincethetopic descriptionasksfor actionstakenagainstprovenor suspecteddumping,andthepassagefrom document3709describesanavoidanceof a dumpingcharge.

If the hypothesesaboutterm distribution hold true, thendocumentsthat are listed inFigure 3.9 but do not demonstrateoverlap, suchas 1022, 3738, and 3697 shouldnotbe relevant. An examinationof their contentsrevealsthat oneparagraphin 1022canbeconsideredrelevant,althoughJapanisnotmentionedspecifically, norisAmericaorEurope:

[ ... ]

Talking of soapoperas,Toshibaappearsto be doing its partnerIBM andthe othermanufacturersinvolved in the disputeover the alleged dumping of liquid crystaldisplays(CI No 1,501)no favours: theHeraldTribunequotesTakashiShimada,topengineerin Toshiba’s electrontubegroup,assaying"in termsof importance(to thecomputersystem)ourexecutivessaythe1990sequivalentof theDRAM chip is liquidcrystaldisplays"- cuemorehystericalyellow perilism.

[ ... ]

but the othertwo have irrelevantreferences(e.g.,dumpingdataontoa tape). Document2003 presentsconflicting messages:it turns out to be a seriesof very short newsbite,including:

[ ... ]

Also: JAPANESESPEECHRECOGNITIONPROJECT

AUDIOTEX SYSTEMBUSINESSBRISKCENTURY HIGH SCHOOLDEDICATEDUSERSBEMOAN QUALITY, TRAINING

TECHNOLOGYDUMPING IN MALAYSIA

[ ... ]

Anotherexampletopicdescriptionis shown below:

Topic 034

<dom>Domain:

ScienceandTechnology

<title>Topic:

EntitiesInvolvedIn Building ISDNApplicationsandDevelopingStrategiesto ExploitISDN

<desc>Description:

Documentmust describeapplicationscompaniesplan to build (are building, havebuilt) for themselvesor for others,which exploit ISDN’s servicesandcapabilitiesoridentify generalstrategiesfor usingISDN.

Page 67: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 63

Thesetwo passagesbothseemrelevantto thetopicdescription.In Document3557(ZF07-376-770),thetiling is incorrect,becauseit consistsof aseries

of veryshortnewsclips. Perhapsin partfor this reason,thedocumentis not relevant:

[ ... ]

Easingthetradetensiona little, theEuropeanCommissionhas

======================== 1 ========================

lifted the anti-dumping dutieson photocopiersassembledwith the CommunitybyToshibaCorpandMatsushitaElectronicIndustrialCo on thegroundsthatEuropeancontentnow exceeds40%: theonly company still sufferingdutiesis now KonicaInc.

- o -

For ThornEricssonTelecommunicationsLtd, readEricssonLtd: the

Horsham,Sussex-basedcompany,now wholly-ownedbytheSwede,officially changedits nameonJanuary1.

- o -

Citing figuresfrom theElectronicIndustriesAssociationof

Japan, theAmericanElectronicsAssociationnow saysthattheUSshareof worldwideelectronicsproductionfell to 39.7%in 1987from 50.4%in 1984while theJapaneseshareroseto 27.1%from 23.1%over the sameperiod and that of Europeroseto26.4%from 23.5%,althoughthatfiguremasksa decline,becausetheEuropeansharehit 27.6%in 1986;thenewly industrialisedcountriesof theFar Eastsaw their 1987sharehit 6.8%,from 4.9%in 1984.

[ ... ]

In Document3709(ZF07-554-808)we find:

[ ... ]

TheEuropeanCommunity,whoseCommonAgriculturePolicy keepsfoodpriceshigh,andwhich hasfailed to persuademonopolyEuropeanairlinesto reduceair faresthatborderon the racketeering,hasnow succeededin ensuringthat at timesof memorychipgluts,Europeanmanufacturersthatusechipsin theirproductswill notbeabletobuy the thingsat thebestpricesavailableto competitorsin otherpartsof theworld,but insteadwill haveto bankrollmanufacturersin Japan: theCommissionhascoerced11Japanesemanufacturers- FujitsuLtd, HitachiLtd, MitsubishiElectricCorp,NECCorp,ToshibaCorp,

[Sh]arpCorp,Sanyo DenkiCo,MinebeaCoandOki ElectricIndustryCo- to setfloorpricesfor chipsthey export to Europe;thepricesarebetween8% and10%above theaveragecostof production,weightedfor eachcompany’s output;theagreementswillbegoodfor fiveyears,andsolongastheJapanesemakerskeeppricesabovethefloor,they will facenodumping duties.

[ ... ]

Page 68: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 62

[ ... ]

One of the more worrying prospectsfor 1989 is for a big surge of protectionism,andin theUS,AT&T Co hasaskedtheInternationalTradeCommissionto look intoallegeddumpingby manufacturersin Japan, SouthKoreaandTaiwanof smallPABXsand key systems: AT&T claims that US firms have beenseverely injured by thepracticesof morethana dozenFar Eastmanufacturersmarketingsystemsat unfairpricesundermorethan17 brandnames;thecompaniesnamedin thecomplaintareToshiba,Matsushita,Hasegawa, Iwatsu,Meisei,Makayo,NitsukoandTamura,all ofJapan; Goldstar, SamsungandOPCof SouthKorea,andSunMoon Starof Taiwan;AT&T saysthepracticeshaveenabledthecompaniesto raisetheirshareof themarketto 60%from 40%since1985.

[ ... ]

TheTileBarsfor eachof thesedocumentsdisplayappropriateoverlap.But whataboutthedocumentswhoseTileBarsindicateoverlap,butarenotmarkedrelevant?Someof thesearedocuments2413,2859,3557,and3709. In only onecase(2413)doeseithertermsetoccurfrequently, sotheothersmightbeirrelevantreferences.Thepertinentfragmentsareshown below; threeoutof four couldbeconsideredrelevant.

In Document2413(ZF07-387-928),tile 4:, wefind:

[ ... ]

Japanhas removed all the controlson exports of memorychips to the EuropeanCommunityin compliancewith internationaltraderules,theEuropeanCommissionsaid: the restrictionsarosefrom the controversial third country fair marketvalueprovisionsof the US-JapanSemiconductorTradeAgreement,which weredeclaredillegalundertheGeneralAgreementon Tariffs & Trade- but theCommissionis stillstudyingpossibledumping of memorychipsin Europeby Japan.

[ ... ]

In Document2859(ZF07-755-876),hasthefollowing passage:

[ ... ]

JapaneseprintermanufacturersStarMicronicsandNEC Corp,presentlyusingtheirUK plants to penetratethe Europeanmarket,have agreedto increasethe numberof Europeancomponentsin their machines,so avoiding the EuropeanCommunityanti-dumping taxesrecentlyimposedon them: last week,a sitting of the EuropeanCommissionfoundthatfewerthan40%of thecomponentscamefromEuropeanfirms,andassuchtheprinterscameunderthesametax ruling asdirectimportsfrom Japan- around$15 dollars a printer for Star and $33 for NEC; accordingly, both firmshave undertakento includemoreEuropeancomponents,andif this is acceptedat theCommission’snext sitting, thetaxeswill beduly annulled.

[ ... ]

Page 69: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 61

150 OCR software moves into the mainstream optical character rec

720 Minigrams M

744 The European Community and information technology effect of

942�

Minigrams M

946�

Minigrams M

957�

Minigrams M

1022 Minigrams M

1700 Life after dumping key system market

1765 Japan’s view of EC 92 1992 single European market transcript

1785 Sony acquiring Grass Valley Commerce no to Corning amp Japan

1819 Minigrams M

2003�

Newsbytes Index week of Aug 1 1989 highlights

2184 Minigrams M

2413�

Minigrams M

2596�

Minigrams M

2652 Minigrams M

2859�

Minigrams M

3557�

Minigrams M

3670�

Minigrams M

3697�

Tokyograms M

3709�

Minigrams M

3738�

Minigrams M

Total: 22�

Figure3.9: TileBarsfound in responseto a simplifiedversionof TRECtopic description005. TermSet1 = dumpdumpinganti-dumpingandTermSet2 = japanjapanese.

Page 70: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 60

Still othertopic descriptionsrequirethe mentionof a company nameor a countryorsomeotherpropernounin conjunctionwith a generaltopic, e.g.,companiesworking onmultimediasystems.This is mostlikely meantto simulatea filtering or message-stuffingtask,asin theMUC competitions(Sundheim1990).It alsorequiresrecognitionof country,company, andotherpropernames. This is a casewheredistributional informationwillplay a role in somecases,but againoften the relevant termsneedbe foundonly locally.Still othertopicsincludea context or environmentin which a topic is to bediscussedhasbeenspecified.Thiskind of topicmightbenefitfrom anunderstandingof termdistributioninformation.

Below I show twoexamplesof TRECqueries,theirtranformationsintoTileBarrepresen-tations,andthedifferentcharacteristicsthatcanbediscernedabouttherelevantdocumentsusingthis representation.

Considerthefollowing TRECtopicdescription:

Topic 005<dom>Domain: InternationalEconomics

<title> Topic: DumpingCharges

<desc>Description:

TheU.S.or theECchargesJapanwith dumpingaproductonany marketand/ortakesactionagainstJapanfor provenor suspecteddumping.

<narr>Narrative:

To berelevant,a documentmustdiscusscurrentchargesmadeby theU.S.or theECagainstJapanfor dumpinga producton the U.S., EC, or any third-countrymarket,and/oraction(s)takenby the U.S. or the EC againstJapanfor proven or suspecteddumping.Theproductmustbeidentified.

<con>Concept(s):

1. dumping

2. duties,tariffs,anti-dumpingmeasures,punitivetradesanctions,protectivepenalties

3. below market,unfair, predatorypricing

4. CommerceDepartment,InternationalTradeCommission(ITC), EuropeanCom-munity (EC),CommonMarket

5. ruling, charges,investigation

Figure3.9 shows the resultsof searchingon dumpdumpinganti-dumpingandjapanjapanesein thesubsetof ZIFFused.Therelevancejudgmentsassignedby theTRECjudgesstatethatof thevisible documents,thefollowing onesarerelevant: 1700,1765,2184,and3670.Forexample,fromDocument2184(ZF07-376-802),whichis judgedrelevant,comesthefollowing passage:

Page 71: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 59

retrieval testsets;that is, no testsetsin which portionsof long texts have beenidentifiedasrelevant for a queryset. Theclosestavailableis therecentTIPSTER/TRECcollectionandrelevancejudgments(Harman1993),but althoughthis collectionincludessomelongdocuments,it doesnot includerelevancejudgmentsfor passagesalone.

Severalquestionsneedto beaddressedin thestudyof passageretrieval, relatedto thediscussionsof theprevioussections.For example,givena retrievedpassage,wherein thetext did the passagecomefrom: the beginning,middle, endof the document?How area passage’s neighborsin a documentrelatedto it? Wasthepassagechosenbecauseof itscontribution in isolationto therelevanceof thedocumentor is it just a representativepart,andif so, representative in what way? If chosenfor a Booleanquery, how muchandinwhatcontext doeseachtermof thequerycontribute?Thereis a needfor a testcollectionfor passageretrieval thatis sensitive to thesekindsof distinctions.

Researchersworking with hypertext have exploredissuespertainingto organizingin-formationwithin oneor a few long documents,but have not focusedon issuesrelatedtopresentingisolatedpiecesof textsdrawnfromalargecollectionof texts. Fulleret al. (1993),in discussingstrategiesfor hypertext, maketheimportantsuggestionof providing contextfor thetext nodesthatareretrievedasaresultof aquery, ratherthanjustpresentingalist ofrelevantnodes.They contrasttheapproachin standardinformationretrieval, in which thestructureis notaccessibleto thesimilarity engineor viewableby theusers,with hypertextsystemsthatdonotprovidegoodsearchcapabilitiesor sophisticatedstoragesystems.Theydonotsupplyviablesolutionsto theproblem,however.

3.5.1 An Analysis of two TREC Topic Descriptions

As mentionedabove, the relevancejudgesfor TRECwerenot concernedwith distin-guishingretrieval of passagesversusretrieval of documentsoverall. Bearingin mind thatonly asmallpercentageof theTRECdocumentsarelong,thisis notsurprising.But thefactthatrelevancejudgmentsdo not referto particularpartsof long documentsis problematicfor thepurposesof trainingandevaluatingpassageretrieval algorithms.Anotherproblemwith thecollectionis that thedocumentshave not beenrankedaccordingto their relativerelevance,so thereis no way to know whatvariationsin rankingareto bepreferredfor aquerythathasmany positive relevanceassignments.

It is an illuminating exerciseto convert TREC topic descriptionsto representationsapplicableto TileBars. Someof thetopic descriptions,althoughlong anddetailed,canbeaddressedby simplyfindingthedocumentswith afew key terms.For example,all andonlythedocumentsin the ZIFF subsetthatcontainthe word superconductivity arerelevant toTopic021.Many of thetopicdescriptionsrequireaparticularproductor company nametobeidentified,or a company namein conjunctionwith someotherspecificallynameditem.Relevantdocumentsfor thiskind of topicdescriptionoftenhaveall thekey termsin asinglesentence.In thesecasesonly very localpartsof alongtext needto matchin orderto satisfythequery. In othercases,topic descriptionsrequirethe topicsto bediscussedthroughoutthedocument.

Page 72: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 58

1586 CD-ROM takes on dial-up data O

1655 Multimedia and the network O

1669 Free and easy CD-ROM applications column

1712 Byting Barker report from Europe Commodore Total Dynamic Vis

1797 The compact disc myth a lousy paradigm for the computer indu

1808 Multifunction optical disks to drive market Tech Trends

1978 Life before the chips simulating Digital Video Interactive t

2003�

Newsbytes Index week of Aug 1 1989 highlights

2051�

Thoughts and observations at the Microsoft CD-ROM Conference

2130�

Newsbytes Index illustration

2156�

Home is where the interaction is compact disk-interactive in

2238�

The road to respect digital video interactive Video Special

2497 Newsbytes index highlights M

2730�

Release 1 0 calendar April 1989-March 1990 M

2761 Multimedia about interface Macintosh graphical user interfac

2774 Getting the facts strategies for using the PC’s power to hel

3349�

CD-ROMs the BMUG PD ROM Educorp CD ROM ClubMac Software Revi

3753�

Innovation busting out all over Japan Report M

3811�

Stephen Manes column

Figure3.8: SomeTileBarsfoundin responseto aqueryin whichTermSet1 is cd-romandTermSet2 is game.

Page 73: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 57

Doc: 2238 cd-rom:13 game:2

Doc: 2003 cd-rom:9 game:3

However, takinginto accountthenumberof tiles eachtermoccursin changesthepicture:

Doc: 2238 cd-rom:13 10/25tiles game:2 2/25tiles

Doc: 2003 cd-rom:9 2/20tiles game:3 3/20tiles

We seethat the referencesto cd-romin 2238arequite spreadout, whereasthosein2003arequite localized. Theonly questionthat remainsis whetheror not the localizeddiscussionsof cd-rom in 2003 coincidewith thoseof game. From the context bar wecaneasilyseethat they do not, andso we assumethe documentis not of interest. Thediscussionin 2238mightalsobebunchedtogether, asis thecasein 1808,but in thiscaseitis morespreadoutandwecanguessthattheuseof gamein thiscontext bearsat leastsomemeaningfulrelationshipto CD-ROMs.

Upon inspectingthe documents,we seethat 2003consistsof a sequenceof disjointnewsbites,whereas2238describesapplicationsof CD-ROM technology, includinga golfgameapplication. Also verifying our suspicionsaboutdocument1808,we seethat thelagginguseof gamehere,far away from all thecd-romreferences,is a metaphoricaloneaboutpredictingpricesfor WORM devices(“a dart-throwing game”).Note,however, thattherewouldhavebeensomeoverlapin this caseif thequeryhadbeenonwormandgame,but it will againhave appearedto bea passingreference.

Thisdiagramhasanotherinterestingcasein whichit seemsclearthatadensediscussionof the two termstakesplace,althoughfor only part of thedocument,in document3753.Clicking in themiddleof thisdiscussionindeedrevealsadiscussionof theuseof CD-ROMsfor gameplay.

The first tile of document1669 leadsinto a discussionof the utility of CD-ROMtechnologyby mentioninga list of applications,including games,an encyclopedia,andmusic-appreciationsoftware.And not surprisingly, dueto thepatternof intensitiesof thetermoccurrences,document3811is a review of variesCD-ROM-basedgames.

From theseexamplesit shouldseemlikely that with very little exposurea usercanbecomefluentin interpretingTileBars.

3.5 Passage-based Information Access

This chapterhasalludedto issuesrelating to passage-level informationaccess;thissectiondiscussessomegeneralissuesand the moreconventionalapproachesto passageretrieval. To datetherehasbeenlittle researchon passageretrieval, most likely for thereasonsstatedat the beginningof the chapter;especiallythe lack of availableonline fulltext for experimentation. An accompanying importantfact is that thereare no passage

Page 74: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 56

Figure3.7: Theresultsof sortingTileBarsaccordingto the frequency anddistribution ofthe queryterms. As before,Term Set1 consistsof law, legal, attorney, andlawsuit andTermSet2 consistsof networkandlan.

Page 75: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 55

selectingthis viewing point seemssensible,yielding a discussionof a documentationmanagementsystemona networkedPCsystemin a legal office.

Theremainingdocumentswith strongdistributionsof legalterms–IDs1640,1766,1781– discussa lawsuit betweensoftwareproviders,computercrime, andanotherdiscussionof a law firm usinga new networkedsoftwaresystem,respectively. Appropriately, onlythelatterhasoverlapwith networkingterms,sincetheothertwo documentsdonotdiscussnetworkingin the legal context. Interestingly, the solitary mentionof networkingat theendof 1766lists it asa computercrime problemto be worried aboutin the nearfuture.This is an exampleof the suggestive natureof the positionalinformationinherentin therepresentation.

Finally, lookingat theseeminglyisolateddiscussionof document1298we seea letter-to-the-editoraboutthelackof liability andpropertylaw in theareaof computernetworking.Thisletterisoneof severalletters-to-the-editor;henceits isolatednature.Thisisanexampleof aperhapsusefulinstanceof isolated,but stronglyoverlapping,termoccurrences.In thisexample,onemight wonderwhy onelegal term continueson into thenext tile. This is acasein which thetiling algorithmis slightly off in theboundarydetermination.

As mentionedabove, theremainingdocumentsappearuninterestingsincethereis littleoverlapamongthetermsandwithin eachtile thetermsoccuronly onceor twice. We canconfirm this suspicionwith a coupleof examples. Document1270 (type F/G) hasoneinstanceof a legal term; it is a passingreferenceto the formerprofessionof an interviewsubject. Document1356 (type I/H) discussesa court’s legal decisionaboutintellectualproperty rights on information. Tile 3 provides a list of ways to protect confidentialinformation,oneitem of which is to avoid storingconfidentialinformationon a LAN. Soin this casethereferenceis relevantif notcompelling.

Figure3.7 shows the resultsof thesamequerywhenplacedin an interfacethat sortsthe termsaccordingto their frequency and patternsof distribution. The upperlefthandwindow displaysthedocumentsin which bothtermsetsoccurin at least40%of thetiles.The upperrighthandwindow shows thosedocumentsin which at least40% of the tileshave occurrencesof termsfrom Term Set 1, but occurrencesfrom Term Set 2 are lesswell-distributed. The lower lefthandwindow shows the symmetriccase,and the lowerrighthandwindow displaysthedocumentsin which neithertermoccursin morethan40%of thetiles. Within eachwindow thedocumentsaresortedby overallquerytermfrequency.Experimentsneedto berun to evaluatetheeffectivenessof variationsin patterncriteria.

CD-ROMs and Games

Section3.3 hypothesizedaboutthe role of mediumfrequency terms. This exampleexamineshow termdistributioncanmakeadifferencein whetheror nottwo termsetsstandin amodificationalrelationship.In responseto aqueryoncd-romandgame, 49documentswereretrieved. Figure3.8showsaclip of someof thedocuments’TileBars.

Viewed by frequency alone,documents2238 and 2003seemequally viable (or notviable):

Page 76: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 54

Figure3.6: Theresultsof clicking on thefirst tile of document1433: thesearchtermsarehighlightedandthetile numberis shown.

Lookingnow at theactualdocumentswe candeterminetheaccuracy of theinferencesdrawn from theTileBars. Clicking on thefirst tile of document1433bringsup a windowcontainingthecontentsof thedocument,centeredonthefirst tile (seeFigure3.6).Thesearchtermsarehighlightedwith two differentcolors,distinguishedby termsetmembership,andthetile boundariesareindicatedby ruled linesandtile numbers.Thedocumentdescribesin detailtheuseof a networkwithin a legaloffice.

Lookingatdocument1300,theintersectionbetweenthetermsetscanbevieweddirectlyby clicking on theappropriatetile. FromtheTileBar we know in advancethat the tile tobe shown appearsaboutthreequartersof the way throughthe document. Clicking hererevealsa discussionof legal ramificationsof licensingsoftwarewhendistributing it overthenetwork.

Document1471hasonly thebarestinstanceof legal termsandsoit is not expectedtocontainadiscussionof interest– mostlikely apassingreferenceto anapplication.Indeed,thetermis usedaspartof a hypotheticalquestionin anadvicecolumndescribinghow toconfigureLANs.

Theexpectationfor 1758is thatit will discussbothtermsets,althoughnotasintenselyasdid1433.Sincesomeof theterminstancesconcentratenearthebeginningof thisdocument,

Page 77: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 53

Networks and the Law

Figure3.5showssomeof theTileBarsproducedfor thequeryonthetermsetslaw legalattorney lawsuitandnetworklan. In thisportionof theZIFFcollection,thetermsof interesthave thefollowing averagesof occurrence,in thedocumentsin which they appearat leastonce:

¯� �legal 2.4 3.6law 2.8 4.2attorney 1.5 1.0lawsuit 2.3 3.5network 10.7 5.2lan 6.8 10.2

What kind of documentscan we expect to find in responseto this query? Use ofcomputernetworksby law firms, lawsuits involving illegal useof networks,andpatentbattlesamongnetworkvendorsareall possibilitiesthatcometo mind. We know thatsincewe aresearchingin a collectionof commercialcomputerdocuments,most instancesofthewordnetworkwill refer to thecomputernetworksense,with exceptionsfor telephonesystems,neuralnetworks,andperhapssomereferencesto theconstructusedin theoreticalanalyses.Sincelegal is anadjective, it canbeusedasa modifierin a varietyof situations,but togetherwith theothertermsin its set,a largeshowing of thesetermsshouldindicatealegitimateinstanceof adiscussionin thelegal frame.Thesetwo termsetswerespecificallychosenbecausetheir meaningsare usually in quite separatesemanticframes; the nextexamplewill discussaqueryinvolving termsthataremorerelatedin meaning.

In Figure 3.5, theresultshave not beensortedin any mannerotherthandocumentIDnumber. It is instructiveto examinewhatthebarsindicateaboutthecontentof thetextsandcomparethatagainstthehypothesisof Section3.3andagainstwhatactuallyis discussedinthetexts. Document1433jumpsout becauseit appearsto discussbothtermsetsin somedetail(typeA from thechart).Documents1300and1471arealsoprominentbecauseof astrongshowing of thenetworktermset(typeC). Document1758alsohaswell-distributedinstancesof bothtermsets,althoughwith lessfrequency thanin document1433(typeH).Legal termshave a strongdistributionalshowing in 1640,1766,1781aswell (typesC andG). We alsonotea largenumberof documentswith very few occurrencesof eitherterm,althoughin somecasestermsaremorelocally concentratedthanin others.Document1298is interestingin that it seemsto have an isolatedbut intensediscussionof both term sets(typeH); thefactthatneithertermsetcontinuesoninto therestof thedocumentimpliesthatthis discussionis isolatedfrom therestin meaningaswell. Most of theotherdocumentslook uninterestingdueto their lackof overlapor infrequency of termoccurrences.

Page 78: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 52

usefulfor long texts,with theirvariedinternalstructure,thanabstracts.However, this ideahasnotyetbeenimplemented.

TileBarsdisplaycontext correspondingdirectly to theusers’query;specificallyto thetermsusedin a free-text search.Sometimes,however, theuseris unsureof whatkind ofqueriesto makeandneedsto get familiar with new textbasesrapidly. Chapter5 describestheuseof maintopic informationto helpprovidecontext in thissituation.

Implementation Notes

Thecurrentimplementationof the informationaccessmethodunderlyingthe TileBardisplaymakesuseof � 3800texts of length100-500lines from the ZIFF portion of theTIPSTERcorpusand � 250 texts of the samelengthfrom the AP portion of TIPSTER,for a total of about57Mbytes(Harman1993). (ZIFF is comprisedmainly of commercialcomputernews and AP is world news from the late 1980s.) The interfacewas writtenusingtheTcl/TK X11-basedtoolkit (Ousterhout1991). Thesearchenginemakesuseofcustomizedinvertedindex codecreatedespeciallyfor this task2; eachterm is indexedbydocumentandtile number, andtheassociatedfrequencies.In thefuturethismaybereplacedwith the POSTGRESdatabasemanagementsystem,which hassupportfor large objectsanduser-definedtypes(Stonebraker& Kemnitz1991). An alternative indexing stratumisthatof GLIMPSE(Manber& Wu 1994)(built onagrepWu & Manber(1992))whichstoresasmallindex (about2-4%of thesizeof thetext collection)but hasanacceptablespeedformany tasks.

Theinformativenessof theTileBar representationis hinderedwhentheresultsof tilingareinaccurate.TheZIFF databasecontainsmany documentscomprisedof listsof concate-natedshortnews articles,andsomedocumentscomprisedof single-linecalendaritems.Thetiling algorithmis setup sothata singleline segmentis too fine a division; therefore,documentslike thecalendartext will have erroneoustilings (althougharbitrarygroupingsontermslike theseareperhapspreferableto assigningeachsentenceits own tile, dueto ef-ficiency considerations).Thealgorithmdoesdofairly well atdistinguishingslightly longerconcatenatedarticles,suchassequencesof paragraph-longnews summariesandletterstotheeditor. It is alsoquitegoodat recognizingtheboundariesof summarizinginformationat thebeginningof articleswhensuchinformationappears.

3.4.4 Case Studies

This sectionexaminesthe properiesof TileBars in more detail, using two examplequerieson theZIFF corpus.

2I amgratefulto MarcTeitelbaumfor theswift implementationof thiscode.

Page 79: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 51

� If shadingis used,makesuredifferencesin shadingline up with the valuesbeingrepresented.The lightest (“unfilled”) regionsrepresent“less”, anddarkest(“mostfilled”) regionsrepresent“more”. (Kosslynet al. 1983)� Becausethey dohaveanaturalvisualhierarchy, varyingshadesof grayshow varyingquantitiesbetterthancolor. (Tufte1983)

Note that the stackingof the termsin the query-enteringportion of the documentisreflectedin thestackingof thetiling informationin theTileBar: thetop row indicatesthefrequenciesof termsfrom TermSet1 andthebottomrow correspondsto TermSet2. Thustheissueof how to specifythekeytermsbecomesa matterof whatinformationto requestin theinterface.

TileBarsallow theuserto beawareof whatpartof thedocumentthey areaboutto viewbeforethey view it. If they feel they needto know moreof whatthedocumentis abouttheycansimply mouse-clickon a part of the representationthat symbolizesthe beginning ofthedocument.If they wish to go directly to a tile in which termoverlapoccurs,they clickon thatportionof thetext, knowing in advancehow far down in thedocumentthepassageoccurs.

The issueof how to rank the documents,if rankingis desired,becomesclearernow.Documentscanbegroupedby distributionpattern,if this is foundto beusefulfor theuser.Eachpatterntypecanoccupyits own window in thedisplayanduserscanindicateprefer-encesby virtueof whichwindowsthey use.Thusthereisnosinglecorrectrankingstrategy:in somecasesthe usermight wantdocumentsin which the termsoverlapthroughout;inothercasesisolatedpassagesmightbeappropriate.Figure3.7showsanexamplein whicha query’s retrieval resultshave beenorganizedby distributionpatterntype.

Relevancefeedbackis generallyperceived asan effective strategy for improving theresultsof retrieval (Salton& Buckley 1990). In relevancefeedback,thesystemrespondsto input from the userindicating which documentsare of interestand which are to bediscarded.Fromthis informationthesystemcanguesshow to downweightsometermsandincreasethe weight on otherterms,aswell as introducenew termsinto the querybasedon the documentsthat the userfoundespeciallyhelpful. Relevancefeedbackappearstowork well becausethe userhelpssetterm weights,indirectly specifyingwhich formulasbetterdescribethekind of informationbeingsought.However, thegatheringof relevancefeedbackis time-consuminganddraining on the user, sinceit requiresthe userto readthetext for contentandguesswhetheror not thetermsof thedocumentwill beusefulforfindingotherinterestingdocuments.

TileBarscould provide a relevancefeedbackmechanismin which userscanindicatepatternsof interestaswell asor insteadof termsof interest.Relevancefeedbackbasedonpatternsshouldbemoreeffective thanrequiringa specificationof whatkindsof patternsaredesiredin advance,or requiringthe entryof a queryin termsof subtopic/maintopicor someotherrelationship.It couldalsoactasanalternativeor a supplementto relevancefeedbackon term similarity, sinceasarguedabove, overall similarity is lesslikely to be

Page 80: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 50

Figure3.5: TheTileBar displayparadigm.Rectanglescorrespondto documents,squarescorrespondto TextTiles, thedarknessof a squareindicatesthe frequency of termsin thecorrespondingTerm Set. Titles and the initial words of a documentappearnext to itsTileBar. Term Set1 consistsof law, legal, attorney, lawsuit andTerm Set2 consistsofnetworkandlan.

Page 81: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 49

� Long texts differ from abstractsandshort texts in that, alongwith term frequency,termdistribution informationis importantfor determiningrelevance.� Therelationshipbetweentheretrieveddocumentsandthetermsof thequeryshouldbepresentedto theuserin a compact,coherent,andaccuratemanner(asopposedtothesingle-pointof informationprovidedby a ranking).� Passage-basedretrievalshouldbesetupto providetheuserwith thecontext in whichthepassagewasretrieved,bothwithin thedocument,andwith respectto thequery(this issueis discussedin moredetailin Section3.5).

Figure3.5showsanexampleof anew representationalparadigm,calledTileBars, whichprovidesa compactandinformative iconic representationof thedocuments’contentswithrespectto the query terms. TileBars allow usersto makeinformeddecisionsaboutnotonly which documentsto view, but alsowhich passagesof thosedocuments,basedon thedistributionalbehavior of thequerytermsin thedocuments.Thegoalis to simultaneouslyindicate the relative length of the document,the relative frequency of the term setsinthe document,their distribution with respectto thedocument,andtheir distribution withrespectto eachother. Eachlarge rectangleindicatesa document,andeachsquarewithinthedocumentrepresentsa TextTile. Thedarkerthetile, themorefrequenttheterm(whiteindicates0, black indicates9 or more instances,the frequenciesof all the termswithina term setareaddedtogether). Sincethe barsfor eachset of query termsare lined uponenext to the other, this producesa representationthat simultaneouslyandcompactlyindicatesrelativedocumentlength,querytermfrequency, andquerytermdistribution. Therepresentationexploits thenaturalpattern-recognitioncapabilitiesof thehumanperceptualsystem(Mackinlay1986);thepatternsin acolumnof TileBarscanbequickly scannedanddeciphered.I hypothesizethat the interpretationof thepatternsshouldbealongthe linesoutlinedin Section3.3. Somecasestudiesappearin Section3.4.4below.

Term overlap and term distribution are easyto computeand can be displayedin amannerin which bothattributestogethercreateeasilyrecognizedpatterns.For example,overalldarknessindicatesa text in whichbothtermsetsarediscussedin detail. Whenbothtermsetsarediscussedsimultaneously, their correspondingtiles blendtogetherto causeaprominentblock to appear. Scattereddiscussionshave lightly coloredtiles andlargeareasof whitespace.NotethatthepatternsthatcanbeseenherebearsomeresemblencetoFigure2.4 in Chapter2, in which termdistributionsfor a text aredisplayed.

TileBarsmakeuseof the following visualizationproperties(extractedfrom Senay&Ignatius(1990)):� A variation in position, size, value [gray scalesaturation],or texture is ordered

[ordinal] thatis, it imposesanorderwhich is universalandimmediatelyperceptible.(Bertin1983)� A variationin position,size,value[grayscalesaturation],textureorcoloris selective,thatis, it enablesusto isolateall marksbelongingto thesamecategory. (Bertin1983)

Page 82: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 48

Figure3.4: A fill-in-the-formstypeinterfaceto a bibliographicdataset.

Most query-formulationproblemscan be circumventedvia a forms-basedinterface.Davis (1994)hasrecentlydevelopedsuchan interfaceto thebibliographicrecordsof theCNRI computerscienceonlinetechnicalreportproject(seeFigure3.4). In this interface,theoptionsarespelledoutclearly, all optionsarevisible,andtheinterfaceitself suppliesthesyntaxfor thequery. A similarsituationarisesin theworldof databasemanagementsystems.Mucheffort hasbeenexpendedontrying to determinetheright wayto formulatekeyword-and-syntaxbasedquerylanguages,whenevidencesuggeststhatgraphically-orientedwaysof specifyingthequeryarepreferablefor mostkindsof queries(Bell & Rowe1990).

There is an analogybetweensystemsthat requireobscurekeyword languagesandsystemsthat displayresultsbasedon an invisible rankingalgorithm. Neithersupplytheuserwith arepresentationthatreflectstheunderlyinginformation.Bothprobablyarosedueto thelimitationsof computerhardwareat thetime,andunfortunatelyarestill in usetoday.

3.4.3 TileBars

Thissectionpresentsonesolutionto theproblemsdescribedin theprevioussubsections.Theapproachis synthesizedin reactionto threehypothesesdiscussedearlier:

Page 83: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 47

(althoughit is legal to enter“fi tw democracy andtw america”thisresultsin amuchlongersearchdueto theindexing structureunderlyingthesystem(Farley 1989)).However, to finda bookby two authors,two copiesof thekeyword pa (andtheAND connective) mustbeused,asdemonstratedby theerrorbelow:

CAT-> find pa mosteller wallace

Search request: FIND PA MOSTELLER WALLACESearch result: 0 records at all libraries

Please type HELP

CAT-> find pa mosteller and pa wallace

Search request: FIND PA MOSTELLER AND PA WALLACESearch result: 2 records at all libraries

Type D to display results, or type HELP.

GLADIS is the othermajor bibliographicsystemavailableat UC Berkeley, indexingmainly the local collection and providing timely information suchas check-outstatus.Unfortunately, its keyword list is slightly differentand the interfaceis unforgiving withrespectto this:

===> find pa tocqueville**> THE SEARCH CODE WAS NOT RECOGNIZED**> Type a search code listed above**>===> find pn tocqueville

Your search for the Personal Name: TOCQUEVILLEretrieved 41 name entries.

Anotherproblemwith thesesystemsis that althoughthey have somevery powerfulspecialpurposesearchcapabilities(suchas the capability to look for PhD dissertationsspecifically, in thecaseof MELVYL) usersareunawareof theoptionsbecausethey requireknowledgeof specialkeywords.

Page 84: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 46

Index conf.announce contains the following 164 items relevant to’image network’. The first figure for each entry is its relativescore, the second the number of lines in the item.

image network

* 1000 1190 /ftp/pub/conf.announce/jenc5* 886 125 /ftp/pub/conf.announce/image.processing.conf* 800 334 /ftp/pub/conf.announce/image.analysis.symposium* 743 303 /ftp/pub/conf.announce/sans−III

This is a searchable index. Enter search keywords:

* 543 376 /ftp/pub/conf.announce/atnac.94* 486 133 /ftp/pub/conf.announce/sid* 486 125 /ftp/pub/conf.announce/qes2* 457 138 /ftp/pub/conf.announce/europen.forum.94* 429 378 /ftp/pub/conf.announce/mva.94* 429 785 /ftp/pub/conf.announce/openview.conf* 429 104 /ftp/pub/conf.announce/high.performance.networking* 400 217 /ftp/pub/conf.announce/nonlinear.signal.workshop* 429 378 /ftp/pub/conf.announce/vision.interface.94* 429 785 /ftp/pub/conf.announce/inet.94* 429 104 /ftp/pub/conf.announce/icmcs.94* 400 217 /ftp/pub/conf.announce/internetworking.94* 371 220 /ftp/pub/conf.announce/iss.95* 371 168 /ftp/pub/conf.announce/qes1* 343 152 /ftp/pub/conf.announce/conti.94* 343 247 /ftp/pub/conf.announce/elvira

Figure3.3: A sketchof theresultsof aWAIS searchon imageandnetworkona datasetofconferenceannouncements.

Page 85: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 45

of thedocumentis indicatedby a number, which althoughinterpretable,is not easilyreadfrom thedisplay. Figure3.3 representstheresultsof a searchon imageandnetworkon adatabaseof conferenceannouncements.Theusercannotdetermineto what extent eithertermis discussedin thedocumentor whatrole thetermsplay with respectto oneanother.If theuserprefersa densediscussionof imagesandwouldbehappywith only a tangentalreferenceto networking,thereis nowayto expressthis preference.

Attemptsto placethis kind of expressivenessinto keyword basedsystemareusuallyflawed in that the usersfind it difficult to guesshow to weight the terms. If the guessisoff by a little they maymissdocumentsthatmightberelevant,especiallybecausetheroletheweightsplay in thecomputationis far from transparent.Furthermore,theusermaybewilling to look at documentsthat arenot extremelyfocusedon oneterm, so long asthereferencesto theothertermsaremorethanpassingones.Finally, thespecificationof suchinformationis complicatedandtime-consuming.

The concernin the information retrieval literatureabouthow to rank the resultsofBooleanandvectorspace-typequeriesis misplaced.Oncethereis a baselineof evidencefor choosinga subsetof the thousandsof availabledocuments,thenthe issuebecomesamatterof providing theuserwith informationthatis informativeandcompactenoughto beableto beinterpretedswiftly. As discussedin theprevioussection,therearemany differentwaysa long text canbe“similar” to thequerythatissuedit, andsowe needto supplytheuserwith a way to understandthe relationshipbetweenthe retrieved documentsandthequery.

3.4.2 Analogy to Problems with Query Specification

Therehave beenmany studiesshowing that usershave difficulty with Booleanlogicqueriesandmany attemptsatmakingthequeryformulationprocesseasier. Researchpapersdiscussat greatlengththerelative benefitsof onequerylanguageover another. However,this issueis circumventedto someextent if insteada systemprovides the userwith anintuitive,direct-manipulationinterface(Shneiderman1987).

A goodexampleof this is thedifferencebetweenthekeyword-basedinterfaceto largeonlinebibliographicsystems.Theuserhasto rememberthecorrectkeywordsto usefromsystemto system,and must rememberwhereto placeAND and OR connectives. Forexample,with MELVYL, theonlinebibliographicsystemfor theUniversityof California(Lynch1992),to look for a bookby deTocqueville containingtheword Democracy, onemustenter

fi pa tocqueville and tw democracy

wherepa indicates“personalauthor”andtw indicates“title words”. However, to find atitlewith bothwordsdemocracy andamericaoneneedenteronly onecopyof thekeyword tw:

fi tw democracy america

Page 86: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 44

differentiatingtermsthatoccurin backgroundsentencesfrom thosethat occurin spokenquotationsandthosethatarein leadsentencesin orderto betterunderstandthe relationsamongterms.

Giventheanalysissurroundingthechartof Figure3.2,how cantheseobservationsaboutrelativetermdistributionbeincorporatedintoaninformationaccesssystem?Thefollowingsectiondiscussesthis issue,first touchingon problemswith existing approaches,andthensuggestinganew solution.

3.4 Distribution-Sensitive Information Access

3.4.1 The Problem with Ranking

Noreault et al. (1981) performedan experimenton bibliographic recordsin whichthey tried every combinationof 37 weightingformulasworking in conjunctionwith 64combiningformulason Booleanqueries. They found that the choiceof schememadealmostno difference:the bestcombinationsgot about20%betterthanrandomordering,andno oneschemestoodoutabove therest.

Theseresultsimply that small changesto weightingformulasdon’t have muchof aneffect. As found in otheraspectsof text analysisfor informationretrieval, (e.g.,effectsof stemming,or morphologicalanalysis,or using phrasesinsteadof isolatedterms),amodificationof an algorithm improves the resultsin somesituationsand degradestheresultsin others.

Whymightthisbethecase?Perhapstheansweris thatthere is no single correct answer.Perhapstrying to assignnumbersto theimpoverishedinformationthatwe have aboutthedocuments(or in thiscaseof theexperimentin Noreaultet al. (1981),bibliographicrecords)isnotanappropriatethingtodo. It couldbethecasethatwhendifferentkindsof informationarepresentin thetexts thetermrankingservesonly to hidethis informationfrom theuser.Ratherthanhidingwhatis goingonbehindarankingstrategy, I contendit is betterto showthe userswhat hashappenedasa resultof their queryandallow the usersto determinefor themselves what looks interestingor relevant. Of course,this is the intendedgoalof ranking. But an orderedlist of titles andprobabilitiesis under-informative. The linkbetweenthe queryterms,the similarity comparison,and the contentsof the texts in thedatasetis toounderspecifiedto assumethatasingleindicatorof relevancecanbeassigned.

Instead,therepresentationof theresultsof theretrievalshouldpresentasmany attributesof thetexts andtheir relationshipto thequeriesaspossible,andpresenttheinformationina compact,coherentandaccuratemanner. Accuratein this casemeansa truereflectionoftherelationshipbetweenthequeryandthedocuments.

Considerfor examplewhathappenswhenoneperformsakeywordsearchusingWAIS(Kahle& Medlar1991). If thesearchcompletes,it resultsin a list of documenttitles andrelevancerankings.Therankingsarebasedon thequerytermsin somecapacity, but it isunclearwhat role thetermsplay or what thereasonsbehindtherankingsare. The length

Page 87: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 43

locally organized. Thereis probablya brief but real discussionof Term Set 1 inrelationto Set2, perhapsa subtopicto 2’s maintopic. If thefrequency is extremelylow (e.g,1), thenthis is probablya passingreference.

D Both termsareof mediumfrequency but globally distributed. Most likely thesamesituationasA, but somewhatlesslikely to befully aboutbothtermsets.

E Bothtermsetshavemediumfrequency; oneis locally distributedandoneglobally. Ifthey havesometileswith significantoverlapthenthedocumentisprobablyof interestif theuseris interestedin amaintopic/subtopic-likedistribution.

F TermSet2 hasmediumfrequency, TermSet1 is infrequent,andbotharescattered.Thetwo mightbeara relationshipto oneanotherbut thereis notenoughevidencetodecideeitherway. Lesslikely to beusefulthanin G.

G TermSet2 hasmediumfrequency, globallydistributed,andTermSet1 is infrequentbut localized.If thetwo overlapthereis agoodchanceof adiscussioninvolvingbothtermsetsbut with only a brief referenceto TermSet1.

H Bothtermsetshavemediumor low frequency andarelocalized.If they overlapthenthishassomechanceof beinga goodisolateddiscussion.If they donotoverlap,thedocumentshouldbediscarded.

I Bothtermsetsareinfrequent,onelocalized,onenot. Thisdocumentshouldprobablybediscarded.

J Bothtermsetsareinfrequentandgloballydistributed.Thisdocumentshouldprobablybediscarded.

Of coursetheseobservationsshouldbegeneralizedto morethantwo termsets,but formultiple termsetstheimplicationsof eachcombinationarelessclear.

Interestingly, Grimes(1975)hadtheprescienceto suggestthevalueof localizedinfor-mationasdeterminedby discoursestructure.Hewrotein 1975:

Now that informationretrieval is takingon greaterimportancebecauseof theprolif-erationof circulatedinformation,linguisticsmay have somethingto contribute to itthroughdiscoursestudies.In thefirst place,studiesof discourseseemto show thattheessentialinformationin somediscoursesis localized,which implied thatfor retrievalit might bepossibleto specifypartsof thediscoursethatdo nothave to betakenintoaccount.Thereis definitelya patternof organizationof informationin any discoursethatcanberecognizedandshouldthereforebeexploredfor its usefulnessin retrieval;for example,Halliday’s notionof thedistributionof givenandnew information.

Grimes’suggestionof usingthelocalizedstructureof discoursetoeliminatecertainpassagesis a usefulone,althoughdifferentthanthatsuggestedhere.Work alongrelatedlinesdoesappearin Liddy (1991),whichdiscussestheusefulnessof understandingthestructureof anabstractwhenusinganatural-languagebasedinformationretrieval approach,andLiddy &Myaeng(1993),which usesinformationaboutthekind of sentencea termoccursin; e.g.,

Page 88: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 42

Medium

Global

Low

Local

High

Global Local

Medium

Global

Low

High

Global

Medium

Global

Local

Medium

Global

Low

Low

Local

TERM SET 1

TERM SET 2

A A A B C

D E F G

H I

I

H

H

J

Figure3.2: Frequency anddistributionalrelationshipsbetweentwo termsets.Seethetextfor anexplanationof theletterlabels.

Page 89: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 41

A B

B

A BAA

B

(a) (b) (c) (d)

Figure3.1: Possiblerelationshipsbetweentwo termsin a full text. (a) Thedistribution isdisjoint, (b) co-occurringlocally, (c) termA is discussedglobally throughoutthetext, B isonly discussedlocally, (d) bothA andB arediscussedglobally throughoutthetext.

two term sets,Term Set1 andTerm Set2, wherea term set is a setof termsthat bearsomekind of semanticrelationshipto oneanother(e.g.,election,poll, andvoteor barney,dinosaur, andcloying). Thetermsetsareconsideredto besymmetric;that is, neitheroneis moreimportantthantheother, andsothelowertriangleof thechartis omitted.Within adocument,eachtermsetcanbecharacterizedasbelongingtooneof fourpossiblefrequencyranges:high,medium,low, andzero,andoneof two distributionpatterns:globalandlocal.(Termsetswith frequency zeroarenotconsideredin thechart.)Thefrequenciesaremeantto berelative to thelengthof thedocument,andthedifferencebetweenhigh,medium,andlow shouldbethoughtof asgraded.

For thepurposesof interpretingtermsetdistribution it is convenientto assumethatthedocumentshave beendivided into TextTiles: adjacent,non-overlappingmulti-paragraphunitsof text thatareassumedtocorrespondroughlyto thesubtopicstructureof thetext. Thedistinctionbetweenglobalandlocal distribution is alsomeantto berelative to documentlength. A termsetwith low frequency andlocal distribution occursin oneor two tiles; atermsetwith mediumfrequency andlocal distributionoccursin perhapstwo groupingsoftwo tiles each,or onegroupingof oneto threetiles. On the otherhand,a term setwithmediumfrequency andglobaldistributionwill have termsin roughlyhalf thetiles.

With theaid of this chartwe canform hypothesesabouttheroleof interactionsamongtermdistributionandfrequency andtheirrelationshiptodocumentrelevance(assumingthatif thetermsin a termsetoccurwith high frequency thenthey aregloballydistributed):

A Instancesof both term setsoccur with high frequency, or one term set is highlyfrequentandtheotherhasmediumfrequency. Thedocumentdescribesbothtermsetconceptsto a large extent; this would be usefulfor a userwho wantsa main topicdiscussionof bothconceptssimultaneously.

B Term set2 is quite frequent,Term Set1 infrequentandscattered;probablyusefulonly if theuseris primarily interestedin TermSet1.

C Term Set 2 is quite frequent,Term Set1 infrequentbut, as opposedto type B, is

Page 90: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 40

3.3 Long Texts and Their Properties

A problemwith applyingtraditionalinformationretrieval methodsto full-length textdocumentsis that the structureof full-length documentsis quite different from that ofabstracts.Abstractsarecompactandinformation-dense.Most of the (non-closed-class)termsin anabstractaresalientfor retrieval purposesbecausethey actasplaceholdersformultipleoccurrencesof thosetermsin theoriginal text, andbecausegenerallythesetermspertainto themostimportanttopicsin thetext. Consequently, if thetext is of any sizeablelength,it will containmany subtopicdiscussionsthatarenevermentionedin its abstract.

Whenauserengagesin asimilaritysearchagainstacollectionof abstracts,theuseris ineffectspecifyingthatthesystemfind documentswhosecombinationof maintopicsis mostlike that of the query. In otherwords,whenabstractsarecomparedvia the vector-spacemodel,they arepositionedin amulti-dimensionalspacewheretheclosertwoabstractsaretooneanother, themoretopicsthey arepresumedto havein common.Thisisoftenreasonablebecausewhencomparingabstracts,the goal is to discover which pairsof documentsaremostalike. For example,a queryagainsta setof medicalabstractswhich containstermsfor thenameof adisease,its symptoms,andpossibletreatmentsis bestmatchedagainstanabstractwith assimilara constitutionaspossible.

Most full text documentsarerich in structure.Onewayto view anexpositorytext is asasequenceof subtopicssetagainsta“backdrop”of oneor two maintopics.A longtext canbecomprisedof many differentsubtopicswhich mayberelatedto oneanotherandto thebackdropin many differentways.Themaintopicsof a text arediscussedin its abstract,ifoneexists,but subtopicsusuallyarenotmentioned.Therefore,insteadof queryingagainsttheentirecontentof a document,a usershouldbeableto issuea queryabouta coherentsubpart,or subtopic,of afull-lengthdocument,andthatsubtopicshouldbespecifiablewithrespectto thedocument’smaintopic(s).

Figure3.1illustratessomeof thepossibledistributionalrelationshipsbetweentwotermsin themain topic/subtopicframework. An informationaccesssystemshouldbeawareofeachof the possiblerelationshipsandmakejudgmentsas to relevancebasedin part onthis information. Thusa documentwith a main topic of “cold fusion” anda subtopicof“funding” wouldberecognizableevenif thetwotermsdonotoverlapperfectly. Thereversesituationwouldberecognizedaswell: documentswith a maintopic of “funding policies”with subtopicson“cold fusion” shouldexhibit similarcharacteristics.

Notethata queryfor a subtopicin thecontext of a maintopic shouldbeconsideredtobequalitatively differentfrom aconjunction.A conjunctionshouldspecifyeithera join oftwo or moremain topicsor a join of two or moresubtopics– it shouldimply conjoiningtwo like items.In contrast,“in thecontext of” canbethoughtof asasubordinatingrelation(seeFigure3.1).

Theideaof themaintopic/subtopicdichotomycanbegeneralizedasfollows: differentdistributionsof term occurrenceshave different semantics;that is, they imply differentthingsabouttheroleof thetermsin thetext.

Considerthechartin Figure3.2. It shows thepossibleinterlinking of distributionsof

Page 91: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 39

otherapproachwhich attemptsto find documentsthat aremostsimilar to a queryor tooneanotherbasedon thetermsthey contain.In similarity search,thebestoverall matchesarenot necessarilytheonesin which thelargestpercentageof thequerytermsarefound,however. For example,givena querywith 30 termsin it, thevectorspacemodelpermitsa documentthatcontainsonly a few of thequerytermsto be rankedvery highly if thesewordsoccurinfrequentlyin thecorpusasa wholebut frequentlyin thedocument.

In thevectorspacemodel(Salton1988),aquery’stermsareweightedandplacedinto avectorthat is comparedagainstvectorsrepresentingthedocumentsof thecollection. Theunderlyingassumptionis thatdocuments’contentcanberepresentedin a geometricspaceandtherelative distancebetweentheir vectorsrepresentstheir relative semanticdistance.In probabilisticmodels(vanRijsbergen1979),thegoalis to rankthedatabaseof documentsin orderof their probabilityof usefulnessfor satisfyingtheuser’s statedinformationneed.However, in practicethesesystemsalso representqueriesanddocumentswith weightedtermsandtry to predicttheprobabilityof relevanceof adocumentto aqueryby combiningthescoresof theweightedterms.

In Booleanretrieval a queryis statedin termsof disjunctions,conjunctions,andnega-tionsamongsetsof documentsthatcontainparticularwordsandphrases.Documentsareretrieved whosecontentssatisfy the conditionsof the Booleanstatement.The userscanhavemorecontroloverwhattermsactuallyappearin theretrieveddocumentsthanthey dowith similarity search.However, a drawbackof Booleanretrieval is thatin this frameworkno rankingorder is specified. This problemis sometimesassuagedby applyingrankingcriteriaasusedin similarity searchto theresultsof theBooleansearch(Fox & Koll 1988).

Most informationretrieval similarity measurestreatthetermsin adocumentuniformlythroughout. That is, a term’s weight is the sameno matterwhereit occursin the text.1

Many researchersassumethis is a valid assumptionwhenworkingwith abstracts,sinceitis a fair approximationto saythat the locationof thetermdoesnot significantlyeffect itsimport. Thesecommentsapplyaswell to shortnews articles,anothertext typecommonlystudiedin informationretrieval research.

Althoughthereareotherapproaches,suchasknowledge-basedsystems,e.g.,McCuneet al. (1985),Funget al. (1990),Mauldin(1991),DeJong(1982),whichattemptto interpretthetext tosomedegree,andsystemsthatattempttoanswerquestions,e.g.,O’Connor(1980)andKupiec(1993),thebulk of informationretrieval researchhasfocusedon satisfyingaquery that canbe paraphrasedas: “Find moredocumentslike this one.” This a naturalway to phrasea query, andis perhapsoneof themoreaccessibleto formalization,but it iscertainlynot theonly usefulquestionto allow a userto ask. In thenext sectionI describewhy alternativesto thequery“Find moredocumentslike this one” shouldbe consideredfor full-text informationaccess,andoutlineanalternativeviewpointonhow to retrieveanddisplayinformationfrom full-text documents.

1Smallwindowsof adjacency informationaresometimesusedin Booleansystems,but notin probabilisticor vector-spacemodels.Therecentexperimentsof Keen(1991),Keen(1992)areanexceptionto this.

Page 92: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER3. TERM DISTRIBUTION IN FULL-TEXT INFORMATION ACCESS 38

to oneanothercanplay a role in determiningthe potentialrelevanceof a documentto aquery.

In this chapterI emphasizethe importanceof relative term distribution informationin informationaccessfrom full-text documents.The chapterfirst discussesthe standardinformationretrieval rankingmeasures.It thensuggeststhatbecausethemakeupof longtexts is qualitativelydifferentfrom thatof abstractsandshorttexts,thestandardapproachesarenotnecessarilyappropriatefor longtexts. Sinceacritical aspectof longtext structureisthepatternof termdistribution,I enumeratethepossibledistributionrelationsthatcanholdbetweentwo setsof terms,andmakepredictionsabouttheusefulnessof eachdistributiontype.

I thenpointout thatexisting approachesto informationaccessdo notsuggesta waytousethisdistributionalinformation.Furthermore,standardrankingmechanismsareopaque;usersdo not know what role their query terms played in the ranking of the retrieveddocuments.Thisproblemis exacerbatedwhenretrievingagainstfull-text documents,sinceit is lessclearhow thetermsin thequeryrelateto thecontentsof alongtext thananabstract.

An analogoussituationarisesin theuseof querylanguages:in bothcasesthesituationcanbeimprovedby makinginformationvisible andexplicit to the largestextentpossible(while avoiding complexity). A seriousattitudetowardconsiderationsof clarity andcon-cisenessleadsto aninformationaccessparadigmin which thequeryspecificationandtheresultsof retrieval areintegrated,andtherelationshipsbetweenthequeryandtheretrieveddocumentsaredisplayedclearly.

Towardtheseends,I introducea new displayparadigm,calledTileBars,which allowstheuserto simultaneouslyview therelative lengthof theretrieveddocuments,therelativefrequency of thequeryterms,andtheirdistributionalpropertieswith respecttothedocumentandeachother. I show TileBarsto be a usefulanalyticaltool for determiningdocumentrelevancewhenappliedto samplequeriesfrom theTRECcollection(Harman1993),andIsuggestusingthistool to helpexplainwhy standardinformationretrievalmeasuressucceedor fail for agivenquery.

I alsodiscussgeneralissuesin passageretrieval. No testcollectionsexist for passageretrieval, andin generalthe issuehasnot beenwell-defined.Therefore,I suggestthat theissuesof relative distribution of termsandcontext from which thepassageis extractedbetakeninto accountin thedevelopmentof a testcollectionfor passageretrieval.

3.2 Background: Standard Retrieval Techniques

Thepurposeof informationretrieval is to developtechniquesto provideeffectiveaccessto largecollectionsof objects(containingprimarily text) with thepurposeof satisfyingauser’s statedinformationneed(Croft & Turtle 1992). Themostcommonapproachesforthis purposeareBooleanterm retrieval andsimilarity search. I usethe term “similaritysearch”asanumbrellatermcovering thevectorspacemodel(Salton1988),probabilisticmodels(van Rijsbergen 1979), (Cooperet al. 1994), (Fuhr & Buckley 1993), and any

Page 93: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

37

Chapter 3

Term Distribution in Full-TextInformation Access

3.1 Introduction

As mentionedin Chapter1,mostinformationretrievalmethodsarebettersuitedto titlesandabstractsthanfull text documents.In thischapter, I arguethattheadventof full-lengthtext shouldbeaccompaniedby correspondingnew approachesto informationaccess.Mostimportantly, I emphasizethat even more thanshort text, full text requires context: termcontext is importantin computingretrieval rankingsandin displayingretrievedpassagesanddocuments.

Informationaccessmechanismsshouldnot bethoughtof asretrieval in isolation. Themechanismsfor queryingaswell asdisplayareintimatelytiedwith theretrievalmechanism,whetherthe implementorrecognizesthis or not. Cutting et al. (1990:1)advocatea textaccessparadigmthat “weaves togetherinterface,presentationand searchin a mutuallyreinforcingfashion”;this viewpoint is adoptedhereaswell.

In Hearst& Plaunt(1993),wesuggestthatin theanalysisof full-lengthtextsadistinctionshouldbemadebetweenmaintopicsandsubtopics,andwesuggestthatusersbeallowedtospecifyasearchfor asubtopicwith respect to somemaintopic. To seewhy thisdistinctionmight be useful,considerthe following scenario:A userwould like to find a discussionof funding for cold fusion research. Thereis a long text aboutcold fusion that hasatwo-paragraphdiscussionof fundingtwo-thirdsof theway in. This discussion,becauseitis in thecontext of a documentaboutcold fusion,doesnot mentionthe termcold fusionanywherenearthediscussionof funding. A full-documentretrieval will eitherassignlowrankto thisdocumentbecausefunding-relatedtermsareinfrequentrelativeto thewhole,orelseit will assignhigh rankto any articlesaboutcold fusion. A retrieveagainstindividualparagraphsor segmentswill eitherassignlow rankto thisdocumentbecauseit will seeonlyfundingtermsbut no cold fusiontermsin therelevantsegment,or it will givehigh ranktoany documentsthathavediscussionsof funding.Thusthedistributionof termswith respect

Page 94: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 36

of theTocqueville chapter. In several casesin which thealgorithmseemsto be off, it isthe resultof the fact that theactualtransitiontakesplacemid-paragraph.This is perhapsan argumentfor looseningthe restrictionof TextTiling into non-overlappingtext units,especiallywhenusedfor thepurposesof userinterfacedisplay.8

2.9 Conclusions

Thischapterhasdescribedalgorithmsfor thesegmentationof expositorytextsintomulti-paragraphdiscourseunits that reflectthesubtopicstructureof thetexts. It hasintroducedthe notion of the recognitionof multiple simultaneousthemesas an approximationtoSkorodch’ko’sPiecewiseMonolithic text structuretype.Thealgorithmisfully implementedandtermrepetitionalone,withoutuseof thesauralrelations,knowledgebases,or inferencemechanisms,workswell for many of theexperimentaltexts.

Thechainingalgorithmvariationis adaptedfrom thatof Morris & Hirst (1991),withthefollowingdifferences:(i) thescoresfrom multiplesimultaneouschainsarecombinedattheboundaryof eachsentence(or token-sequence)andusedto determinewheresegmentbreaksshouldbemade,(ii) no thesaurustermsareused,and(iii) nochainreturnsareusedto determineif a chainthatbrokeoff restartedlater. This algorithmseemscomparabletotheblockalgorithm;in bothcases,onealgorithmperformsbetterthantheotheronsomeofthe testtexts. This maywell occurbecausebothalgorithmsmakeuseonly of lexical co-occurrenceinformation,andtheevidencefor boundariesgivenby thiskind of informationis impoverishedcomparedto thephenomenait triesto accountfor. Furthermore,thereaderjudgmentdatabeingusedasa yardstickis not terribly reliablesinceagreementamongthejudges,althoughsignificantat frequency four accordingto themeasureof Passonneau&Litman (1993),is still ratherlow. Apparentlythereis morethanoneway to tile a text, asindicatedby disagreementamongjudgesandalgorithms.Furthermore,in bothversionsofthe algorithm,changesto the parametersof the algorithmperturbthe resultingboundarymarkings.This is anundesirablepropertyandperhapscouldberemediedwith somekindof information-theoreticformulationof theproblem.9

Theseissuesare not too damagingif the resultsare useful. Chapter3 describesanew informationaccessframework which usesthe resultsof theblock tiling algorithmtodeterminewhethertermsin a queryoverlap in a passage.Although no attemptis madethereto show formally that the tiles perform better than randomlydivided texts (sinceplatformsfor evaluationof suchinformationdo not currentlyexist), informal interactionswith that systemindicatethat when tiling is correctthe resultsof the systemare betterthanwhentiling is incorrect.This indirectevidenceimpliesthatthetechnique,despitethedisagreementin judgmentsamongreadersandtheerrorsin the algorithmitself, is betterthanarbitrarilydividedtextsor paragraphsalone.

8I amgratefulto JanPedersenfor thisobservation.9This ideawassuggestedby GraemeHirst andAndreasStolcke.

Page 95: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 35

0 20 40 60 80 100 120 140

1 2 3 4 56 7 8 9 1011 12 13 1415 1617 18 19 20 21 222324 2526 27 28

Figure2.10:Anotherview of theresultsof theblockTextTiling algorithmontheTocquevillechapter. Thebottomrow correspondsto aninterpretationof Tocqueville’s subtopiclabels,the top row correspondsto the output of the algorithm. The internal numbersindicateparagraphgapnumbers,andthex-axiscorrespondsto token-sequencegapnumber.

chapter, weseethattheresultsarebetterthanthesenumbersmight indicate.For example,sincethereis a mentionof prairiesin the subtopiclist, I have chosen

to specifya breakbetweenparagraphs19 and20, despitethe fact that paragraph19 is acontinuationof thediscussionof forestsandhasonly thebarestmentionof prairies. Thealgorithmproducesa healthypeakcorrespondingto the focuson woodlandsandflora ofparagraphs17 - 19. The stretchof paragraphs21 - 25 is brokeninto two peaksby thealgorithm,thefirst correspondingto a discussionof thecharacteristicsof apeople,andthesecondcorrespondingto a comparisonbetweenEuropeansandthesepeople.

Thediscussioncorrespondingto “Valley of theMississippi”wasassignedparagraphs7- 9, althoughmostof thediscussion,with theexceptionof thefirst sentenceof paragraph7, refers to the river more than to the valley. Correspondingly, the plot in Figure 2.9risesmidwaythroughthediscussionof paragraph7 andtheprogramhasto makea choicebetweenmarkingthe boundaryfollowing paragraph6 or paragraph7. Sinceneitheronecorrespondsdirectly to thevalley in theplot, thedecisiongoesto gapwith thesharperriseononeside.

Anotherexampleof thecontentof theparagraphsnotcorrespondingto their form, I’ vemarkedparagraphs10and11ascorrespondingto “Tracesfoundthere[in theValley of theMississippi]of therevolutionsof theglobe”. However, thediscussionof therivercontinuesaboutonethird of thewaythroughparagraph10,afterwhichthediscussionof theprimevaloceanstartsup. Thispatternis reflectedin theplot of Figure2.9.

Finally, thealgorithmdoesnotmarka boundarybetweenparagraphs11 and12. Thereis a dip in theplot following paragraph12 (which is off by onesentencefrom thedesiredboundary, after11),but therestrictiononallowing verycloseneighborspreventsthis frombeingmarked,dueto paragraph12’s proximity to 13.

Overall, then,thealgorithmdoesquitewell at identifyingthemainsubtopicboundaries

Page 96: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 34

01-06 North Americadivided into two vastregions,onein-clining towardsthePole,theothertowardstheEquator

07-09 Valley of theMississippi10-11 Tracesfoundthereof therevolutionsof theglobe12-13 Shoreof the Atlantic Ocean,on which the English

colonieswerefounded14-16 Dif ferentaspectsof Northandof SouthAmericaat the

timeof theirdiscovery17-18 Forestsof NorthAmerica19-19 Prairies20-20 [The tribes’] outwardappearance,customs,and lan-

guages21-25 Wanderingtribesof natives26-28 Tracesof anunknown people.

Figure2.8: Paragraph-level breakdown of thesubtopicstructureof Tocqueville Ch. 1 Vol.1, repeatedherefor convenientreference.

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0 20 40 60 80 100 120 140 160

12 3 4 56 7 8 9 1011 12 13 1415 16 17 18 19 20 21 2223 24 2526 27 28 29

Figure2.9: Resultsof theblocksimilarity algorithmonChapter1,Volume1 of Democracyin America. Internalnumbersindicateparagraphgapnumbers(e.g.,thenumber’10’ indi-catesthattheboundaryfalls betweenparagraphs9 and10),x-axisindicatestoken-sequencegapnumber, y-axisindicatessimilaritybetweenblockscenteredatthecorrespondingtoken-sequencegap.Vertical linesindicateboundarieschosenby thealgorithm.

Page 97: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 33

theaverageof thescoresof theserunswasfound.The algorithmsareevaluatedaccordingto how many true boundariesthey selectout

of the total selected(precision)andhow many trueboundariesarefoundout of the totalpossible(recall)(Salton1988).Therecallmeasureimplicitly signalsthenumberof missedboundaries(falsenegatives,or deletionerrors);thetablealsoindicatesthenumberof falsepositives,or insertionerrors,explicitly. The precisionandrecall for the averageof theresultsappearin Table2.1(resultsat33%arealsoshown for comparisonpurposes).

I alsocomparedthe coreTextTiling algorithmagainstthe chainingalgorithmvariantdiscussedin Section2.6.4.Thebestvariationon thechainingalgorithmallowsgapsof upto six token-sequencesbeforethechainis consideredto bebroken.For bothalgorithms,wis 20,andmorphologicalanalysisanda stoplistareapplied,asdescribedin Section2.6.1.

Table2.1showsthattheblockingalgorithmissandwichedbetweentheupperandlowerbounds. The block similarity algorithmseemsto work slightly betterthanthe chainingalgorithm,althoughthedifferencemaynot prove significantover the long run. Table2.2showssomeof theseresultsin moredetail.

In many casesthe algorithmsarealmostcorrectbut off by oneparagraph,especiallyin the texts that the algorithmperformspoorly on. Whenthe block similarity algorithmis allowedto beoff by oneparagraph,thereis dramaticimprovementin thescoresfor thetexts that lower partof Table2.2, yielding anoverall precisionof 83%andrecallof 78%.As in Figure2.7, it is oftenthecasethatwherethealgorithmis incorrect,e.g.,paragraphgap11, theoverall blockingis verycloseto whatthejudgesintended.

2.8 An Extended Example: The Tocqueville Chapter

Thissectionillustratestheresultsof TextTiling onChapter1,Volume1 of Tocqueville’sDemocracy in America discussedin Section2.3.3.As mentionedthere,thistext is interest-ing becausetheauthorhasprovideda subtopic-likestructurein thechapterpreamble.Thetext of the chapter, labeledwith paragraphnumbersandsectioninginformationfrom thetiling algorithm,appearsin AppendixA. Theparagraph-level breakdown of thesubtopicdescriptionsis reproducedin Figure2.8for convenientreferenceandFigure2.9shows thecorrespondingplot producedby theTextTiling algorithm.Notethatthelasttwoparagraphsin thetext aresummaryin nature,andarenot referredto in thesubtopiclist.

Comparingthe resultsof tiling againstthesubtopiclist of Figure2.8, we seethat thealgorithmisgenerallysuccessful.However, it doesmakesomeoff-by-oneerrorsandinsertsat leastoneboundarythat is not specifiedby the subtopiclist. Figure2.10comparestheresultsof thealgorithmto thatspecifiedin Tocqueville’s subtopiclist accordingto token-sequencegapnumber(thefinal paragraphsarenot shown sincethey arenot referredto inTocqueville’ssubtopiclist). Usingtheprecision/recallmeasuresof theprevioussectionweseethataccordingto theseboundariesthealgorithmcorrectlychooses6/9 of thepossibleboundaries(recall= 67%),andof theboundariesit chooses,6/9werealsochosenaccordingto thesubtopicstructure(precision= 67%). Looking at Figure2.10andat the text of the

Page 98: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 32

Precision Recallavg sd avg sd

Baseline33% .44 .08 .37 .04Baseline41% .43 .08 .42 .03Chains .64 .17 .58 .17Blocks .66 .18 .61 .13Judges .81 .06 .71 .06

Table2.1: PrecisionandRecallvaluesfor 13 testtexts.

Total Baseline41%(avg) Blocks Chains Judges(avg)Text Possible Prec Rec C I Prec Rec C I Prec Rec C I Prec Rec C I

1 9 .44 .44 4 5 1.0 .78 7 0 1.0 .78 7 0 .78 .78 7 22 9 .50 .44 4 4 .88 .78 7 1 .75 .33 3 1 .88 .78 7 13 9 .40 .44 4 6 .78 .78 7 2 .56 .56 5 4 .75 .67 6 24 12 .63 .42 5 3 .86 .50 6 1 .56 .42 5 4 .91 .83 10 15 8 .43 .38 3 4 .70 .75 6 2 .86 .75 6 1 .86 .75 6 16 8 .40 .38 3 9 .60 .75 6 3 .42 .63 5 8 .75 .75 6 27 9 .36 .44 4 7 .60 .56 5 3 .40 .44 4 6 .75 .67 6 28 8 .43 .38 3 4 .50 .63 5 4 .67 .75 6 3 .86 .75 6 19 9 .36 .44 4 7 .50 .44 4 3 .60 .33 3 2 .75 .67 6 210 8 .50 .38 3 3 .50 .50 4 3 .63 .63 5 3 .86 .75 6 111 9 .36 .44 4 7 .50 .44 4 4 .71 .56 5 2 .75 .67 6 212 9 .44 .44 4 5 .50 .56 5 5 .54 .78 7 6 .86 .67 6 113 10 .36 .40 4 7 .30 .50 5 9 .60 .60 6 4 .78 .70 7 2

Table 2.2: Scoresby text, showing precisionand recall. (C) indicatesthe numberofcorrectlyplacedboundaries,(I) indicatesthenumberof insertedboundaries.Thenumberof deletedboundariescanbedeterminedby subtracting(C) from Total Possible.

Thefinal paragraphis asummaryof theentiretext; thealgorithmrecognizesthechangein terminologyfromtheprecedingparagraphsandmarksaboundary;onlytwoof thereaderschoseto differentiatethesummary;for this reasonthealgorithmis judgedto have madean error even thoughthis sectioningdecisionis reasonable.This illustratesthe inherentfallibility of testingagainstreaderjudgments,althoughin part this is becausethe judgesweregivenlooseconstraints.

Following theadviceof Galeet al. (1992a),I comparethealgorithmagainstbothupperandlowerbounds.Theupperboundin thiscaseis theaveragesof thereaderjudgmentdata.Thelowerboundisabaselinealgorithmthatisasimple,reasonableapproachto theproblemthatcanbeautomated.In thetestdata,boundariesareplacedin about41%of theparagraphgaps.A simplewayto segmentthetexts is to placeboundariesrandomlyin thedocument,constrainingthe numberof boundariesto equalthat of the averagenumberof paragraphgapsassignedby judges.A programwaswritten thatplacesa boundaryrandomlyat eachpotentialgap41%of thetime,wasrun a largenumberof times(10,000)for eachtext, and

Page 99: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 31

-1

0

1

2

3

4

5

6

7

0 10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(a)

0

0.1

0.2

0.3

0.4

0.5

0.6

0 10 20 30 40 50 60 70 80 90 100

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

(b)

Figure2.7: (a)Judgmentsof sevenreadersontheStargazer text. Internalnumbersindicatelocationof gapsbetweenparagraphs;x-axis indicatestoken-sequencegapnumber, y-axisindicatesjudgenumber, a breakin a horizontalline indicatesa judge-specifiedsegmentbreak.(b) Resultsof theblocksimilarity algorithmontheStargazer text. Internalnumbersindicateparagraphnumbers,x-axisindicatestoken-sequencegapnumber, y-axisindicatessimilarity betweenblocks centeredat the correspondingtoken-sequencegap. Verticallines indicateboundarieschosenby thealgorithm;for example,the leftmostvertical linerepresentsa boundaryafterparagraph3. Notehow thesealign with theboundarygapsof(a).

Page 100: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 30

whichthetopicchanged;they werenotgivenmoreexplicit instructionsaboutthegranularityof thesegmentation.

Figure2.7(a)showstheboundariesmarkedby sevenjudgesontheStargazerstext. Thisformathelpsilluminate thegeneraltrendsmadeby thejudgesandalsohelpsshow whereandhow oftenthey disagree.For instance,all but onejudgemarkeda boundarybetweenparagraphs2 and3. Thedissentingjudgedid marka boundaryafter3, asdid two of theconcurringjudges. Thenext threemajorboundariesoccurafter paragraphs5, 9, 12, and13. Thereis somecontentionin thelaterparagraphs;threereadersmarkedboth16and18,two marked18alone,andtwo marked17alone.Theoutlinein Section2.1givesanideaofwhateachsegmentis about.

Passonneau& Litman (1993) discussat lengthconsiderationsaboutevaluatingseg-mentationalgorithmsaccordingto readerjudgmentinformation. As Figure2.7(b)shows,agreementamongjudgesis imperfect,but trendscanbediscerned.In Passonneau& Lit-man’s (1993) data, if 4 or more out of 7 judgesmark a boundary, the segmentationisfound to be significantusinga variationof the Q-test(Cochran1950). My datashowedsimilar results. However, it isn’t clearhow useful this significanceinformationis, sincea simplemajority doesnot provide overwhelmingproof abouttheobjective reality of thesubtopicbreak.Sincereadersoftendisagreeaboutwhereto draw a boundarymarkingforatopicshift, onecanonly usethegeneraltrendsasabasisfrom whichto comparedifferentalgorithms.Sincethegoalsof TextTiling arebetterservedby algorithmsthatproducemoreratherthanfewerboundaries,I setthecutoff for “true” boundariesto threeratherthanfourjudgesperparagraph.6 Theremaininggapsareconsiderednonboundaries.

2.7.2 Results

Figure2.7(b)showsaplot of theresultsof applyingtheblockcomparisonalgorithmtotheStargazer text. Whenthelowermostportionof avalley isnotlocatedataparagraphgap,thejudgmentismovedto thenearestparagraphgap.7 For themostpart,theregionsof strongsimilarity correspondto theregionsof strongagreementamongthereaders.(Theseresultswerefifth highestout of the 13 testtexts.) Notehowever, that thesimilarity informationaroundparagraph12 is weak. This paragraphactsasa summaryparagraph,summarizingthe contentsof the previous threeandrevisiting muchof the terminologythat occurredin themall in one location (in the spirit of a Grosz& Sidner(1986) “pop” operation).Thusit displayslow similarity both to itself andto its neighbors.This is anexampleof abreakdown causedby theassumptionaboutthelinearsequenceof thesubtopicdiscussions.It is possiblethatanadditionalpassthroughthetext couldbeusedto find structureof thiskind.

6Paragraphsof threeor fewer sentenceswerecombinedwith their neighborif thatneighborwasdeemedto follow at “true” boundary, asin paragraphs2 and3 of theStargazers text.

7Theneedfor thisadjustmentmightbeexplainedin partby Stark(1988)whoshowsthatreadersdisagreemeasurablyaboutwhereto placeparagraphboundarieswhen presentedwith texts with thoseboundariesremoved.

Page 101: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 29

x−−x−−−−−xx−−−−−x−−−−−xx−−x−−xx−−−−−−−−xx−−x−−−−−x x−−x−−x−−x x−−x−−−−−x x−−x−−−−−x x−−x−−x x−−x x−−x−−x−−x x

ABCDEFIJKLMN

1 2 3 4 5 6 7 8

Figure2.6: Accumulatingcountsof chainsof terms:letterssignify lexical items,numberssignify token-sequencenumbers,‘x’ indicatesthat thetermoccursin thetoken-sequence,‘-’ indicatescontinuationof achain,andarrowscutthroughtheactivechainsthatcontributeto thecumulativecountfor token-sequencegaps2,4,and6. In thediagramthereisevidencefor abreakbetweentoken-sequences4 and5 becausetherearefew active chainsthere.

2.7 Evaluation

Oneway to evaluatethesesegmentationalgorithmsis to compareagainstjudgmentsmadeby humanreaders,anotheris to seehow well the resultsimprove a computationaltask,anda third possibleevaluationmeasureis to comparethe algorithmsagainsttextspre-markedby authors. This sectioncomparesthe algorithmagainstreaderjudgments,sinceauthormarkupsarefallible andareusuallyappliedto text typesthat this algorithmis not designedfor, andChapter3 shows how to usetiles in a task(althoughit doesnotformally prove that the resultsof the algorithmimprove the taskmore thansomeotheralgorithmwith similargoalswould).

2.7.1 Reader Judgments

Judgmentswereobtainedfromsevenreadersfor eachof thirteenmagazinearticleswhichsatisfiedthe lengthcriteria (between1800and2500words)5 andwhich containedlittlestructuraldemarkation.Thejudgeswereaskedsimplyto marktheparagraphboundariesat

5Onelongertext of 2932wordswasusedsincereaderjudgmentshadbeenobtainedfor it from anearlierexperiment.Note that this representsanamountof testdataon theorderof thatusedin theexperimentsofPassonneau& Litman (1993). Judgesweretechnicalresearchers.Two texts hadthreeor four shortheaderswhichwe removed.

Page 102: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 28

� Usingadifferentsimilarity measure,suchasonethatweightsthetermsaccordingtoa gaussiandistributioncenteredateachtoken-sequencegapnumber.� Treatingtheplot asa probabilistictime seriesanddetectedtheboundariesbasedonthelikelihoodof a transitionfrom nontopicto topic.4

Earlier work (Hearst1993) incorporatedthesauralinformation into the algorithms;surprisinglythe latestexperimentsfind that this informationdegradesthe performance.This could very well be due to problemswith the thesaurusand assignmentalgorithmused(a variation on that describedin Chapter4. A simple algorithm that just positsrelationsamongtermsthatarea smalldistanceapartaccordingto WordNet(Miller et al.1990)orRoget’s1911thesaurus(fromProjectGutenberg),modeledafterMorrisandHirst’sheuristics,mightworkbetter. ThereforeI donotfeeltheissueisclosed,andinsteadconsidersuccessfulgroupingof relatedwordsasfuturework. AsanotherpossiblealternativeKozima(1993)hassuggestedusinga(computationallyexpensive)semanticsimilaritymetrictofindsimilarity amongtermswithin a smallwindow of text (5 to 7 words). This work doesnotincorporatethenotionof multiplesimultaneousthemesbut insteadjust triesto find breaksin semanticsimilarityamongasmallnumberof terms.A goodstrategy maybeto substitutethis kind of similarity informationfor term repetitionin algorithmslike thosedescribedhere.Anotherpossibilitywould beto usesemanticsimilarity informationascomputedinSchutze(1993b),Resnik(1993),or Daganet al. (1993).

The useof discoursecuesfor detectionof segmentboundariesand other discoursepurposeshasbeenextensively researched,althoughpredominantlyon spokentext (seeHirschberg & Litman (1993)for a summaryof six researchgroups’treatmentsof 64 cuewords). It is possiblethat incorporationof suchinformationmayhelp improve thecaseswherethealgorithmis off by oneparagraph,asmight referenceresolutionor anaccountof tenseandaspect. Informal experimentswith versionsof all of the other itemsdo notseemto producesignificantlybetterresultsthanthemoststripped-down versionof thecorealgorithm.

Anotherwayto alterthealgorithmis to changethecomparisonstrategy. It is possibletomodify theapproachof Morris & Hirst (1991),discussedabove, to takemultiple simulta-neousthemesinto account,andto applyit to themulti-paragraphsegmentationproblemasopposedto theattentional/intentionalsegmentrecognitionproblem.Ratherthanassumingthateachchaincorrespondsdirectlyto onesegment,andviceversa,analgorithmcancreateacollectionof activechains,andthenplaceboundariesat thepointsin thetext wheremorechainsareinactivethanactive(seeFigure2.6). Thisapproachdoesnotmakeuseof explicitchainreturns;they areaccountedfor implicitly instead.A versionof Youmans’algorithm(Youmans1991),alsodiscussedabove,andmodifiedto applyto largersegmentationunits,might alsoprove successful,althoughpreliminaryexperimentsdid not show it to performsignificantlybetter.

4I amgratefulto IsabelleGuyonfor herhelpwith this suggestion.

Page 103: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 27

relative heightof thepeakto theleft. (A gapoccurringat a peakwill have a scoreof zerosinceneitherof its neighborsis higherthanit.)

Thesenew scores,calleddepthscores,correspondingto how sharpa changeoccurson bothsidesof the token-sequencegap,aresorted. Segmentboundariesareassignedtothe token-sequencegapswith the largestcorrespondingscores,adjustedasnecessarytocorrespondto trueparagraphbreaks.A proviso checkis donethatpreventsassignmentofvery closeadjacentsegmentboundaries.Currentlytheremustbeat leastthreeinterveningtoken-sequencesbetweenboundaries.This helpscontrol for thefact thatmany texts havespuriousheaderinformationandsingle-sentenceparagraphs.

A consequenceof theboundarydeterminationstrategy is thatatoken-sequencegapthatliesbetweentwo sharplyrisingpeakswill receiveahigherscorethanatoken-sequencegapin themiddleof a longvalley with low hills. Thusa gapwith ahighpeakon only onesidecanreceive agood-sizedscore.A potentialproblemoccursif thereis a riseon onesideofagap,andadeclineontheother. However, thegapat thebottomof thedeclinewill receiveaneven largerscorethanthefirst gapandsowill overrulethefirst gap’s score,if the twogapsareclosetogether. On theotherhandif thetwo gapsarefar apart,thereis probablyacall for theintermediategapto serve asaboundary.

Anotherissueconcernsthenumberof segmentsto beassignedto a document.Everyparagraphis a potentialsegmentboundary. Any attemptto makean absolutecutoff isproblematicsincetherewouldneedto besomecorrespondenceto thedocumentstyleandlength.A cutoff basedon aparticularvalley depthis similarly problematic.

I havedevisedamethodfor determininghow many boundariesto assignthatscaleswiththesizeof thedocumentandis sensitive to thepatternsof similarity scoresthatit produces.Thecutoff is a functionof theaverageandstandarddeviationsof thedepthscoresfor thetext underanalysis.Currentlyaboundaryisdrawn only if thedepthscoreexceeds�h�]��� 2.

2.6.4 Embellishments

Thereareseveralwaysto modify thealgorithmin orderto attemptto improveits results.Someof theseare:

� Varyingthespecificsof tokenization,e.g.,increasingor reducingthestoplistor thedegreemorphologicalanalysis(e.g.,derivationalvs. inflectionalvs. noanalysis)� Usingthesauralrelationsin additionto termrepetitionto makebetterestimatesaboutthecohesivenessof thediscussion.� Usinglocalizeddiscoursecueinformationto helpbetterdetermineexactlocationsofboundaries.� Weightingtermsaccordingto theirprior probability, how frequentthey arein thetextunderanalysis,or someotherproperty.

Page 104: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 26

token-sequence� �X� through� areto thetoken-sequencesfrom � � 1 to �¡� � � 1. Notethatthis moving window approachmeansthateachtoken-sequenceappearsin �_¢ 2 similaritycomputations.

Similarity betweenblocksis calculatedby a cosinemeasure:giventwo text blocks £ 1

and £ 2, eachwith � token-sequences,

� �z¤¦¥g£ 1 § £ 2 ¨1© ª¬«C­ «¯® ° 1 ­ «<® ° 2±ª « ­ 2«<® ° 1 ª³²«µ´ 1 ­ 2«¯® ° 2

where¶ rangesoverall thetermsthathavebeenregisteredduringthetokenizationstep,and

­ «<® ° 1 is theweightassignedto term ¶ in block £ 1. In thecoreversionof thealgorithm,theweightson the termsaresimply their frequency within the block. Thusif the similarityscorebetweentwo blocks is high, then the blocks have many termsin common. Thisformulayieldsascorebetween0 and1, inclusive.

Thesescorescanbeplotted,token-sequencenumberagainstsimilarity score.However,sincesimilarity is measuredbetweenblocks £ 1 and £ 2, where £ 1 spanstoken-sequences� �\� through � and £ 2 spans��� 1 to �>� � � 1, themeasurement’s · -axiscoordinatefallsbetweentoken-sequences¸ and ¸5¹ 1. Therefore,the º -axiscorrespondsto token-sequencegap number .

2.6.3 Boundary Identification

Boundaryidentificationtakesplacein several steps. First, the plot is smoothedwithaveragesmoothing;thatis,

for eachtoken-sequencegap » andanevenwindow size ¼\¹ 1find thescoresof the ¼[½ 2 gapsto theleft of »find thescoresof the ¼[½ 2 gapsto theright of »find thescoreat »taketheaverageof thesescoresandassignit to »¡¾

repeatthis procedure¿ times

In practice,for mostof theexaminedtexts,oneroundof averagesmoothingwith awindowsizeof threeworksbest.

Boundariesaredeterminedby changesin thesequenceof similarity scores.Thetoken-sequencegapnumbersareorderedaccordingto how steeplytheslopesof the plot aretoeithersideof thetoken-sequencegap,ratherthanby their absolutesimilarity score.For agiventoken-sequencegap ¸ , thealgorithmlooksat thescoresof thetoken-sequencegapstotheleft of ¸ aslongaretheirvaluesareincreasing.Whenthevaluesto theleft peakout,thedifferencebetweenthescoreat thepeakandthescoreat ¸ is recorded.Thesameproceduretakesplacewith thetoken-sequencegapsto theright of ¸ ; theirscoresareexaminedaslongasthey continueto rise. The relative heightof the peakto the right of ¸ is addedto the

Page 105: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 25

it looksfor achangein theoverall patternsamongthetermsin theblocksbeingcompared.Thecorealgorithmhasthreemainparts:

1. Tokenization

2. Similarity Determination

3. BoundaryIdentification

Eachis describedin detailbelow.

2.6.1 Tokenization

Tokenizationrefersto the division of the input text into individual lexical units, andis sensitive to the format of the input text. For example, if the documenthasmarkupinformation, the headerandotherauxiliary information is skippeduntil the body of thetext is located.Tokensthatappearin thebodyof the text areconvertedto all lower-casecharactersandcheckedagainsta “stoplist” of 898words,themostfrequenttermsin alargetext collection.If thetokenisastopwordthenit isnotpassedonto thenext step.Otherwise,thetokenis reducedto its rootby amorphologicalanalysisfunctionwhichusesWordNet’snounandverbtermlists andexceptionlists, convertingregularly andirregularly inflectednounsandverbsto their roots.

The text is subdivided into psuedosentencesof a pre-definedsizew (a parameterofthe algorithm)ratherthanactualsyntactically-determinedsentences,thuscircumventingnormalizationproblems. For the purposesof the restof the discussionthesegroupingsof tokenswill be referredto as token-sequences. In practice,settingw to 20 tokenspertoken-sequenceworksbestfor many texts. Themorphologically-analyzedtokenisstoredinatablealongwith arecordof thetoken-sequencenumberit occurredin, andhow frequentlyit appearedin the token-sequence.A recordis alsokeptof the locationsof theparagraphbreakswithin thetext.

2.6.2 Similarity Determination

The next stepis the comparisonof adjacentpairs of blocks of token-sequencesforoverall lexical similarity. (Seethesketchin Figure2.5.) Anotherimportantparameterforthe algorithmis the blocksize: the numberof token-sequencesthat aregroupedtogetherinto a block to be comparedagainstan adjacentgroupof token-sequences.This value,labeledÀ , variesslightly from text to text; asa heuristicit is theaverageparagraphlength(in token-sequences).In practice,a valueof À\Á 6 works well for many texts. Actualparagraphsarenotusedbecausetheir lengthscanbehighly irregular, leadingto unbalancedcomparisons.

Similarity valuesarecomputedfor everytoken-sequencegapnumber;thatis, ascoreisassignedto token-sequencegap ¸ correspondingto how similar thetoken-sequencesfrom

Page 106: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 24

rather than following single threadsof discussionalone. Main topics are themesthatcontinueon throughouttheebbandflow of theinteractingsubtopics.

Many researchers(e.g.,Halliday& Hasan(1976),Tannen(1989),Walker(1991))havenotedthat term repetition is a strongcohesionindicator. In this work, term repetitionalone,whenusedin termsof multiplesimultaneousthreadsof information,is averyusefulindicatorof subtopicstructure.This sectiondescribesthecorealgorithmfor discoveringsubtopicstructureusingtermrepetitionasa lexical cohesionindicator.

The corealgorithmcompares,for a given window size,eachpair of adjacentblocksof text accordingto how similar they arelexically (seeFigure2.5). This methodassumesthat themoresimilar two blocksof text are,themorelikely it is that thecurrentsubtopiccontinues,and,conversely, if twoadjacentblocksof text aredissimilar, thecurrentsubtopicgiveswayto anew one.

A

B

C

D

E

A

C

E

F

G

B

C

F

H

I

A

D

F

I

B

I

J

K

L

M

K

L

M

N

O

J

M

N

P

Q

B

F

J

K

M

1 2 3 4 5 6 7 8

Figure2.5: Illustrationof thecorelexical cohesioncomparisonalgorithm. Letterssignifylexical items,numberssignify sentencenumbers.In thediagram,similarity comparisonisdoneonadjacentblockswith ablocksizeof 2. Arrowsindicatewhichblocksarecomparedto yield scoresfor sentencegaps2, 4, and 6. Blocks are shifted by one sentenceforsimilarity measurementsfor gaps3, 5, and7.

The rationalebehind this strategy is that it is an attemptto detectwhen a dense,interrelateddiscussionendsandanew onebegins,in thespirit of Skorodch’ko’sPiecewiseMonolithic discoursetopology. Theappearanceof a setof new termsindicatestheonsetof a new topic,asin Youmans’approach,but therepetitionof existing termsalsoprovideshelpful evidence– that is, evidencethat thecurrentdiscussionis still ongoing. However,thereis noexplicit requirementabouthow closetogetherindividualtermsmustbe. In otherwords,thealgorithmdoesnotneedto specifyhow far apartindividual termscanbe;rather

Page 107: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 23

shallower valleys thanwith shorterintervals. Strongly influencedby linguistic notions,Youmanstriesto casttheresultingpeaksin termsof coordinationandsubordinationrela-tions,but in thediscussionadmitsthisdoesnotseemlike anappropriateuseof theresults.Youmansdoesnot presentan evaluationof how often the algorithm’s valleys actuallycorrespondto “informationunits”, andleaveshow to usetheresultsto futurework.

2.6 The TextTiling Algorithm

TheTextTiling algorithmcanbedescribedin termsof acoreandacollectionof optionalembellishments.In practicein experimentssofar noneof theembellishmentssignificantlyimprovetheperformanceof thecorealgorithm;thiswill bediscussedin moredetailbelow.I groupthecorealgorithmandits variantstogetherundertherubric of TextTiling.

Many researchershave studiedthe patternsof occurrenceof characters,setting,time,andtheotherthematicfactors,usuallyin thecontext of narrative. In contrast,TextTilingattemptsto determinewherearelatively largesetof activethemeschangessimultaneously,regardlessof thetypeof thematicfactor. This is especiallyimportantin expositorytext inwhich thesubjectmattertendsto structurethediscoursemoreso thancharacters,setting,etc.2 For example,in theStargazers text, a discussionof continentalmovement,shorelineacreage,andhabitabilitygivesway to a discussionof binaryandunarystarsystems.Thisis notsomuchachangein settingor characterasa changein subjectmatter.

This theoreticalstancebearsacloseresemblanceto Chafe’snotionof TheFlow Modelof discourse(Chafe1979),in descriptionof whichhewrites(pp179-180):

Our data Â:Â:Â suggestthat asa speakermovesfrom focusto focus(or fromthoughtto thought)therearecertainpointsat which theremaybe a moreorlessradicalchangein space,time,characterconfiguration,eventstructure,or,even,world. Â:Â:Â At pointswhereall of thesechangein a maximalway, anepisodeboundaryis stronglypresent. But often oneor anotherwill changeconsiderablywhile otherswill changelessradically, andall kinds of variedinteractionsbetweentheseseveralfactorsarepossible.3

AlthoughChafe’swork concernsnarrativetext, thesamekind of observationappliestoexpositorytext. TheTextTiling algorithmsaredesignedto recognizeepisodeboundariesby determiningwherethethematiccomponentslistedby Chafechangein a maximalway.

TheTextTiling algorithmsmakeuseof lexical cohesionrelationsin a mannersimilarto thatsuggestedby Skorochod’ko(1972)to recognizewherethesubtopicchangesoccur.This differs from thework of Morris & Hirst (1991)in severalways,themostimportantof which is thatthealgorithmemphasizestheinteractionof multiplesimultaneousthemes,

2cf. Sibun (1992)for a discussionof how theform of people’sdescriptionsoftenmirror theform of whatthey aredescribing.

3Interestingly, Chafe arrived at the Flow Model after working extensively with, and then becomingdissatisfiedwith, a Longacre-stylehierarchicalmodelof paragraphstructure(Longacre1979).

Page 108: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 22

ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ

Ä ÅÆÇ ÅÆÈÅÉÊËÌÊÌËÍÊÍËÎÊÎËÏÊÏËËÊËËÐÊÐËÑÊÑ ËÒÊÒËÓÊÓË

ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ

ÌÏ

Ô ÕÖ×ÌÌÌÌÌÌ

ÌÌÌÌÌÌÌÌ

ÒØÈÙ ÅÆÇÙ ØÇ

ÌÌ

ÌÌ

ÌÌÌÌ

ËØÚÛÈÅÌÌÌÌ

Ì

ÍËØÇ ÛÖÌ

Ì

ÌÌÍÍÌÌÌÌÌÍÌÌÌÌÌÌÌÌÌÌ

ËÜÙ ÆÛÖÝ

ÌÌÌÌ

Ì

ÏÇ ÖÙ ÆÛÖÝ

ÌÌÌ

Ì

Ò ÛØÇ ÖÕÆÕ×ÅÖÌ

Ì

ÌÌÌÌÌÌ

ÑÕÖÜÙÇÌ

Ì

ÌÍÌÌ

ÐÚÞßß

ÍÌÌ

ÌÌ

ÌÐÚß ÛÆÅÇÌÌÌÌ

ÌÌ

ÍÌÌÌÌÌÌ

ÌÌ

ÑàÛß ÛáÝÌ

Ì

ÌÌÌÌÌ

Ïß ÞÆÛÖÌÌÌÌ

ÌÓ

ßÙ Ô ÅÌÌÌ

ÌÌÌÌÌÌÌÌ

ÌÌÌÌÌÌÌÌ

ÍÑ ×ÕÕÆÌÎÌÌÌÌÌÌÍÍÍÌÍÌÍÌÌÌÌ

Î×ÕâÅ

ÌÌÌ

Ñ ÈÕÆÇÙ ÆÅÆÇ

ÍÌÌÍÌ

ÎØã ÕÖÅßÙ ÆÅ

ÌÍ

Ð

ÇÙ ×Å

Ì

ÌÌÌÌ

Ì

ÎäÛÇ ÅÖ

ÌÌ

Ì

Ð

ØÛÝ

ÌÌÌÌÌ

Ì

ÎØÚÅÈÙ ÅØ

ÌÌÌ

ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ

Ä ÅÆÇ ÅÆÈÅÉÊËÌÊÌËÍÊÍËÎÊÎËÏÊÏËËÊËËÐÊÐËÑÊÑ ËÒÊÒËÓÊÓË

ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ

Figure 2.4: Distribution of selectedtermsfrom the Stargazer text, with a single digitfrequency persentencenumber(blanksindicatea frequency of zero).

Page 109: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 21

number, of selectedtermsfrom theStargazers text. Thefirst two termshave fairly uniformdistributionandsoshouldnotbeexpectedto providemuchinformationaboutthedivisionsof the discussion.The next two termsco-occura few timesat the beginning of the text(althoughstaralsooccursquitefrequentlyat theendof thetext aswell), while termsbinarythroughplanethave considerableoverlapfrom sentences58 to 78. Thereis a somewhatwell-demarkedclusterof termsbetweensentences35and50,correspondingto thegroupingtogetherof paragraphs10,11,and12 by humanjudgeswhohave readthetext.

Fromthediagramit is evident thatsimply looking for chainsof repeatedtermsis notsufficient for determiningsubtopicbreaks.Evencombiningtermsthatarecloselyrelatedsemanticallyinto singlechainsis insufficient,sinceoftenseveraldifferentthemesareactivein thesamesegment.For example,sentences37 - 51 containdenseinteractionamongthetermsmove, continent,shoreline,time, species,andlife, andall but the latteroccuronlyin this region. Few thesauriwould groupall of thesetermstogether. However, it is thecasethattheinterlinkedtermsof sentences57- 71(space,star, binary, trinary, astronomer,orbit) arecloselyrelatedsemantically, assumingtheappropriatesensesof the termshavebeendetermined.

Onewayto getaroundthis difficulty is to extendtheMorris algorithmto creategraphsthatplot thenumberof active chainsagainstparagraphor sentencenumbers.This optionis discussedin moredetailin Section2.7.

Youmans

Anotherrecentanalytictechniquethatmakesuseof lexical informationis describedinYoumans(1991).Youmansintroducesavariantontype/tokencurves,calledtheVocabulary-ManagementProfile, that keepstrack of how many first-time usesof termsoccurat themidpoint of each35-wordwindow in a text. Youmans’goal is to studythe distributionof vocabulary in discourseratherthanto segmentit alongtopical lines,but thepeaksandvalleys in the resultingplots “correlatecloselyto constituentboundariesandinformationflow” (althoughYoumanspoints out that they are correlated,but not directly related).Youmansbeginswith thehypothesisthatnew topicswill bemetwith a sharpburstof newtermuses,but this kind of activity is not visible on a typical type/tokenratio plot. Wheninsteadof simpletype/tokenratiosthenumberof new wordswithin an interval of wordsareplotted,thechangesbecomemorevisible.

Youmansdiscovers,uponexaminingmany Englishnarratives,essays,andtranscripts,thatnew vocabulary is introducedlessoftenin thefirst partthanthesecondpartof clausesandsentences,andthatsharpupturnsafter deepvalleys in the curve signalshifts to newsubjectsin essaysandnew episodesin stories.Theanalysisfocuseson morefine-graineddivisionsthan thoseof interestfor TextTiling, subdividing eachparagraphinto multipletopic units. Youmansfindsthatfor certainkindsof texts, theprofile lagsbehindtheonsetof paragraphsfor a sentenceor two, sincemuchexpositorywriting includesrepetitionofinformationfrom oneparagraphinto thenext,

Youmansalsofinds that longer intervals yield smootherplots, with lower peaksand

Page 110: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 20

tionalstructure,andstatesthatthelexical chainsgeneratedby heralgorithmprovideagoodindicationof the segmentboundariesthatGroszandSidner’s theoryassumes.In Morris(1988)andMorris & Hirst (1991),tablesarepresentedshowing thesentencesspannedbythelexical chainsandby thecorrespondingsegmentsof theattentional/intentionalstructure(derivedby hand).Figure2.3showsagraphicaldepictionof thesameinformationfor oneof the test texts. It shows how differentchainscover the structureat different levels ofgranularity, aswell aswhich portionsof thestructurearenotaccountedfor.

Several aspectsof the algorithm areproblematic,especiallywhen appliedto longertexts. First, the algorithmwasexecutedby handbecausethe thesaurusis not generallyavailableonline. However, ProjectGutenberg hasdonatedanonlinecopyof Roget’s 1911thesauruswhich, althoughsmallerandlessstructuredthanthe thesaurususedby Morris,canbeusedfor animplementationof thealgorithm. Asidefrom thefact thatusingsuchathesauruslowersthequality of theconnectionsfoundamongterms,animplementationoftheMorris algorithmusingfound thatoften thechoiceof which thesauralrelationto usewasnot unambiguous.

Second,althoughambiguouschainlinks wererarein Morris’s texts, thetexts analyzedherehadmany ambiguouslinks, even when connectionswere restrictedto beingmadebetweentermsin thesamethesauruscategory. Anotherproblemresultsfrom thefact thatthemodeldoesnot takeadvantageof thetendency for multiplesimultaneouschainsmightoccuroverthesameintention.Forexample,Text 4-3of Morris (1988)containsadiscussionof theroleof womenin theUSSRasembodiedin thelife of RaisaGorbachev. Two differentchainsspanmostof the text: Oneconsistsof termsrelatingto the Soviet Union andtheUnitedStates,andtheotherrefersto women,men,husbands,andwives(seeFigure2.3).

0 5 10 15 20 25 30 35

"chains""intentions"

Figure2.3: Thetargetintentionalstructureandtheextentsof actualchainsfoundin Morris88 for text 4-3. Thex-axis indicatessentencenumbers,they-axis indicatesrelative depthof embeddingof theintentionalstructure.

Another, more seriousproblemariseswhen looking at longer texts: chain overlap.In otherwords,many chainsendat a particularparagraphwhile at the sametime manyotherchainsextendpastthat paragraph.Figure2.4 shows the distribution, by sentence

Page 111: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 19

2.5.2 Lexical Cohesion Relations

Theseminallinguistic work on lexical cohesionrelationsis thatof Halliday & Hasan(1976). In a moreabbreviatedform, Raskin& Weiser(1987)point out thata distinctionmustbe madebetweencohesion andcoherence in a discourse.They state: “Coherencerefersto theconsistency of purpose,voice,content,style, form, andso on of a discourseasintendedby thewriter, achievedin the text, andperceivedby the reader. Cohesion,onthe otherhand,is a textual quality which contributesto coherencethroughverbalcues”(p 48). Onekind of cohesioncueis that of lexical cohesion,which “...resultsfrom theco-occurrenceof semanticallysimilar wordsthatdo not independentlyindicatecohesion”(p 204). Following Halliday& Hasan(1976),they describetwo formsof lexical cohesion,reiteration andcollocation, wheretheformerrefersto repetitionof wordsor theirsynonyms,andthe latter refersto termsthat tendto co-locatein text, e.g.,night andday, or schoolandteacher. Otherkindsof cohesioncuesrelateto specificwordsthat indicateparticularrelations,e.g., afterwardsindicatesa temporalrelation betweensentences,and and canindicatea conjunctive relationship.Relationssuchasanaphoricreferenceareconsideredto begrammaticalcohesion,asopposedto lexical cohesion.

Phillips (1985)suggests“an analysisof the distribution of the selectedtext elementsrelative to eachotherin somesuitabletext interval ... for whatever patternsof associationthey may contractwith eachotherasa function of repeatedco-occurrence”(p 59). Theresultinganalysisleadsto hypothesesof lexical meaningbasedon termco-occurrence,butthetext structureelicitedreflectsnot muchbeyond thechapterstructureof the text bookshe investigates.Two otherimportantapproachesarethoseof Morris & Hirst (1991)andYoumans(1991),describedin thefollowing sections.

Morris and Hirst

Morris and Hirst’s pioneeringwork on computingdiscoursestructurefrom lexicalrelations(Morris & Hirst 1991;Morris 1988)is a precursorto thework reportedon here.Morris, influencedby HallidayandHasan’s theoryof lexical coherence(Halliday& Hasan1976), developedan algorithm that finds chainsof relatedterms via a comprehensivethesaurus(Roget’sFourthEdition). For example,thewordsresidentialandapartmentbothindex thesamethesauralcategory andcanthusbeconsideredto bein acoherencerelationwith oneanother. Thechainsareusedto structuretexts accordingto GroszandSidner’sattentional/intentionaltheoryof discoursestructure(Grosz& Sidner1986),andtheextentof thechainscorrespondto the extentof a segment. Thealgorithmalsoincorporatesthenotionof “chainreturns”– repetitionof termsaftera longhiatus– to closeoff anintentionthatspansover adigression.

SincetheMorris andHirst algorithmattemptsto discover attentional/intentionalstruc-ture,theirgoalsaredifferentthanthoseof TextTiling. Specifically, thediscoursestructurethey attemptto discover is hierarchicaland more fine-grainedthan that discussedhere.Morris (1988)providesfive shortexampletexts for which shehasdeterminedthe inten-

Page 112: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 18

fit the discourseinto a predefinedframeor script, e.g.,Schank& Abelson(1977),Hahn(1990),DeJong(1982),Mauldin (1989). Theseapproachesareusuallyusedto createasummaryof somekind. A variationon the themeis found in case-basedreasoning,e.g.,Kolodner(1983),Bareiss(1989),in whicha discourseis adjustedto fit theexpectationsofasetof pre-analyzeddiscourses.Theproblemwith thiskind of approachis thatit requiresdetailedknowledgeaboutevery domainthat the analyzedtexts discuss,and requiresavery largeamountof processingtime for theanalysisof only a few sentences;impracticalrequirementsfor a full-scaleinformationaccesssystem.

2.5 Detecting Discourse Structure

Many differentmechanismshave beenproposedfor the automateddeterminationofdiscoursestructure. Explicit cuewords,(e.g.,now, well, so in English(Schiffrin 1987))arerecognizedasbeingmeaningfulcues,especiallyfor spokentext. However, thesecuesarenotunambiguousin usage,andconsiderableeffort is requiredto determinetheroleof aparticularinstanceof acue(Hirschberg& Litman1993).Otherkindsof cues,suchastense(Webber1987), (Hwang& Schubert1992), are also informative but requirea complexanalysis. The next two subsectionsdiscusstwo other meansof determiningdiscoursestructure,makinguseof thepatternsof cohesionindicatorsotherthanlexical cohesion,andlexical cohesionrelationsthemselves.

2.5.1 Distributional Patterns of Cohesion Cues

Researchershave experimentedwith the displayof patternsof cohesionindicatorsindiscourseasananalyticdevice, for example,Grimes(1975)(Ch.6) uses“spancharts”toshow the interactionof variousthematicdevicessuchasidentification,settingandtense.Stoddard(1991)creates“cohesionmaps”by assigningto eachword a locationon a two-dimensionalgrid correspondingto theword’s positionin thetext (roughly, eachsentencecorrespondsto a row), andthendrawing a line betweenthelocationof a cohesive elementandthelocationof its original referent.Theresultingmaplookssomewhatlike a columnof hangingpine-needlebunches;thustexts canbe comparedvisually for propertiessuchas burstiness,density, andconnectionspan. Eachkind of cohesive elementis assignedits own map,althoughfor oneexampleall threecohesionmapsaresuperimposed.Herecohesionelementsarepronominalreferents,referentsof definitearticles,andverb agentdisplacements– lexicalcohesionrelationsarenottakenintoaccount.Unfortunately, neitherStoddardnorGrimesanalyzetheresultingpatternsor describehow to usethemto segmentor interpretthetexts.

Page 113: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 17

Chained

Ringed

Monolith

Piecewise

Figure2.2: Skorochod’ko’stext structuretypes.Nodescorrespondto unitsof text suchassentences,andedgesbetweennodesindicatestrongterm overlapbetweenthe text units.Correspondencebetweenpositionof anodeandpositionin thetext dependson thekind ofstructure;this is describedin moredetailin thetext.

together, oneafteranother. This topologymapsnicely ontothatof viewing documentsasa sequenceof denselyinterrelatedsubtopicaldiscussions,one following another. Thisassumption,aswill beseen,is notalwaysvalid, but is neverthelessquiteuseful.

2.4.3 Grammars and Scripts

An alternative way of analyzingdiscoursestructureis to proposea “grammatical”discoursetheory. Many researchershave seenthis as a naturalextensionto the ideasof sentencegrammar. Fillmore (1981:147)makesa distinctionbetweenwhata sentencegrammariandoes(looksfor grammaticalityandnongrammaticality)andwhata discoursegrammariandoes(looksfor sequiturityandnonsequiturity).Wilensky (1983b)alsodisputestheanalogybetweenstorygrammarsandsentencegrammars,arguingthatintuitionsaboutstoriesarecloserto our intuitions aboutthe meaningsof sentencesthanthey areto ourintuitionsaboutsentencesthemselves.

Anotheralternative is to interprettexts from anartificial intelligencestanceandtry to

Page 114: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 16

descriptive tool for analysisof the rhetoricalstructureof text, designedto be usedinautomatedsystems.In RST, text is brokenup into clausalunits,eachof whichparticipatesin a pairwisenucleus/satelliterelationship.Thepairsparticipateascomponentsof largerpairwiseunits, building up a hierarchicaldiscoursedescription. Someof the rhetoricalrelationslinking theunitsare: elaboration,enablement,motivation,andbackground.Theauthorsrecognizethat thereareno reliablegrammaticalor lexical cluesfor automaticallydeterminingthestructure,andoftentherelationscanonly bediscernedby theunderlyingmeaningof thetext. Theanalysisis goal-orientedandmightbelesseffective for texts thatcannotbe describedwell in this manner. RST hasbeenusedin generationsystems,e.g.,Moore& Pollack(1992).

Skorochod’ko’s Topologies

Althoughmany aspectsof discourseanalysisrequireahierarchicalmodel,in thisworkI chooseto castexpositorytext into a linearsequenceof segments,bothfor computationalsimplicity andbecausesuchastructureis appropriatefor coarse-grainedapplications.Thisprocedureis influencedby Skorochod’ko(1972),who suggestsdeterminingthesemanticstructureof a text (for thepurposesof automaticabstracting)by analyzingit in termsof thetopologyformedby lexical interrelationsfoundamongits sentences.

Skorochod’ko(1972) suggestsdiscovering a text’s structureby dividing it up intosentencesandseeinghow muchword overlapappearsamongthesentences.Theoverlapforms a kind of intra-structure;fully connectedgraphsmight indicatedensediscussionsof a topic, while long spindly chainsof connectivity might indicatea sequentialaccount(seeFigure2.2). Thecentralideais thatof definingthestructureof a text asa functionoftheconnectivity patternsof thetermsthatcompriseit. This is in contrastwith segmentingguidedprimarily by fine-graineddiscoursecuessuchasregisterchange,focusshift, andcuewords.Froma computationalviewpoint,deducingtextual topic structurefrom lexicalconnectivity alone is appealing,both becauseit is easyto compute,and also becausediscoursecuesaresometimesmisleadingwith respectto thetopicstructure(Brown & Yule1983)(å 3).

In the Chainedstructure,eachsentencedescribesa new situationor a new aspectofof the topic underdiscussion.Examplesarechronologicaldescriptions,whereoneeventfollows thenext, and“road maps”in thebeginningof technicalpapersoutlining what thefollowing sectionscontain.TheRingedstructureis like theChainedstructureexceptin thelastportionof thediscoursereturnsto whatwasinitially discussed,perhapsasa summarydiscussion.TheMonolith structurerepresentsadenselyinterrelateddiscussion;eachblockcontainsreferencesto termsin the other blocks, indicatingseveral interwoven thematicthreads.ThePiecewise Monolithic structureconsistsof a sequenceof denseinterrelateddiscussions.Skorochod’kodid not definea hierarchicalstructure,perhapsbecauseit isdifficult to identify by usingonly terminterrelations.

Thetopologymostof interestto thiswork is thefinal onein thediagram,thePiecewiseMonolithicStructure,sinceit representssequencesof denselyinterrelateddiscussionslinked

Page 115: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 15

tendto usephrasalor clausalunitsasbuilding blocksfrom which analysesof lengthfromoneto threeparagraphslongaremade(for example,in Morris (1988),intentionalstructureis foundfor texts of approximately40sentencesin length).

Discoursework at themulti-paragraphlevel hasbeenmainly in the theoreticalrealm,notably the work on macrostructures(van Dijk 1980) (van Dijk 1981)and the work onstory grammars(Lakoff 1972),(Rumelhart1975). An exception is the work of Batali(1991)thatmakesuseof discoursestructurein theautomatedinterpretationof (simplified)chaptersof introductoryphysicstexts,with thegoalof learningrulesfor solvingproblemsin kinematics.

2.4.2 Topology of Discourse Structure

Hierarchical Models

Many theoriesof discoursestructure,bothcomputationalandanalytical,assumea hi-erarchicalmodelof discourse.Two prominentexamplesin computationaldiscoursetheoryarethe theoryof attentional/intentionalstructure(Grosz& Sidner1986),andRhetoricalStructureTheory(Mann& Thompson1987).

Grosz& Sidner(1986)presentthebasicelementsof acomputationaltheoryof discoursestructure. The two main questionsthe theory tries to answerare: What individuatesadiscourse?What makesit coherent?They claim the answersare intimately connectedwith two non-linguisticnotions,attentionandintention. Attention is an essentialfactorin explicating the processingof utterancesin discourse. Intentionsplay a primary rolein explaining discoursestructureand defining discoursecoherence. Groszand Sidnerclaimthattheintentionsthatunderliediscoursearesodiversethatapproachesto discoursecoherencebasedonselectingdiscourserelationshipsfromafixedsetof alternativerhetoricalpatternsareunlikely to suffice. (SeeHovy (1990)for a strongcounterview.)

In this theorythelinguisticstructureconsistsof thediscoursesegmentsandanembed-ding relationshipthatcanhold betweenthem. Theembeddingrelationshipsarea surfacereflectionof relationshipsamongelementsof the intentionalstructure.Linguistic expres-sionsareamongtheprimaryindicatorsof discoursesegmentboundaries.Theexplicit useof certainwordsandphrasesandmoresubtlecues,suchasintonationor changesin tenseandaspect,areincludedin therepertoireof linguisticdevicesthatfunctionto indicatetheseboundaries.

Theattentionalstateis modeledby a setof focusspaces;changesin attentionalstatearemodeledby a setof transitionrulesthatspecifytheconditionsfor addinganddeletingspaces.Onefocusspaceassociatedwith eachdiscoursesegment.Thefocusspacehierarchyis different/separatefrom the intentional(task)structure. Passonneau& Litman (1993),following Rotondo(1984),concedethedifficulty of eliciting hierarchicalintentionalstruc-ture with any degreeof consistency from their humanjudges. Not surprisingly, no fullyimplementedversionof this theoryexists.

RhetoricalStructureTheory(RST)(Mann& Thompson1987)is a functionally-based

Page 116: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 14

The natureof the analysiscanbe heavily dependenton whetheror not the theory isgearedtowardsa computationalversusananalyticalframework. An additionalinfluentialfactor is theperceivedrole or purposeof thediscoursestructure.If thegoalof discourseanalysisis to allow thesystemto answerquestionsin aninteractive sessionwith a human,thenissuessuchastheintentionsof thespeakersmustbetakeninto account(e.g.,Wilenskyet al. (1984), Moore & Pollack (1992)). Researchersworking on tutoring and advicesystemsthatengagein dialogueswith humanshave tendedto emphasizepragmatics,e.g.,referenceresolution.Thisusuallyrequiresanunderstandingof issuesrelatingto discoursefocusandcentering. An importantaspectof Winograd’s classicthesiswork (Winograd1972)is his program’sability to determinewhich objectis theonemostlikely to beunderdiscussion.Hedoesthisby incorporatingavarietyof factors,includingthecurrentcontextandfocusof thediscourseaswell asthesemanticsof theobjectsandrelationshipsunderdiscussion(cf. å 8.2). In spoken-text discourseanalysis,focus is usuallystudiedat thesententiallevel, with links amongfoci typically spanningonly a few sentences.Otherexamplesarethe computationalwork of Grosz(1986)andSidner(1983),who examineissuesrelatingto focusandanaphorresolution.

Otherresearchemphasizesthesyntacticaspectsof anaphorresolutionandellipsis,forexample,Dalrympleet al. (1991)andHardt (1992). Anotherapproachis theapplicationof plans,e.g.,Wilensky (1981),Lambert& Carberry(1991)andknowledge,e.g.,Hobbs(1978), Luperfoy (1992), Cardie(1992), to anaphorresolutionand other interpretationtasks.

As isevidentfromthediscussionabove,alargepartof thecomputationaldiscourseworkhasbeendonein thecontext of interactivesystems.In general,thediscoursecharacteristicsof spokentext arequitedifferentfrom thoseof written,especiallyexpository, text (Brown& Yule1983)( å 1.2). Thegoalsof analyzingtextsfor interactivesystemsaredifferentfromthoseof discoursesegmentationof written texts into subtopicalboundaries,andit followsthatthechoiceof discourseunit andtopologydiffer for thedifferenttasks.

2.4.1 Granularity of Discourse Structure

Thereis atraditionin linguisticsof viewing discoursestructureasthestudyof relationsat the interphrasalor interclausallevel. Thenotion of the given/new (or topic/comment)distinction is an extensively studiedonein linguistics. In English, topics, in this sense,areusuallysubjectsandcommentsarethe associatedpredicates.In somelanguagesthedistinctionis markedmoreovertly (Kuno1972),(Grimes1975). This is closelyrelatedtothedistinctionsof theme/rhemeandgiven/new at thesententiallevel.

Work on prosodicstructureof spokentext usually takesplaceat the inter-sententiallevel, e.g.,Wang& Hirschberg (1992),Bachenkoet al. (1986).As mentionedabove,workin anaphoraresolutiontendsto focusonintra-sententialunits,asdoesmosttext-generationwork.

Thehierarchicaltheoriesof discoursesuchasthetheoryof attentional/intentionalstruc-ture (Grosz& Sidner1986),andRhetoricalStructureTheory(Mann& Thompson1987)

Page 117: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 13

createsynopses(e.g.,Chen& Withgott (1992),Pollock & Zamora(1975))currentlydonot usuallytakethis kind of informationinto account. Paice(1990)recognizesthe needfor takingtopicalstructureinto accountbut doesnot have a methodfor determiningsuchstructure.

An interestingalternativeapproachappearsin theworkof Alterman& Bookman(1990).Theauthorsapplyknowledge-intensivetechniquesto interpretshorttextsandthenplot thenumberof inferencesthatcanbemadeagainsttheclausalpositionin thetext. They usetheresultingplot to determinethe“thickness”of thetext ateachpoint; breaksin thicknessin-dicateanepisodechange.Summariesareproducedby findingthemainepisodeboundariesandextractingconceptsfrom eachepisodethat is deemedto be important(usinganothermeasure).Althoughthetechniqueis heavily knowledge-orientedandcomputationallyex-pensive,andthelengthof eachepisodeis abouttwo sentencesonaverage,thegeneralideabearssomeresemblanceto thatdiscussedbelow.

Turning now to the relatedtopic of text generation,Mooney et al. (1990)assertthatthehigh level structureof extendedexplanationsis determinedby processesseparatefromthosewhich organizetext at lower levels. They presenta schemefor text generationthat is centeredaroundthe notion of BasicBlocks: multi-paragraphunits of text, eachof which consistsof (1) an organizationalfocussuchasa personor a location,and (2)a setof conceptsrelatedto that focus. Thustheir schemeemphasizesthe importanceoforganizingthehighlevel structureof a text accordingto its topicalcontent,andafterwardsincorporatingthe necessaryrelatednessinformation,as reflectedin discoursecues,in afiner-grainedpass.This useof multi-paragraphunits for coherentgenerationimpliesthatthisunit of segmentationshouldbeusefulin recognitiontasksaswell.

2.4 Discourse Structure

Whenanalyzingtextualdiscoursestructure,two importantandrelatedissuesare:whatkind of structureis inherentin discourse,andwhatmechanismsandaspectsof languageareneededto detectthatstructure.Althoughthesecondis stronglyinfluencedby thefirst, it isnotunambiguouslydeterminedby thefirst; thatis, onekind of structurecanberecognizedvia lexical distribution patterns,isolateddiscoursecues,andother factors,with varyingdegreesof success.

Two importantsubissuesarisewith respectto the choiceof assumptionsabout thestructureof discourse:

1. At what level of granularityare the units of the discourse?Is the salientunit theword,phrase,clause,sentence,paragraph,or somethingelse?Is morethanonelevelof granularityappropriate?

2. What is the topologyof the discoursestructure?I.e., what form do thepatternsofinterrelationsamongtheunitsof thediscoursestructuretake?

Page 118: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 12

01-06 North Americadivided into two vastregions,onein-clining towardsthePole,theothertowardstheEquator

07-09 Valley of theMississippi10-11 Tracesfoundthereof therevolutionsof theglobe12-13 Shoreof the Atlantic Ocean,on which the English

colonieswerefounded14-16 Dif ferentaspectsof Northandof SouthAmericaat the

timeof theirdiscovery17-18 Forestsof NorthAmerica19-19 Prairies21-25 Wanderingtribesof natives20-20 Theiroutwardappearance,customs,andlanguages26-28 Tracesof anunknown people.

Figure2.1: Paragraph-level breakdown of thesubtopicstructureof Tocqueville Chapter1,Volume1.

colonieswerefounded– Dif ferentaspectsof North andof SouthAmericaatthetimeof theirdiscovery – Forestsof NorthAmerica– Prairies– Wanderingtribesof natives– Theiroutwardappearance,customs,andlanguages– Tracesof anunknown people.

Thesedescriptionscanbeconstruedto besubtopicaldiscussionsthat takeplacein thecontext of a discussionof the exterior form of North America. The list closely reflectsthe orderof discussionof the subtopicsin the ensuingchapter, with a few exceptionsoforderswitchingsandparagraphswhosecontentplaysabridgingroleandsodoesnotmeritmentionin thesubtopiclist. Figure2.1below shows thatthesubtopicdiscussionsin mostcasesspanmorethanoneparagraph.Although the paragraphsin andof themselvesaresomewhatencapsulated,this exampledemonstratesthat themulti-paragraphunit sizecanindeedbeameaningfulone.

A scanof thesubtopicdiscussionsmakesit apparentthat the title of thechapterdoesnot adequatelycover thecontentsof thetext. A discussionof theearly inhabitantsof thecontinentis not somethingonetendsto classifyascentralto its exterior form. The titlemight betterbe served as“Exterior Form andEarly Inhabitantsof North America”. Theassumptionthat a logical text unit must discussonly one topic might be at leastpartlyresponsiblefor themistitle.

Multi-paragraphsubtopicstructureshouldact asa first steptoward automaticdeter-minationof text synopses.Algorithms that extract salientphrasesfrom texts in orderto

Page 119: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 11

2.3.2 Online Text Display and Hypertext

Researchin hypertext and text display hasproducedhypothesesabouthow textualinformationshouldbe displayedto users.Onestudyof anonline documentationsystem(Girill 1991)compareddisplayof fine-grainedportionsof text (i.e., sentences),full texts,andintermediatesizedunits. Girill foundthatdivisionsat thefine-grainedlevel werelessefficient to manageandlesseffective in deliveringusefulanswersthanintermediatesizedunitsof text. (Girill alsofoundthatusingdocumentboundariesismoreusefulthanignoringdocumentboundaries,asis donein somehypertext systems.)Theauthordoesnot makeacommitmentaboutexactly how largethedesiredtext unit shouldbe,insteadtalkingabout“passages”anddescribingpassagesin termsof thecommunicative goalsthey accomplish(e.g.,a problemstatement,an illustrative example,an enumeratedlist). The implicationis that theproperunit is theonethatgroupstogetherthe informationthatperformssomecommunicativefunction;in mostcasesthisunit will rangefrom oneto severalparagraphs.(Girill implies thatpre-markedsectionalinformation,if availableandnot too long, is anappropriateunit.)

Tombaughet al. (1987)explore issuesrelatingto easeof readabilityof long texts onCRT screens. Their study explores the usefulnessof multiple windows for organizingthe contentsof long texts, hypothesizingthat providing readerswith spatialcuesaboutthe locationof portionsof previously readtexts will aid in their recall of the informationand their ability to quickly locateinformation that hasalreadybeenreadonce. In theexperiment,thetext is dividedinto pre-markedsectionalinformation,onesectionplacedineachwindow. They concludethatsegmentingthetext by meansof multiplewindows canbevery helpful if readersarefamiliar with themechanismssuppliedfor manipulatingthedisplay.

Converting text to hypertext in what is calledpost-hocauthoring(Marchionini et al.1991)requiresdivisionof theoriginal text into meaningfulunits(a tasknotedby theseau-thorstobeachallengingone)aswell asmeaningfulinterconnectionof theunits. Automatedmulti-paragraphsegmentationshouldhelpwith thefirst stepof this process.

2.3.3 Text Summarization and Generation

Nineteenthcenturyhistoriesandtraveloguesoftenprefacedchapterswith alist of topicaldiscussions,providing aguidefor thereaderasto thecontentsto come.Thesedescriptionsarenotabstractedsummaries,but ratherarelistsof thesubdiscussionsthattakeplaceduringthecourseof thechapter. For example,Chapter1 of Alexis deTocqueville’sDemocracy inAmerica, Volume 1 is entitled“Exterior Form of North America” andis prefacedwith thefollowing text:

North Americadividedinto two vastregions,oneinclining towardsthePole,theothertowardstheEquator–Valley of theMississippi–Tracesfoundthereoftherevolutionsof theglobe–Shoreof theAtlanticOcean,onwhichtheEnglish

Page 120: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 10

corpora.Severalsuchalgorithmsmakeuseof informationaboutlexical co-occurrence;thatis, they counthow oftentermsoccurnearoneanotheracrossmany texts.

Someof thesealgorithmsuseonly very local context. For example,working withlarge text collections,Brent (1991)andManning(1993)makeuseof restrictedsyntacticinformationto recognizeverbsubcategorizationframes,Smadja& McKeown (1990)createcollectionsof collocationsby gatheringstatisticsaboutwordsthat co-occurwithin a fewwords of one another, and Church& Hanks(1990) usefrequency of co-occurrenceofcontentwordsto createclustersof semanticallysimilarwords.

However, severalalgorithmsgatherco-occurrencestatisticsfrom largewindowsof text,usuallyof fixed length. For example,thedisambiguationalgorithmsof Yarowsky (1992)andGaleet al. (1992b)trainon large,fixed-sizedwindowsof text. In thesealgorithms,alltermsthat residewithin a window of text aregroupedtogetherto supplyevidenceaboutthe context in which a word senseoccurs. For example,an instanceof the tool senseofthewordcranemightbesurroundingby termsassociatedwith largemechanicaltools,suchas lift andconstruction. Termssurroundingthe bird sensewould tendto be thosemoreassociatedwith birddom.A questionarisesabouthow muchcontext surroundingthetargetwordshouldbeincludedin theassociation.Galeet al. (1992b)haveshown that,at leastinonecorpus,usefulsenseinformationcanextendout for thousandsof wordsfrom thetargetterm. In practiceYarowsky (1992)usesafixedwindow of 100words.

Galeet al. (1992c)andGaleet al. (1992a)provide evidencethat in mostcasesonlyonesenseof a word is usedin a givendiscourse.For example,if theword bill is usedinits legislative sensein a discourse,thenit is unlikely to be usedin the senseof thebodypartof aduckin thatsamediscourse.They performedexperimentswhich indicatethatthesamesenseof apolysemouswordoccurredthroughoutencyclopediaarticlesandCanadianparliamentproceedings.It is possiblethat in texts whosecontentsare lessstereotyped,differentsensesof the sameword will occur, but in different contexts within the sametext, that is, not particularlynearoneanother. If this is the case,thenmotivatedmulti-paragraphsegmentationcouldhelpdeterminetheboundarieswithin whichsinglesensesofpolysemouswordsareused.

Anotherexampleof analgorithmthatderiveslexicalco-occurrenceinformationisWordSpace(Schutze1993b).In thisalgorithm,statisticsarecollectedaboutthecontextsin whichwordsco-occur. The resultsareplacedin a term-by-termco-occurrencematrix which isthenreducedusingavariantof multidimensionalscaling.Theresultingmatrixcanbeusedto makeinferencesaboutthe closenessof words in a multidimensionalsemanticspace.Currently the co-occurrenceinformation is found by experimentingwith differentfixedwindow sizesandchosingonethatworksbestfor a testset.

A critical assumptionunderlyingthesealgorithmsis thatthetermsco-occurringwithinthe text window do so becausethey are at leastloosely semanticallyrelated. It seemsplausiblethat changesin discoursestructurewill correspondto changesin word usages,andsothequalityof thestatisticsfor thesealgorithmsshouldbenefitfrom theuseof trainingtexts thathave beenpartitionedon thebasisof subtopiccontent.

Page 121: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 9

thatcanbeimplemented,thatcanberunonrealtexts,andthatcanrunonavarietyof textsindependentof theirdomainof discourse.Giventhecurrentstateof theart,thiscanbestbedonebymethodsthatwork in acoarsewayoncoarseunitsof information.Theapplicationsfor whichtheresultsareto beuseddonotnecessarilyrequirefine-graineddistinctions.Thisis especiallytrueof somekindsof informationretrieval applications.A usermight havedifficulty formulatingaqueryin whichmultipleembeddedlevelsof topicstructureneedbespecified,althoughthis kind of informationcould be usefulfor browsing. Most existingapproachesto discourseprocessingaretooambitiousto yield generallyapplicableresults;it is hopedthatby trying to makecoarserdistinctionsthe resultswill bemoreuniversallysuccessful.

2.3 Why Multi-Paragraph Units?

In schoolwe aredidacticallytaughtto write paragraphsin a certainform; thereforeacommonassumptionis thatmostparagraphshave a certainkind of well-formedstructure,completewith topicsentenceandsummarysentence.In real-worldtext, theseexpectationsareoftennotmet.Butevenif aparagraphiswrittenin aself-contained,encapsulatedmanner,aparticularsubtopicaldiscussioncanspanmultipleparagraphs,with onlydifferentnuancesbeingdiscussedin theparagraphsthatcomprisethediscussion.

Multi-paragraphsegmentationhasmany potentialapplications,including:

æ InformationAccessæ Corpus-basedComputationalLinguisticsæ Text DisplayandHypertextæ Text Summarization

Applicationsto informationaccessarea majorconcernof this thesisandarediscussedindetail in Chapter3. There,I describehow tiles areusedin an iconic graphicalrepresen-tation thatallows theuserto understandthedistributional relationshipsbetweentermsina queryandtermsin the retrieveddocuments.Anotherbenefitof usingmulti-paragraphsegmentationis thatsincein mostcasestherearefewertilesperdocumentthanparagraphs,tiles requirelessstorageandcomparisontime for otherwiseequivalent,paragraph-basedalgorithms.

However, multi-paragraphsegmentationhasbroaderapplications.Thesearedescribedbelow.

2.3.1 Corpus-based Computational Linguistics

An increasinglyimportantalgorithmicstrategy in computationallinguisticsis to deriveinformationaboutthe distributional patternsof languagefrom large text collections,or

Page 122: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER2. TEXTTILING 8

TextTiling. Theultimategoalis to notonly identify theextentsof thesubtopicalunits,butto label their contentsaswell. This chapterwill focusonly on thediscovery of subtopicstructure,leaving determinationof subtopiccontentto futurework. (Chapter4 discussesautomaticassignmentof maintopiccategories.)

2.2 What is Subtopic Structure?

In order to describethe detectionof subtopicstructure,it is importantto definethephenomenaof interest. Theuseof the term“subtopic” hereis meantto signify piecesoftext ‘about’ something(andis not to beconfusedwith the topic/comment(Grimes1975)distinction found within individual sentences).The intendedsenseis that describedinBrown & Yule (1983:69):

In order to divide up a lengthyrecordingof conversationaldatainto chunkswhich canbe investigatedin detail, the analystis often forcedto dependonintuitivenotionsaboutwhereonepartof aconversationendsandanotherbegins.... Which point of speaker-change,amongthe many, could be treatedastheendof onechunkof theconversation?This typeof decisionis usuallymadeby appealingto anintuitivenotionof topic. Theconversationalistsstoptalkingabout‘money’ andmove on to ‘sex’. A chunkof conversationaldiscourse,then,canbetreatedasa unit of somekind becauseit is ona particular‘topic’.Thenotionof ‘topic’ is clearlyanintuitively satisfactorywayof describingtheunifyingprinciplewhichmakesonestretchof discourse‘about’ somethingandthenext stretch‘about’ somethingelse,for it is appealedto very frequentlyinthediscourseanalysisliterature.

Yet thebasisfor theidentificationof ‘topic’ is rarelymadeexplicit.

Otherswho have statedthe intendedsenseinclude Rotondo(1984), who writes “Amacro-unitcanbe roughly definedasany coherentsubpartof a text which is assignedaglobalinterpretationof its own” andTannen(1984:38,citedin Youmans(1991))who,whendiscussingspokendiscourse,claims: “... themostusefulunit of studyturnedout to betheepisode,boundedby changesof topic or activity, ratherthan,for example,the adjacencypairor thespeechact.”

Hinds(1979:137)suggeststhatdifferentdiscoursetypeshavedifferentorganizingprin-ciples. TextTiling is gearedtowardsexpositorytext; that is, text thatexplicitly explainsorteaches,asopposedto, say, literary texts. More specifically, TextTiling is meantto applyto expositorytext that is not heavily stylizedor structured.A typical exampleis a five-pagesciencemagazinearticleor a twenty-pageenvironmentalimpactreport. It excludesdocumentscomposedof short“newsbites”or any otherdisjointed,althoughlengthy, text.

A two-level structureis chosenfor reasonsof computationalfeasibilityandto coincidewith the goalsof the useof thealgorithms’results. This thesisemploysonly algorithms

Page 123: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

7

Chapter 2

TextTiling

2.1 Introduction: Multi-paragraph Segmentation

The structureof expository texts can be characterizedas a sequenceof subtopicaldiscussionsthat occur in the context of a few main topic discussions.For example,apopularsciencetext calledStargazers, whosemain topic is the existenceof life on earthandotherplanets,canbedescribedasconsistingof thefollowing subdiscussions(numbersindicateparagraphnumbers):

1-3 Intro – thesearchfor life in space4-5 Themoon’schemicalcomposition6-8 How theearlyproximity of themoonshapedit

9-12 How themoonhelpedlife evolve onearth13 Theimprobabilityof theearth-moonsystem

14-16 Binary/trinarystarsystemsmakelife unlikely17-18 Thelow probabilityof non-binary/trinarysystems19-20 Propertiesof oursunthatfacilitatelife

21 Summary

Subtopicstructureis sometimesmarkedin technicaltextsby headingsandsubheadingswhichdividethetext into coherentsegments;Brown & Yule(1983:140)statethatthiskindof division is oneof themostbasicin discourse.However, many expositorytexts consistof longsequencesof paragraphswith very little structuraldemarcation.

This chapterdescribeswhy suchstructureis useful and presentsalgorithmsfor au-tomaticallydetectingsuchstructure.1 Becausethe modelof discoursestructureis oneinwhichtext ispartitionedintocontiguous,nonoverlappingblocks,I call thegeneralapproach

1I amgratefulto AnneFontainefor herinterestandhelpin theearlystagesof this work.

Page 124: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER1. INTRODUCTION 6

1.2 An Approach to Computational Linguistics

Onegoalof naturallanguageprocessingis to designprogramswhich interprettexts inmuchthesamewaythatahumanreaderwould. Sincethis is suchadifficult taskandsinceit requiresa largeamountof domainknowledge,mostof thework of this sort focusesonsmallcollectionsof sentences.Thisapproachis appropriatewhenautomatingdetailedtextinterpretation(e.g.,Schank& Abelson(1977),Wilks (1975),Wilensky (1983a),Charniak(1983),Norvig (1987))or whensupportinga theoryabouthumaninferenceandparsingmechanisms(e.g.,Martin (1990),Jurafsky (1992)),but with someexceptionsthestateofthe art is suchthat the useof this kind of analysisin informationaccessis still a distantgoal.

In thepastfiveyearstherehasbeenanincreasingtendency to takea data-intensiveap-proachto languageanalysis,focusingonbroadbut coarse-grainedcoverageof unrestrictedtext (Church& Mercer1993).Thisapproachisstill uncommonin theareaof discourseanal-ysis;thework hereis anexception.Thealgorithmspresentedherearedomain-independentbut approximate,scalablebut error-prone,in thehopesthattheir applicationto thecoarsergoalsof informationaccesswill neverthelessbeuseful. Suchapproximatemethodsseemespeciallyappropriatefor text segmentation,andinformationaccessmoregenerally. Theseareintrinsically “fuzzy” tasks,in thesensethat they generallyhave no objectively correctanswer, andmany differentresultsmaybedeemedreasonable(comparedwith, for example,grammaticalityjudgments).Readersoftendisagreeaboutwhereto draw aboundarymark-ingatopicshift, orwhetheragiventext is relevantto aquery;thereforeit seemsimplausibletoexpectexactanswerstosuchquestions.Thisthesisdemonstratesthatdespitetheinherentplasticityof thesetasks,automatingtheseprocessescanstill yield usefulresults.

Page 125: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER1. INTRODUCTION 5

13

8 26

44

2 15

38

2 15

38

Food

Governmentç

Legal

Environment

Politics

Commerceè

Weapons WaterTechnology

Figure1.4: A sketchof theCougarinterface;threetopic labelshavebeenselected.

category labelsto texts, andChapter5 presentsa new displaymechanismfor makingthisinformationavailableto theuser. Thecategorizationalgorithmusesa largetext collectionto determinewhich termsare salient indicatorsfor eachcategory. The algorithm alsoallows for theexistenceof multiple simultaneousthemessinceeachword in the text cancontributeto evidencefor acategorymodel,andeachwordcancontributeevidenceto morethanonemodel,if appropriate.Oneof thecategory setsusedby thealgorithmconsistsof106general-interestcategories;Chapter4 describesanalgorithmthatautomaticallyderivesthesecategoriesfrom anexisting hierarchicallexicon.

Oncemultiplemaintopiccategorieshavebeenassignedtoatext, they mustbedisplayedeffectively. Chapter5 describesaninterfacecalledCougar in whichfixedcategorysetsareusedfor twopurposes:toorienttheuserto thedatasetunderscrutiny, andtoplacetheresultsof thequeryinto context (seeFigure1.4). Cougarallowsusersto view retrieveddocumentsin termsof the interactionamongtheir main topics, using the categorizationalgorithmfrom Chapter4 to provide contextual information. The interfacehelps usersbecomefamiliar with the topicsandterminologyof anunfamiliar text collection. A consequenceof allowing multipletopicsperdocumentis thatthedisplaymusthandlemulti-dimensionalinformation. The approachusedhereagainallows user input to play a role: the userspecifieswhich categoriesareat thefocusof attentionat any giventime. Cougarsuppliesa simplemechanismof visual intersectionto allow usersto understandhow the retrieveddocumentsarerelatedto oneanotherwith respectto theirmaintopiccategories.

Page 126: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER1. INTRODUCTION 4

Figure1.3: TileBarsfor a queryin which thetermsmultimediaandvideoarecontrasted.Rectanglescorrespondto documents,squarescorrespondto TextTiles, the darknessof asquareindicatesthefrequency of termsin thecorrespondingTermSet.Thetitle andinitialwordsof adocumentappearnext to its TileBar.

TileBars useTextTiles to breakdocumentsinto coherentsubparts. The query termdistribution is computedfor eachdocumentand the resultingfrequency is indicatedforeachtile, in abar-like image.Thebarsfor eachsetof querytermsaredisplayedin astackedsequence,yielding a representationthat simultaneouslyandcompactlyindicatesrelativedocumentlength,queryterm frequency, andqueryterm distribution. The representationexploits the naturalpattern-recognitioncapabilitiesof the humanperceptualsystem;thepatternsin a columnof TileBarscanbequickly scannedanddeciphered.

TileBarssupporta paradigmin which thesystemdoesnot decideon a singlerankingstrategy in advance,but insteadprovidesinformationthatallowstheuserto determinewhatkind of distributionalrelationshipsareuseful.Chapter3 describesTileBarsandtheiruses,aswell asotherissuesrelatingto passageretrieval.

TileBars displaydocumentsonly in termsof wordssuppliedin the userquery. Fora given retrieved text, if the querywordsdo not correspondto its main topics, the usercannotdiscernthecontext in which thequerytermswereused.For example,a queryoncontaminantsmayretrieve documentswhosemaintopicsrelateto nuclearpower, food,oroil spills. To help accountfor this, I suggestassigningto eachtext category labelsthatcorrespondto its maintopics,sothatuserscangeta feelingfor thedomainin whichquerytermsare to be used. Thus if two documentsdiscussthe samemain topic themesbutusedifferenttermsto do so, oneunified category canbe usedto representtheir content.Similarly, if a documentusesmany differenttermsto build up theimpressionof a theme,thenthe category cancapturethis informationin a compactform. If a documentis bestdescribedby more than one category, it can be assignedmultiple categories, and twodocumentsthatshareonemajorthemebut do not shareotherscanbeshown to berelatedonly alongtheoneshareddimension.

Towardthisend,Chapter4 describesanalgorithmthatautomaticallyassignsmaintopic

Page 127: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER1. INTRODUCTION 3

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0 10 20 30 40 50 60 70 80 90

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1819 20 21 2223

Figure1.2: Theoutputof theTextTiling algorithmwhenrunontheMagellanText. Internalnumbersindicateparagraphnumbers. Vertical lines indicateboundarieschosenby thealgorithm;for example,the leftmostvertical line representsa boundaryafterparagraph3.Notehow thesealignwith theoutlineof theMagellantext in Figure1.1.

calledTextTiling that attemptsthis task. The TextTiling algorithmmakesuseof lexicalcohesionrelationsto recognizewheresubtopicchangesoccur. For a givenblock size,thealgorithmcomparesthe lexical similarity of every pair of adjacentblocks. The resultingsimilarityscoresareplottedagainstsentencenumber,andafterbeinggraphedandsmoothed,theplot isexaminedfor peaksandvalleys(seeFigure1.2). Highsimilarityvalues,implyingthat the adjacentblockscoherewell, tendto form peaks,whereaslow similarity values,indicatinga potentialboundarybetweenTextTiles, createvalleys. Thealgorithm’s resultsfit betweenupperand lower evaluationbounds,wherethe upperboundcorrespondstoreaderjudgmentsandthelowerboundis asimple,reasonableapproachto theproblemthatcanbeautomated.TextTiling is discussedin Chapter2.

By castingdocumentcontentin termsof topicalstructure,I have developednew ideasaboutthe role of documentstructurein informationaccess. An inherentproblemwithinformation retrieval ranking functions is they makea decisionabout the criteria uponwhich documentsarerankedwhich is opaqueto theuser. This is especiallyproblematicwhen performinga retrieval function other than full similarity comparisonsincequerytermscanhave many differenttermdistribution patternswithin a full-text document,anddifferentpatternsmay imply differentsemantics.In somecasesa usermight like to finddocumentsthat discussoneterm asa main topic with perhapsjust a shortdiscussionofanotherterm as a subtopic. Current information accessparadigmsprovide no way toexpressthis kind of preference. To help remedythis, I presenta new representationalparadigm,calledTileBars, whichprovidesacompactandinformativeiconicrepresentationof thedocuments’contentswith respectto thequeryterms(seeFigure1.3). TileBarsallowusersto makeinformeddecisionsaboutnotonly which documentsto view, but alsowhichpassagesof thosedocumentsto select,basedon thedistribution of thequerytermsin thedocuments.

Page 128: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CHAPTER1. INTRODUCTION 2

1- 2 Intro to Magellanspaceprobe3 Atmosphereobscuresview4 Climate

5- 7 Meteors8-11 Volcanicactivity

12-15 Styxchannel16-17 AphroditeHighland

18 Gravity readings19-21 Recentvolcanicactivity22-23 Futureof Magellan

Figure1.1: Paragraph-level breakdown of thesubtopicstructureof anexpositorytext.

includea discussionof evidencefor volcanicactivity on Venusanda discussionof a largechannelknown astheRiver Styx. If the topic “volcanicactivity”, or perhaps“geologicalactivity”, is of interestto a user, aninformationaccesssystemmustdecidewhetheror notto retrieve this document.Sincevolcanismis not a main topic, the frequenciesof useofthis term will not dominatethe statisticscharacterizingthe document;therefore,to find“volcanicactivity” in thiscase,asystemwill haveto retrievedocumentsin whichthetermsof interestarenot themostfrequenttermsin thedocument.On theotherhand,thesystemshouldnotnecessarilyselectadocumentjustbecausethereareafew referencesto thetargetterms.Informationaboutthetopicstructurewouldallow a distinctionto bemadebetweenmain topics,subtopics,andpassingreferences.Thusthereis a needfor identifying thetopicstructureof documents.

In thisdissertationI suggestthattherelativedistributionof termswithin a text providescluesaboutits maintopicandsubtopicstructure,andthatthis informationshouldbemadeexplicit andavailableto theusersof a full-text informationaccesssystem.

Oneway to try to determineif two termsoccurin thesamesubtopicor in someotherco-modificationalrelationshipis to observe whetherbothoccurin thesamepassageof thetext. However, thenotionof “passage”is not well defined.(In many casesauthor-definedsectioninginformationis notpresentor is toocoarse-grained.)A simpleassumptionis thatevery paragraphis a passageandevery passageis a paragraph.But oftenthecontentsofa long text canbeunderstoodin termsof groupingsof adjacentparagraphs,asseenin theexampleabove. Thisobservationopensa new questionfor computationallinguistics:howcanmultiple-paragraphpassagesbeautomaticallyidentified?

A simpleapproachis to dividedocumentsinto approximatelyeven-sized,but arbitrarilychosen,multi-paragraphpieces.A moreappealing,but lessstraightforwardlyautomatizableapproachis to groupparagraphstogetherthatdiscussthesamesubthemeor subtopic.Thisdissertationdescribesa fully-implemented,domain-independenttext analysisapproach

Page 129: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

1

Chapter 1

Introduction

1.1 Full-Text Information Access

Full-lengthdocumentshave only recentlybecomeavailableonline in largequantities,althoughbibliographicrecordsandtechnicalabstractshavebeenaccessiblefor many years(Tenopir& Ro 1990). For this reason,informationretrieval researchhasmainly focusedon retrieval from titles andabstracts.In this dissertation,I arguethat the advent of full-lengthtext shouldbemetwith new approachesto text analysis,particularlyfor thepurposesof information access.1 I emphasizethat, for the purposesof information access,fulltext requirescontext, that is, the mechanismsusedfor retrieval and display of full-textdocumentsshouldtakeinto accountthe context in which the querytermsanddocumenttermsareused. Eachchapterof this thesisdiscussessomeaspectof supplyingor usingcontextual informationin orderto facilitateinformationaccessfrom full text documents.

Thisemphasisoncontext in full-text informationaccessarisesfromtheobservationthatfull text is qualitatively differentfrom abstractsandshorttexts. Mostof thecontentwordsin anabstractaresalientfor retrievalpurposesbecausethey actasplaceholdersfor multipleoccurrencesof thosetermsin theoriginal text, andbecausethesetermstendto pertaintothe most importanttopics in the text. On the otherhand,in a full-text document,manytermsoccurwhichdonotrepresenttheessenceof themaincontentsof thetext. Expositorytexts suchassciencemagazinearticlesandenvironmentalimpactreportscanbeviewedasconsistingof a seriesof short,sometimesdenselydiscussed,subtopicsthatareunderstoodwithin thecontext of themaintopicsof thetexts.

Considera23-paragrapharticlefromDiscover magazine.A readerdividedthistext intothesegmentsof Figure1.1,with the labelsshown, wherethenumbersindicateparagraphnumbers.Themaintopicof thistext istheexplorationof VenusbythespaceprobeMagellan.Therearealsoseveral subtopicaldiscussionsthatcover morethanoneparagraph.These

1The term information access is beginning to supercedethat of information retrieval sincethe latter’simplication is too narrow; the field shouldbe concernedwith informationretrieval, display, filtering, andqueryfacilitation.

Page 130: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

viii

I have beenprivileged to learn aboutlanguageand the mind from Chuck Fillmore, AlisonGopnik, Paul Kay, George Lakoff, John Searle,and Dan Slobin. Michael Ranney has beenespeciallysupportive and informative aboutCognitive Science. Ray LarsonandBill CooperoftheBerkeley Schoolfor Library Studieshave bothgenerouslysharedtheiradvice,knowledge,andbookson informationaccess,andMichaelBucklandandCliff Lynchhave keptmeinformedof thecuttingedgeof thefield. In theComputerSciencedepartment,Tom Anderson,Mike Clancy, JimDemmel,RandyKatz,JohnOusterhoutandKathyYelick havebeenhelpfulandapproachable,andI greatlyadmireDave Patterson’s confidenceandoptimismasbotha leaderanda researcher.

KenChurchhelpedpioneerthefield of corpus-basedcomputationallinguistics,andI owe hima specialthanksfor inviting me to an instructive summertimevisit at AT&T Bell Laboratories.Iwouldalsolike tothankPaulJacobsfor organizingthe1990AAAI SpringSymposiumonText-basedIntelligentSystems,whichwasa watershedeventin thecourseof my research.

I havebenefittedfrom interactionswith my coauthorsAnneFontaine,Greg Grefenstette,DavidPalmer, Chris Plaunt,Philip Resnik,andHinrich Schutze,andconference-colleaguesIdo Dagan,Haym Hirsh andDavid Yarowsky. I am alsograteful for the adviceandopinionsof Bill Gale,GraemeHirst, andGerrySalton.

JohnOusterhouthasdoneeveryonea greatserviceby inventingTcl/Tk. I’m alsogratefultoEthanMunsonfor maintainingandcustomizingthe latex thesisstyle files, andDan Jurafsky forcreatingthelsalikebibtex style.

SheilaHumphreys hasbeeninfallibly supportive, including finding financial supportfor meduringatricky period.KathrynCrabtreemakessplendidthedifferencebetweenattendingBerkeleyasan undergraduateandasa graduatecomputersciencemajor. Liza Gabato,JeanRoot, TeddyDiaz,andCrystalWilliams of theCSDepartmentstaff have beenveryhelpful throughtheyears.

Friendsfrom theoutsideworld who have stuckwith methroughthis includeIreneFong,JaneChoi Greenthal,Annie andBret Peterson,KayakoShimada,Terry Tao, Greg TheisenandSusanWood.

JohnBatali taughtmeaboutphilosophy, feminism,anddinosaurs.

This researchhasbeensupportedatvarioustimesby thefollowing sourcesof funding(in orderof appearance):TheU.S.Air ForceOfficeof ScientificResearch,Grant83-0254andtheNavalElectronicsSystemsCommandContractN39-82-C-0235.TheArmy ResearchOrganization,GrantDAAL03-87-0083andtheDefenseAdvancedResearchProjectsAgency throughNASA GrantNAG 2-530.A CaliforniaLegislativeGrant.Digital EquipmentCorporationunderDigital’s flagshipresearchprojectSequoia2000: LargeCa-pacityObjectServersto SupportGlobalChangeResearch.AT&T Bell Laboratories.TheAdvancedResearchProjectsAgency (ARPA) underGrantNo. MDA972-92-J-1029with theCorporationfor NationalResearchInitiatives(CNRI).

TheXeroxPalo Alto ResearchCenterhassupportedmy work from 1989to thepresent.

I’d like to thankmybrotherEd(“roll with thepunches”)andmysisterDor for beingsympathetic,andGrandmaMar for encouragingus to do what makesus happy. Finally, I’d like to thankmyparentsfor theirundividedlove andsupportandall thoseSundaynightdinnersin Berkeley.

Page 131: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

vii

Acknowledgments

Researchis a surprisinglysocialactivity. Graduateschoolhasbeenan enormouslypositiveexperience,in largepartbecauseof theoutstandingpeoplearoundme.

I would like to thankmy advisor, RobertWilensky, for his supportive,unflaggingenthusiasmfor this work, his helpful analyticinsights,andhis willingnessto let mechoosesomeof my ownpaths.His commentshave greatlyimprovedthis thesis,andhewasextraordinarilyfastat readingandreturningchaptersto me,despitehis pressingdutiesasDepartmentChair.

I would alsolike to thanktheothermembersof my committee:RayLarson,for contributinghis expertisein information retrieval, and Jerry Feldman. Dan Jurafsky readand critiqued thisentirethesis,andfor this hehasmy profusethanks.NarcisoJaramilloalsoprovidedmany helpfulcommentsat thefinal hour.

Mike Stonebrakerencouragedmetoentergraduateschoolin computerscience,wrotethecrucialletter, convincedme to cometo Berkeley, gave mea researchassistantshipunderwhich I earnedmy master’s degree,andevensupportedmefor a time afterI switchedfields. In many waysMikeis a visionaryandhis attitudesaboutthefield andhow to doresearchhave stronglyinfluencedme.

PeterNorvig, whoknowsaboutall NLP work everdone(andhasimplementeda largefractionif it) hashelpedmeoutatseveralcritical strategic junctions,andtriedhardnot to look atmefunnythefirst timeI mentioned“big text”.

I cannotbegin to statethe importanceof my continuedassociationwith theXerox Palo AltoResearchCenter. I oweanenormousdebtto Per-Kristian Halvorsen,aMontague-semanticianwholiked my off-the-wall ideasaboutcognitive linguisticsandinvited me to spendmy first summerat PARC. JanPedersen,as project leaderfor the informationaccessgroup,andasa friend, hasbeenconstantlysupportive, and hasansweredinnumerablequestionsaboutstatistics. Over thepastfive years,the ThursdayReadingGrouphasprovided a thought-provoking but lightheartedforum for discussionof computationallinguisticsandinformationaccess.In additionto JanandPer-Kristian,thisinsightfulgrouphasincludedFrancineChen,DougCutting,MaryDalrymple,KenFeuerman,David Hull, RonKaplan,LaurieKarttunen,Martin Kay, David Karger, JulianKupiec,Chris Manning, JohnMaxwell, Hinrich Schutze, PenniSibun, JohnTukey, Lynn Wilcox, MegWithgott, andAnnie Zaenen. JeanetteFigueroais the mosteffective, efficient, andaffectionateadministrative assistantthat anyone could ever hopeto work with. Mark Weiseris a model toemulate.

Daily life in graduateschoolsparkledin TheOffice of All Topics(no topic is taboo). I willmisstheanalysissessionsandthelaughtersharedwith MichaelBraverman,nj Jaramillo,andMikeSchiff, not to mentiontheirskills atpapercritiquing, talk debugging,andquestionanswering.

My otherBerkeley colleaguesin (computational)linguistics– JaneEdwards,AdeleGoldberg,DanJurafsky, SteveOmohundru,TerryRegier, AndreasStolcke,andDekaiWu–aresimultaneouslybrilliant andfun andhave greatlyenrichedmy understandingof thefield.

I am the last of the four “mars” to graduate:Mary Gray Baker, Marie desJardinsandMargoSeltzerhavebeeninvaluablefriendsandcolleagues.Othersoncampusthathave providedsupport,advice,andfriendshipincludeNinaAmenta,ElanAmir, PaulAoki, FrancescaBarrientos,MarshallBern, Eric Enderton,Mark Ferrari, JohnHartman,Mor Harchol, Chris Hoadley, Kim Keeton,AdamLandsberg, Steve Lucco,BruceMah,Nikki Mirghafori,SylviaPlevritis, Patti Schank,MarkSullivan,SethTeller, TaraWeber, andRandiWeinstein.

Page 132: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CONTENTS vi

5.4 Conclusions Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 109

6 Conclusions and Future Work 110

A Tocqueville, Chapter 1 113

Bibliography 116

Page 133: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

CONTENTS v

3.3 LongTexts andTheir Properties ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 403.4 Distribution-SensitiveInformationAccess Â�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 44

3.4.1 TheProblemwith Ranking ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 443.4.2 Analogyto Problemswith QuerySpecification Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 453.4.3 TileBars Â�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 483.4.4 CaseStudies Â[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 52

3.5 Passage-basedInformationAccess Â[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 573.5.1 An Analysisof two TRECTopic DescriptionsÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 593.5.2 Similarity-basedPassageRetrieval ExperimentsÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 653.5.3 OtherApproaches ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 68

3.6 Conclusions Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 69

4 Main Topic Categories 714.1 Introduction Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 714.2 Preview: How to useMultiple Main TopicCategories ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 714.3 AutomaticAssignmentof Multiple Main Topics ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 73

4.3.1 Overview ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 734.3.2 Yarowsky’sDisambiguationAlgorithm ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 744.3.3 Lexically-BasedCategories ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 754.3.4 DeterminingSalientTerms ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 754.3.5 RelatedWork andAdvantagesof theAlgorithm Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 77

4.4 Evaluationof theCategorizationAlgorithm ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 794.4.1 TheTestSet Â[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 804.4.2 TheExperiment Â[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 814.4.3 Analysisof Results Â[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 82

4.5 CreatingThesauralCategories Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 854.5.1 CreatingCategoriesfrom WordNet ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 874.5.2 AssigningTopicsusingtheOriginalCategory Set ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 894.5.3 CombiningDistantCategories Â[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 904.5.4 RevisedTopic Assignments ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 91

4.6 Conclusions Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 92

5 Multiple Main Topics in Information Access 975.1 Introduction Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 975.2 CurrentApproaches Â[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 98

5.2.1 OverallSimilarity Comparison ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 995.2.2 User-specifiedAttributes Â[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 101

5.3 Multiple Main Topic Display Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 1025.3.1 DisplayingFrequentTerms ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 1035.3.2 DisplayingMain Topic Categories Â�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 1045.3.3 A BrowsingInterface Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 1045.3.4 Discussion ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 107

Page 134: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

iv

Contents

1 Introduction 11.1 Full-Text InformationAccess Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 11.2 An Approachto ComputationalLinguistics ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 6

2 TextTiling 72.1 Introduction:Multi-paragraphSegmentation Â[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 72.2 Whatis SubtopicStructure? Â[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 82.3 Why Multi-ParagraphUnits? Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 9

2.3.1 Corpus-basedComputationalLinguistics Â[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 92.3.2 OnlineText DisplayandHypertext ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 112.3.3 Text SummarizationandGenerationÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 11

2.4 DiscourseStructure Â[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 132.4.1 Granularityof DiscourseStructure ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 142.4.2 Topologyof DiscourseStructureÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 152.4.3 GrammarsandScripts Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 17

2.5 DetectingDiscourseStructure Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 182.5.1 DistributionalPatternsof CohesionCues Â[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 182.5.2 Lexical CohesionRelations ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 19

2.6 TheTextTiling Algorithm ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 232.6.1 Tokenization Â[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 252.6.2 Similarity Determination Â[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 252.6.3 BoundaryIdentification ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 262.6.4 Embellishments Â[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 27

2.7 Evaluation ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 292.7.1 ReaderJudgments ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 292.7.2 Results ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 30

2.8 An ExtendedExample:TheTocqueville Chapter Â[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 332.9 Conclusions Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 36

3 Term Distribution in Full-Text Information Access 373.1 Introduction Â[ÂTÂ�ÂTÂ[ÂMÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 373.2 Background:StandardRetrieval TechniquesÂTÂ[ÂMÂ[Â[ÂMÂ[ÂMÂ[ÂMÂ[ÂTÂ�ÂTÂ[ 38

Page 135: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

iii

In memoryof my grandfather, Alan Joseph.

Page 136: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

2

theiroverallsimilarity to aquery. For example,ausercanchooseto view thosedocuments

thathaveanextendeddiscussionof onesetof termsanda brief but overlappingdiscussion

of asecondsetof terms.Thisrepresentationalsoallowsfor relevancefeedbackonpatterns

of termdistribution.

TileBarsdisplaydocumentsonly in termsof wordssuppliedin the userquery. For a

givenretrievedtext, if thequerywordsdonotcorrespondto its maintopics,theusercannot

discernin whatcontext thequerytermswereused.For example,a queryoncontaminants

mayretrieve documentswhosemaintopicsrelateto nuclearpower, food,or oil spills. To

addressthis issue,I describea graphicalinterface,calledCougar, that displaysretrieved

documentsin termsof interactionsamongtheir automatically-assignedmain topics,thus

allowingusersto familiarizethemselveswith thetopicsandterminologyof atext collection.

Page 137: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

1

Abstract

Context and Structurein Automated Full-Text Information Access

by

Marti A. Hearst

Doctorof Philosophyin ComputerScience

Universityof CaliforniaatBerkeley

RobertWilensky

ThesisChair

Thisdissertationinvestigatestheroleof contextualinformationin theautomatedretrieval

anddisplayof full-text documents,usingrobustnaturallanguageprocessingalgorithmsto

automaticallydetectstructurein and assigntopic labelsto texts. Many long texts are

comprisedof complex topic andsubtopicstructure,a fact ignoredby existing information

accessmethods.I presenttwoalgorithmswhichdetectsuchstructure,andtwovisualdisplay

paradigmswhich usethe resultsof thesealgorithmsto show the interactionsof multiple

maintopics,multiplesubtopics,andtherelationsbetweenmaintopicsandsubtopics.

The first algorithm, called TextTiling, recognizesthe subtopicstructureof texts as

dictatedby their content. It usesdomain-independentlexical frequency anddistribution

information to partition texts into multi-paragraphpassages.The resultsare found to

correspondwell to readerjudgmentsof majorsubtopicboundaries.Thesecondalgorithm

assignsmultiplemaintopiclabelstoeachtext, wherethelabelsarechosenfrompre-defined,

intuitivecategorysets;thealgorithmis trainedonunlabeledtext.

A new iconicrepresentation,calledTileBars usesTextTiles to simultaneouslyandcom-

pactlydisplayquerytermfrequency, querytermdistributionandrelativedocumentlength.

This representationprovidesan informative alternative to rankinglong texts accordingto

Page 138: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

Context andStructure

in AutomatedFull-Text InformationAccess

Copyright cé

1994

by

Marti A. Hearst

Page 139: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

Context and Structure

in Automated Full-Text Information Access

by

Marti A. Hearst

B.A. (Universityof CaliforniaatBerkeley) 1985M.S. (Universityof Californiaat Berkeley) 1989

A dissertationsubmittedin partialsatisfactionof the

requirementsfor thedegreeof

Doctorof Philosophy

in

ComputerScience

in the

GRADUATE DIVISION

of the

UNIVERSITY of CALIFORNIA at BERKELEY

Committeein charge:

ProfessorRobertWilensky, ChairProfessorRayLarsonProfessorJeromeFeldman

1994

Page 140: BIBLIOGRAPHY 129 AROWSKY AVID. 1992. Word sense ...people.ischool.berkeley.edu › ~hearst › papers › phdthesis.pdf · Answer passage retrieval by text searching. Journal of the

Context and Structure in Automated Full-Text

Information Access

Marti A. Hearst

Report No. UCB/CSD-94/836

April 29,1994

ComputerScienceDivision(EECS)

Universityof California

Berkeley, California94720