[ACM Press the eleventh international conference - McLean, Virginia, USA (2002.11.04-2002.11.09)] Proceedings of the eleventh international conference on Information and knowledge

Thematic Mapping – From Unstructured Documents to Taxonomies

Christina Yip Chung, Raymond Lieu, Jinhui Liu, Alpha Luk, Jianchang Mao, Prabhakar Raghavan

Verity, Inc. 894 Ross Drive

Sunnyvale, CA 94089 +1-408-541-1500

{cchung, rlieu, jliu, aluk, jmao, praghava}@verity.com

ABSTRACT Verity Inc. has developed a comprehensive suite of tools for accurately and efficiently organizing enterprise content, a process that involves four basic steps: (i) creating taxonomies, (ii) building classification models, (iii) populating taxonomies with documents, and (iv) deploying populated taxonomies in enterprise portals. A taxonomy is a hierarchical representation of categories that provides a navigation structure for exploring and understanding the underlying corpus without sifting through a huge volume of documents. Thematic Mapping automatically discovers a concept tree from a corpus of unstructured documents and assigns meaningful labels to concepts based on a semantic network. Through its integration with Verity Intelligent Classifier’s user-friendly GUI, a user can drill down a concept tree for navigation, perform a conceptual search to retrieve documents pertaining to a concept, build a taxonomy from the concept tree, and edit a taxonomy to tailor it into various views (customized taxonomies) of the same corpus. Classification rules can be generated automatically from concepts and then used to populate the taxonomy with documents.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods, linguistic processing.

General Terms Algorithms, Design.

Keywords Thematic Mapping, Concept Discovery, Clustering and Labeling, Concept Tree Construction and Visualization, Conceptual Search

1. INTRODUCTION Searching provides an efficient way for users to find relevant information in huge amounts of data, but only if they know what to search for. Users therefore need help forming effective queries, which browsing and navigating information in a relatively small and manageable space can provide. Examples of such a space include taxonomies and concept trees. Taxonomies organize documents into navigable structures to assist users in finding relevant information. Searches within a category typically produce more relevant results than un-scoped searches. Concept trees organize concepts from general to specific and can assist users in understanding the information content of a collection.

Taxonomy construction is a challenging problem that works best with sufficient domain knowledge. Fully automatic construction often leads to unsatisfactory results. Consequently, most taxonomies are built and maintained manually by human experts. Well-known examples include the directory structures of Yahoo! and the Open Directory Project. However, manual taxonomy construction is very time consuming and costly. Systems that can (partially) automate or facilitate the taxonomy construction process are thus highly desirable.

This paper describes the Thematic Mapping system, a key component of Verity Intelligent Classifier, Verity’s comprehensive enterprise content organization tool suite. Thematic Mapping automatically identifies important concepts in a corpus of unstructured documents. The concepts identified are arranged into a tree hierarchy to reflect general-to-specific relationships, and meaningful names are assigned to them. Concept trees generated by Thematic Mapping can be used as seed taxonomies for the corpus. Through Verity Intelligent Classifier’s user-friendly GUI, a user can easily build a taxonomy from a concept tree, as well as edit a taxonomy to tailor it into various views (customized taxonomies) of the same corpus. Classification rules can be generated automatically from concepts in the concept tree. These classification rules can be used for populating documents into the taxonomy.
The Thematic Mapping system also supports concept tree/taxonomy visualization and navigation, and conceptual search (searching for documents pertaining to a concept).

2. THEMATIC MAPPING SYSTEM The Thematic Mapping system consists of the following modules: (i) Signature Extractor, (ii) Concept Clustering, (iii) Concept Labeling, (iv) Visualization, (v) Conceptual Search, (vi) Taxonomy Construction. All these modules have been integrated into Verity Intelligent Classifier [15].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM ’02, November 4-9, 2002, McLean, Virginia, USA. Copyright 2002 ACM 1-58113-492-4/02/0011…$5.00.


A signature is a noun or noun phrase. A concept is a cluster of related signatures. A concept tree organizes related concepts into a tree hierarchy in which general concepts are parents of specific concepts.

Figure 1. Thematic Mapping System Diagram.

Signature Extractor identifies signatures that characterize the content of input documents using part-of-speech tagging and corpus statistics analysis.

Concept Clustering discovers concepts and organizes them into a concept tree. First, probability distributions of signatures are computed and refined to address data sparsity and polysemy issues. Second, related signatures are identified with a similarity measure derived from a modified Kullback-Leibler distance and mutual information. This is followed by a clustering process that organizes concepts into a concept tree. In a high-quality concept tree where each node is represented by a cluster of signatures, clusters are dissimilar (high inter-cluster distance) while members of a cluster are similar (low intra-cluster distance). The connectivity between two clusters measures their inter-cluster similarity, which is the average similarity between any two members drawn one from each cluster. The intra-cluster similarity of a cluster is captured by its compactness, which is the average similarity between its members.

A greedy agglomerative approach is adopted to group concepts into a taxonomy. The input to the procedure is a set of signatures, which are treated as clusters of size one. At each iteration, the pair of clusters with the highest connectivity is either joined to form a new cluster or merged together, and the result replaces the original clusters. As a result, the concept tree produced is not necessarily (and usually not) a binary tree. Several heuristics, based on the principles of minimizing overall connectivity and maximizing overall compactness, are used for determining whether to join the two clusters or to merge them. The process stops when the highest connectivity among all pairs of clusters falls below a certain threshold.

Concept Labeling assigns meaningful labels to concepts in a taxonomy based on a semantic network.
Labels are not necessarily signatures in concepts. Good labeling is critical for easy navigation of a large taxonomy. The senses of a signature in a cluster are first determined based on a lexical reference system (such as WordNet [10]). Each sense of a signature is represented by a set of words describing its meaning. This set of words is then expanded by including synonyms, hypernyms and hyponyms of the sense. Dominant concepts of a cluster are identified as follows: If the number of common signatures between a pair of senses is above a predefined threshold, the similarity measure between the senses is 1; otherwise, it is 0. The support of a sense is the number of senses pertaining to the cluster whose similarity measure to that sense equals 1. Dominant concepts for the cluster are the senses with the highest support. Each cluster is named by meaningful labels which are derived from its dominant concepts and the signatures of its child clusters. This way we capture concepts with low support in isolated child clusters but with high enough support in the parent cluster. A heuristic is used to choose a label that is neither too generic nor too specific, based on the depths of candidate labels in the WordNet hierarchy. If no dominant concept is found, the two most frequent signatures in a cluster are selected as its label. Meaningful labels are assigned to clusters in a concept tree in a bottom-up fashion. This reduces the time complexity of the process.

Visualization provides a user-friendly GUI for a user to browse and navigate a taxonomy. This function is supported in Verity Intelligent Classifier. A user can display signatures belonging to a concept, expand/collapse a concept, as well as search for concepts containing a signature.

Conceptual Search assists a user in forming effective queries by automatically creating Verity Topics (stored queries) based on concept signatures in a concept tree. The Verity Topic for a concept consists of signatures of the concept with weights computed based on statistics in the underlying corpus.

Taxonomy Construction assists a user in creating a taxonomy and in populating the taxonomy with relevant documents. A concept tree created by Thematic Mapping is typically used as a seed taxonomy. The system supports manual editing of a concept tree or a taxonomy.
A user can rename the label of a node (concept or category), add/delete a node, move a node around in the tree, and add/delete signatures to/from a concept. The taxonomy can be populated by associating its categories with documents retrieved using Verity Topics created by Conceptual Search.
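
The greedy agglomerative procedure described under Concept Clustering can be sketched in Python as follows. This is a minimal illustration and not Verity's implementation. Several details are assumptions, because the paper does not give them: the names Node, table_similarity, build_concept_tree, stop_threshold and merge_ratio are hypothetical; the similarity function is supplied by the caller rather than derived from Kullback-Leibler distance and mutual information; the compactness of a single-signature cluster is taken to be 1.0; and the join-or-merge decision uses a single placeholder rule in place of the unspecified heuristics. In this sketch, joining creates a new parent over the two clusters, while merging fuses them into one node.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """A concept: a cluster of signatures, optionally with child concepts."""
    signatures: frozenset
    children: list = field(default_factory=list)


def table_similarity(table):
    """Wrap a dict of pairwise scores as a symmetric similarity function."""
    def sim(x, y):
        return table.get((x, y), table.get((y, x), 0.0))
    return sim


def connectivity(a, b, sim):
    """Average similarity over all pairs with one member from each cluster."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))


def compactness(c, sim):
    """Average similarity between members; 1.0 for singletons (assumption)."""
    pairs = list(combinations(c, 2))
    if not pairs:
        return 1.0
    return sum(sim(x, y) for x, y in pairs) / len(pairs)


def build_concept_tree(signatures, sim, stop_threshold=0.3, merge_ratio=0.85):
    """Greedily combine the most connected pair until connectivity is too low."""
    clusters = [Node(frozenset([s])) for s in signatures]
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda p: connectivity(p[0].signatures, p[1].signatures, sim),
        )
        conn = connectivity(a.signatures, b.signatures, sim)
        if conn < stop_threshold:
            break
        loosest = min(compactness(a.signatures, sim),
                      compactness(b.signatures, sim))
        union = a.signatures | b.signatures
        if conn >= merge_ratio * loosest:
            # Placeholder rule: the pair is about as tight as its members,
            # so merge into one node (children are pooled, tree may be non-binary).
            new = Node(union, a.children + b.children)
        else:
            # Otherwise join: keep both clusters as children of a new parent.
            new = Node(union, [a, b])
        clusters = [c for c in clusters if c is not a and c is not b] + [new]
    return clusters


toy_scores = {
    ("stock", "share"): 0.9, ("stock", "bond"): 0.5, ("share", "bond"): 0.5,
    ("goal", "striker"): 0.8,
}
roots = build_concept_tree(
    ["stock", "share", "bond", "goal", "striker"], table_similarity(toy_scores)
)
for root in roots:
    print(sorted(root.signatures), [sorted(c.signatures) for c in root.children])
```

With the toy scores above, "stock" and "share" are merged into a single concept, which is then joined with "bond" under a common parent, while "goal" and "striker" form a separate root. Clustering stops because the two remaining roots have connectivity below stop_threshold.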

Figure 2. Raw Concept Tree generated by running Thematic Mapping on a collection of articles from San Jose Mercury News. The concept labels shown are automatically generated. Users can enrich the labels manually.
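
The support computation behind the automatically generated labels, as described under Concept Labeling, can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the system's actual code. The sense inventory is a small hand-made dictionary standing in for WordNet, and the sense identifiers in it are invented for the example. Similarity between senses is computed from shared entries in their expanded word sets. A sense is not counted as supporting other senses of the same signature, which is an assumption. The function names dominant_senses and label_cluster and the parameter min_common are hypothetical. The depth-based heuristic for choosing a label that is neither too generic nor too specific is omitted.

```python
def sense_similarity(words_a, words_b, min_common):
    """Return 1 if the two senses share more than min_common words, else 0."""
    return 1 if len(words_a & words_b) > min_common else 0


def dominant_senses(cluster, sense_inventory, min_common=1):
    """Return the sense ids with the highest support within the cluster."""
    senses = [
        (sig, sense_id, words)
        for sig in cluster
        for sense_id, words in sense_inventory.get(sig, {}).items()
    ]
    support = {}
    for sig, sense_id, words in senses:
        # Support: how many senses of *other* signatures are similar (assumption).
        support[sense_id] = sum(
            sense_similarity(words, other_words, min_common)
            for other_sig, _, other_words in senses
            if other_sig != sig
        )
    best = max(support.values(), default=0)
    if best == 0:
        return []
    return sorted(s for s, n in support.items() if n == best)


def label_cluster(cluster, sense_inventory, frequencies, min_common=1):
    """Use dominant senses as label candidates, else the two most frequent signatures."""
    dominant = dominant_senses(cluster, sense_inventory, min_common)
    if dominant:
        return dominant
    ranked = sorted(cluster, key=lambda s: (-frequencies.get(s, 0), s))
    return ranked[:2]


# Toy stand-in for a lexical reference; each sense maps to its expanded word set.
toy_inventory = {
    "bank": {
        "bank.institution": {"financial", "institution", "money", "deposit"},
        "bank.river": {"slope", "land", "river", "water"},
    },
    "loan": {"loan.debt": {"money", "lend", "interest", "financial"}},
    "interest": {
        "interest.charge": {"money", "financial", "fee", "lend"},
        "interest.curiosity": {"attention", "curiosity", "concern"},
    },
}
print(label_cluster(["bank", "loan", "interest"], toy_inventory, {}))
print(label_cluster(["sharks", "playoffs", "arena"], {},
                    {"sharks": 9, "playoffs": 5, "arena": 1}))
```

In the first call the three financial senses support one another and become the label candidates, while the river and curiosity senses receive no support. In the second call no senses are known, so the fallback of the two most frequent signatures applies.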


3. APPLICATIONS OF THEMATIC MAPPING Thematic Mapping has been applied to a large number of corpora of different genres. Examples include newspaper articles, Usenet newsgroup documents, patent documents, and web documents from various content providers. While the quality of the concept tree generated depends on the quality and style of the documents, Thematic Mapping discovered interesting concepts and taxonomies for all the corpora tested. For instance, the concept tree discovered by running Thematic Mapping on the San Jose Mercury News corpus (Figure 2) resembles the newspaper sections quite well.

4. SUMMARY AND RELATED WORK Effectively mining relevant information from a large volume of unstructured documents has received considerable attention in recent years [4][7][9]. Verity has developed a suite of tools to facilitate effective information retrieval, browsing, navigation, and content organization. Thematic Mapping is a novel approach to assist users in information retrieval that combines technologies in clustering, use of lexical reference, classification and visualization. Thematic Mapping technologies offer the following advantages:

1. The system operates in the concept space by grouping related words. Clusters of words clearly illustrate the information content of a corpus. Most related research operates in the document space by grouping related documents. The concept space is typically more manageable and contains more useful structures for navigating and searching a corpus than the document space with a large volume of documents.

2. The system assigns meaningful labels to concepts, which is critical in navigating a large taxonomy. Existing approaches do not typically assign meaningful labels to concepts [6][13], have to choose labels from words in concepts [11][12] or have to rely on meta-data [8].

3. The system uses both statistical information derived from a corpus and lexical reference. Smoothing and disambiguation techniques are used to refine the raw statistics. Approaches that rely on lexical reference alone are limited by the coverage of the lexical reference and are domain-dependent [6].

4. The system is capable of discovering a concept tree of adjustable depth. Anick and Tipirneni [1] used lexical reference to discover concepts of depth 2. Their approach does not lend itself directly to the discovery of a concept tree of more than two levels.

5. The system abstracts out information of a corpus at different abstraction levels by the tree representation of concepts and by the cluster representation of concepts. A tree representation is more useful for understanding general-to-specific relationships between concepts than a graph representation [5]. Similarly, organizing concepts (clusters of words) into a hierarchy is more applicable to understanding a corpus than organizing words into a hierarchy [14].

5. ACKNOWLEDGMENTS We would like to thank Sumit Taank and Vamsi Vutokuru from the University of Texas at Austin for their contributions to the project during the summer of 2001.

6. REFERENCES [1] P. G. Anick, S. Tipirneni. The Paraphrase Search Assistant: Terminological Feedback for Iterative Information Seeking. International Conference on Research and Development in Information Retrieval (SIGIR 1999), pp.153-159.

[2] C. Chung, A. Luk, J. Mao, S. Taank. A Method and System for Naming a Cluster of Words and Phrases. US patent application filed through Verity, Inc. 2001.

[3] C. Chung, J. Liu, A. Luk, J. Mao, S. Taank, V. Vutukuru. A System and Method for Automatically Discovering a Hierarchy of Concepts From a Collection of Documents. US patent application filed through Verity, Inc. 2002.

[4] B. S. Everitt, S. Landau, M. Leese. Cluster Analysis. Edward Arnold. ISBN: 0340761199. 4th edition. May 2001.

[5] R. H. Fowler, B. A. Wilson, W. A. L. Fowler. Information Navigator: An Information System using Associative Networks for Display and Retrieval. Department of Computer Science, University of Texas at Pan American. Technical Report NAG9-551, #92-1.

[6] B. Gelfand, M. Wulfekuhler, and W. F. Punch III. Automatic Concept Extraction From Plain Text. AAAI Workshop on Learning for Text Categorization, Madison, July 1998.

[7] M. A. Hearst. Text data mining: Issues, techniques, and the relationship to information access. Presentation notes for UW/MS workshop on data mining, July 1997.

[8] T. Honkela, S. Kaski, K. Lagus, and T. Kohonen. WebSOM – Self-Organizing Maps of Document Collections. In Proceedings of Workshop on Self-Organizing Maps (WSOM97), Espoo, Finland, 1997.

[9] R. Kosala, H. Blockeel. Web Mining Research: A Survey. SIGKDD Explorations. 2000.

[10] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller. Introduction to WordNet: An On-line Lexical Database. Communications of the ACM. Nov. 1995. pp.39-41.

[11] A. Popescul and L. H. Ungar. Automatic Labeling of Document Clusters. http://www.cis.upenn.edu/~popescul/Publications/labeling_KDD00.pdf, 2000.

[12] A. Rauber. LabelSOM: On the Labeling of Self-Organizing Maps. http://www.ifs.tuwien.ac.at/~andi, 1999.

[13] A. E. Smith. Machine Mapping of Document Collections: the Leximancer. Proceedings of the 5th Australasian Document Computing Symposium. Sunshine Coast, Australia. December 1, 2000.

[14] M. Sanderson and B. Croft. Deriving Concept Hierarchies From Text. International Conference on Research and Development in Information Retrieval (SIGIR 1999), pp.206-213.

[15] Verity K2 Enterprise, Classification Users Guide V4.5. 2002.
