Creating and exploiting aggregate information from text using a novel Categorized Document Base
Page 1 of 93
ABSTRACT
Document classification has long been a popular field of research in information retrieval.
Classification of documents is typically used to aid in faster location of relevant documents by end
users. In this paper, we present a method for constructing a Categorized Document Base (CDB) and
assess whether the categorization of a document collection in this manner can be helpful for another
purpose: understanding of, and comparison of, the categories in the document collection, through the
use of aggregate statistics from the documents in each category. Our experimental results indicate that,
for a convenience sample of properties for which we were able to obtain independent assessments of their values, there is a relatively clear association – in non-population-sensitive industries – between what the CDB-based methods produce and what the independent data indicates. This is evidence that CDB-based methods could be useful for gauging non-population-sensitive properties for which independent data does not exist.
Key words
Text Mining, Information Retrieval, Categorization
1. INTRODUCTION
Modern search engines have demonstrated their ability to retrieve and rank documents relevant to a
given search term and are well suited to finding documents relating to different topics. However, what
if an end-user would like to compare topics instead of merely retrieving a document? It is relatively straightforward to take a categorization scheme, and pass each category (topic) in the scheme as a search term to a search engine, in order to gather a certain number of relevant documents for hundreds or even
thousands of categories. Once the few most relevant documents for each category have been obtained,
categories (topics) can be compared by computing aggregates on the documents for each category –
such as hit counts or relative frequencies for a specific word or phrase in the documents for each
category. But, is this information sensible? Will it correlate with what is known about those categories
(topics) from other data sources? Let us take a specific example: Can we reliably determine, from text
documents for multiple locations gathered and analyzed in the fashion described above, how those
locations compare with regard to, for instance, environmental characteristics (sunshine, warmth, natural
resources like oil and coal, existence of mountains, forests, or fishing) or market characteristics (demand
for different products and services across those locations)?
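The comparison described above can be sketched in a few lines of code. This is an illustrative sketch only, not the authors' implementation: `fetch_top_documents` is a hypothetical placeholder for whatever search engine or API is used to retrieve the most relevant documents per category.

```python
def fetch_top_documents(category: str, n: int = 10) -> list[str]:
    """Hypothetical stand-in for a real search API: return the full text
    of the n most relevant documents for `category`."""
    raise NotImplementedError("wire this up to a search engine of your choice")

def phrase_hits(documents: list[str], phrase: str) -> int:
    """Count total case-insensitive occurrences of `phrase` across documents."""
    p = phrase.lower()
    return sum(doc.lower().count(p) for doc in documents)

def compare_categories(categories: list[str], phrase: str, n: int = 10):
    """Rank categories (topics) by hit count for `phrase` in their top-n documents."""
    scores = [(c, phrase_hits(fetch_top_documents(c, n), phrase)) for c in categories]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

The essential point is that each category is scored by an aggregate over its own small set of highly relevant documents, so categories can be compared directly.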
Our goal in this paper is to determine if the analysis obtained from unstructured text documents
using the above-described means is comparable in quality to data for those locations obtained from
conventional structured sources, such as traditional environmental surveys or market statistics. This
paper posits that it is indeed possible to obtain comparative information on locations by employing
search engines in the simple but unusual manner described above. Specifically, our objective is to test
the following hypothesis:
Hypothesis H1: aggregate information on various natural and market phenomena,
extracted from text documents for United States locations in the fashion described above,
can provide better-than-random rankings of those locations based on the environmental
or market characteristics of those locations.
As shown in the Experimental Results (Section 6) we find some support for this hypothesis in
industries which are not sensitive to population – in these industries the Categorized Document Base
(CDB) is able to fruitfully compare locations that diverge widely on some environmental or market
characteristic. However, we find the hypothesis is not supported for population-sensitive industries, where the CDB is unable to discern subtle per-capita distinctions in market characteristics by location.
This paper, then, describes a novel method for creating and exploiting aggregate information from
CDBs, and assesses the quality of the information so produced. In a CDB, document sets are organized
by topic. CDBs occupy a useful middle ground between highly structured (e.g. tabular) data, and
completely unstructured (textual) data, by introducing some order and organization into the document
set. We aim to show that aggregate information can be derived from categorized document sets, and
that this information is valuable.
We begin with a motivation for this research, and a discussion of related work. We show how our
CDB approach differs from prior work in document classification: in particular, the prior art focuses on
the use of categorization to narrow down search results, whereas our CDB approach is targeted at
aggregate analyses. Next, we describe the process for creating and querying a CDB: from creation and
population of the classification scheme, and construction of the search term, to tabulation of the results
for the sub- and super-categories, and integration with other external data for the categories. We
document the technical implementation of our CDB prototype, including the system architecture and
data structure. To validate our approach for the creation and analysis of categorized document
collections, we conduct experiments in a variety of industries. In these experiments, aggregate
information is generated from the CDB, categorized by location. Our experiments reveal that the CDB
can act as a rough instrument for discerning differences between locations: for industries where the
differences between locations are substantial, the aggregates computed by the CDB for each location are
statistically correlated with quantitative measures that we found for related natural and commercial
phenomena in those industries, from traditional structured data sources. In contrast, for industries with
subtle differences between locations, the CDB does not appear to be able to discern differences between
locations. Following our experiments, we discuss a number of useful applications of the CDB, and
conclude with limitations of the approach, areas for future work, and a summary of our process and
results.
2. MOTIVATION
Modern search engines produce a large number of relevant results for a given search term [111].
Unfortunately, comparing the results for different search terms – for example, comparing the results for
different categories in a taxonomy – is difficult. To compare the categories, users must consult hundreds of documents and soon suffer information overload: they quickly reach a futility point
[11] beyond which they will not search. Current research aimed at categorizing documents or search
results typically has as its goal narrowing down from a large collection of documents, to progressively
smaller collections, until a single relevant document or set of documents can be found [5, 15,
34, 47, 73, 74, 79, 80, 81, 136, 141, 143]. In contrast, the motivation for the current research is to
provide aggregations of search results across multiple categories, to facilitate comparison of the
categories themselves. Our goal is to allow the user to evaluate and compare categories using
aggregate statistics, rather than the conventional goal of allowing users to find a relevant document
more rapidly. Effectively, we are looking to create new summary information for categories by
categorizing, ranking, and filtering millions of documents, rather than simply trying to find facts already
explicitly encoded in single documents. The summary information allows a management user to
determine the prevalence of a search phrase in millions of relevant documents for thousands of
industries, products, places, time periods, or other categories. This allows the manager to rank
industries, products, places, or topics, which, we suggest, could be helpful, for instance, for devising a
market rollout strategy. Of course, the analysis of documents from multiple categories is only useful if
the comparison it yields of the categories is a fair reflection of reality. Our goal is to test the validity of
the comparison generated from unstructured text, and to determine, using external evidence, whether the
comparison is a sensible and viable proxy for other available comparative data on the categories1.
3. RELATED WORK
Manual and automatic categorization of documents using classification schemes is a well studied field
of research [70]. In this section, we provide a thorough survey of the existing literature on document
classification, and we explain how our work relates to the document classification field.
Basic approaches to document classification involve manual tagging by humans: for example
Yahoo! [82] and domain-specific subject gateways [134], both of which are quality-controlled document collections, organized by subject.
1 The reader may be curious as to why compilation of aggregate data from text is useful if similar data is already available in structured sources. There are numerous reasons why the ability to glean the information from text is helpful:
a) Cost: Aggregate information from text may be cheaper to generate than using alternative sources for that information (e.g. gleaning data from text on locations may be cheaper than conducting geological surveys on those locations – for natural data – or market or sociological research on those locations – for commercial or social data).
b) Time: Aggregate information from text may be more current if the text is current (e.g. alternative sources, typically constructed by manual human market research, may be years out of date).
c) Breadth: Aggregate information from text may provide data on a broad range of phenomena that have not yet been the subject of manual surveys. E.g. while the economic effects of federal stimulus money on various locations can be readily assessed from tax return data, social effects are more difficult to rapidly and cheaply observe and would traditionally require manual surveys that may be prohibitively costly. If it can be shown that aggregate information from text is plausible, textual sources may more readily be consulted to assess such social effects, particularly in cases where manual surveys of a certain social phenomenon do not exist.
Numerous document classification approaches use automatic
classification by machine. Automatic text categorization research is many decades old [8, 9], and
dozens of text categorization and topic identification approaches have been suggested in the literature.
Automatic text categorization approaches can broadly be broken down into unsupervised and supervised
approaches:
3.1 Unsupervised automatic document classification
Unsupervised approaches for document classification include, inter alia, the use of Kohonen
self-organizing feature maps or neural nets to cluster documents [26, 44, 45, 56], hierarchical
clustering [23, 66, 133], Cluster Abstraction Models [59], and Suffix Tree Clustering based on
shared phrases [18, 140]. These approaches are termed ‘unsupervised’ as the algorithms are not
trained using a historic set of correctly classified documents but, rather, learn by finding
similarities within documents, and clustering documents by similarity.
3.2 Supervised automatic document classification
In supervised approaches a small number of documents are manually tagged with a limited set of
category identifiers. A machine learning approach is then used to infer general patterns so that
unseen documents can automatically be assigned to the categories. In Neural Network
approaches [27, 32, 88, 94, 113, 129, 135] a training set of documents and their categories is
used to adjust the arc weights in a graph of nodes composed of mathematical functions, until a
desired classification accuracy is obtained. Bayesian approaches [9, 21, 61, 69, 78, 80, 83, 100,
124, 130] make use of conditional probabilities to determine the likelihood of a document being
in a class given certain features of that document. In K-Nearest-Neighbor approaches [19, 58,
49, 50, 80, 84, 98] a document is compared to a set of pre-classified documents, and its class is
taken as the majority class of the few most similar documents. With rule induction approaches
[1, 25, 76, 89] a set of rules is induced from a historic set of documents pre-tagged with their
categories, and the set of rules is then used to classify unseen documents. The induced rules
have the form “if (x AND y AND z AND …) then (document is in category P)”. Decision trees
[2, 19, 90] proceed similarly, except individual rule clauses – that is, simple condition
evaluations – are applied in turn, starting with a root clause, and proceeding with different
clauses in a branching fashion depending on the outcome of each condition evaluation. With
Support Vector Machines approaches [65, 85], vectors of n document features are defined, and
the hyperplane that provides maximal separation in the n-dimensional space is used to divide the
documents into separate categories. Genetic programming approaches [28, 126] start with
preliminary candidate classification rules – the best candidate rules are identified, and combined,
to produce successive generations of improved rules. Some automatic document classification
approaches employ Rocchio Relevance Feedback [64, 80] – these are based on Rocchio’s
algorithm [110] which attempts to find a query vector that maximizes the similarity of
documents in the same class, while minimizing their similarity to (i.e. maximizing the distance
from) documents in the alternate class. Miscellaneous other automatic document classification
approaches also exist: see, for example [67, 75, 91, 92, 131]. For comparisons and surveys of
text categorization approaches, see [47, 86, 117, 118, 137, 138, 139], and for a list of some
software implementations see [43].
Automatic document classification approaches typically make their inferences using internal
document content, external information, or user behaviors. Mechanisms that use document content [26]
may make use of key words, word strings / n-grams [54], linguistic phrases [106, 107, 108, 109], word-
clusters [124], or multi-word features [101] within the document. In contrast, some schemes use
external information in hyperlinked documents that point to the target document [40]. Finally, some
approaches make inferences from search logs and/or behavioral patterns of users searching and
accessing the document set [22] or retain user preferences [131, 132] when categorizing documents.
Modern search engines, such as Google, employ a variety of means – including internal document
content [7], external information [7], and user behaviors [142] – to determine whether a document
should be classed in a particular result set, and how it should be ranked.
By employing modern search engines to construct our CDBs (see Section 4.2), the CDB
construction approach presented in this paper benefits from the state-of-the-art composite classification
algorithms that are effectively available by sending search terms for multiple categories to Google,
Yahoo, or any other search engine.
Many practical software implementations of automatic classification systems exist. GERHARD
(German Harvest Automated Retrieval and Directory) automatically classifies web content according to
the Universal Decimal Classification (UDC) scheme [97]. WebDoc [130] uses a naïve Bayes approach
to assign documents to the Library of Congress classification scheme. Scorpion automatically assigns
Dewey Decimal Classifications or Library of Congress Classifications to documents [48, 120, 128].
Commercial text mining software, such as SAS Text Miner2, clusters documents into categories by
extracting key terms from each document. Commercial document content mining tools – often known
as customer experience intelligence software products – such as Aubice3, Clarabridge4, IslandData5,
QL2 MarketVoice6, and similar products, are targeted at monitoring market trends, customer feedback,
or potential fraud, by scouring, categorizing, and summarizing web documents. Larkey and Croft’s
system [80] uses a variety of techniques to automatically assign ICD9 codes (i.e. category of diagnosis)
2 http://www.sas.com/technologies/analytics/datamining/textminer/
3 http://www.aubice.com/
4 http://www.clarabridge.com/
5 http://www.islanddata.com/
6 http://www.ql2.com/
to dictated inpatient discharge summaries. Multiple systems for automatically categorizing news stories
exist [10, 57]. Though the foregoing list is not exhaustive, it demonstrates that automatic document
classification systems have enjoyed broad application.
A number of authors [5, 15, 28, 34, 47, 73, 74, 79, 80, 81, 131, 136, 141, 143] have proposed
techniques for clustering search results by category. The intention is typically to disambiguate different
senses or uses of the search term, or to simply visualize common topics in the search results, to facilitate
narrowing in on the correct document subset. For instance, “wind” results may be separated into
categories like:
“Weather” (containing documents like “A guide to wind speeds”)
“Movies” (containing documents like “Gone with the Wind”)
“Energy” (containing documents like “Wind Energy”)
Some approaches involve the post-processing of search results, so as to organize them by category,
to support more rapid navigation to the desired search results [5, 24, 140]. For example, commercial
search engines such as Northern Light7 and Vivisimo.com / Clusty.com, use post-processing to
categorize search results. In other cases, the document base is organized by category before conducting
the search (e.g. [105]).
In the field of Faceted Search [51, 52, 60] (also known as View-Based Search [104], Exploratory
Search [95, 114] or Hierarchical Search [39]) objects are catalogued, by annotating them with attribute
values or classifications, to facilitate navigation through or exploration of the catalogue. Objects in the
catalogue may be documents, web pages, products (e.g. consumer products), books, artworks, images,
people, houses, or other items. For example, the Flamenco faceted search engine demonstrates use of
7 nlresearch.com, US Patent No. 5,924,090
faceted search to browse catalogues of recipes [51], architectural images [52], Nobel prizewinners8, and
other domain-specific catalogues, with the ability to drill down by category (e.g. recipes may be divided
by ingredient, such as celery, onion, or potato). In a faceted search system, annotations may be
manually assigned or automatically assigned using information extraction techniques. The annotations,
or tags, allow the objects in the catalogue to be progressively filtered, to assist the user in finding a
particular type of item, and also allow the user to easily determine how many items of a certain type
exist in the catalogue. For example, a catalogue of products may be annotated to allow the user to find
all products of a certain brand, and within a certain price range. Faceted search techniques are very
useful for navigating a catalogue of a certain type of item through progressive filtering.
The mechanism we propose in this paper differs in a number of ways from the prior art. Firstly, we
gather documents into categories by obtaining the top results, by relevance, for each category, from any
existing search tool (e.g. Google, Yahoo, etc.). The authors currently have a patent pending on this
CDB construction and usage approach9. Secondly, our method is focused on exploration of the
aggregate statistics obtained from documents organized in a classification scheme, and not on searching
for a particular document. Unlike faceted search techniques, the CDB described in this paper is focused
on comparing categories using different metrics, rather than focusing on finding a particular type of
item by category, or on counting the number of items in each category. In contrast to faceted search
systems, where every category is populated with a certain type of item, CDBs populate categories with
the most relevant documents for those categories.
The contrast between faceted search techniques and the CDBs we describe in this paper is best
illustrated with an example. Consider a user wishing to compare US locations to see where fishing is
popular. Faceted search engines that are used to catalogue artworks, people, houses, or books are
8 http://flamenco.berkeley.edu/demos.html (Accessed on 13 March 2009).
9 United States Patent Application 20070106662 "Categorized Document Bases".
obviously not relevant, so the user looks for a faceted search engine that provides a catalogue of web
pages or images. The user types ‘fishing’ as the search term, but the results are categorized according to
classification schemes mandated by the faceted search engine. Figure 1 shows the results produced, for
example, by searching for ‘fishing’ on clusty.com. Figure 2 shows the results of the same search
performed using an alternative search results visualization engine, Kartoo.com. As can be seen from
both figures – which are indicative of the nature of output produced by search engines able to cluster or
categorize search results – information about fishing locations is haphazard and locations cannot be
compared. The user attempts another search ‘US locations fishing’. Figure 3 shows the results
produced by clusty.com10. Again, the result categories do not allow US locations to be compared to see
where fishing is popular. What the user really needs is, first, a dynamically constructed catalogue of US
locations, and then the ability to compare these locations for fishing. This is exactly what a CDB
provides: the ability to first generate an appropriate catalogue, and then allow comparison of categories
using an additional search term. In our approach, the user begins by obtaining a classification scheme
of US locations from the United States Geographic Names Service, and then feeds the classification
scheme to the CDB. The CDB populates each category (US location) with data, and the user is then
able to run a search term (‘fishing’) against the categories. Figure 4 shows the results produced by the
CDB. The US locations have been grouped by state, and the states have then been ranked by the
number of hits per state. The user can drill-down to see the number of hits per US location by clicking
on a US state. The report allows for easy comparison of US locations by fishing popularity11. In short,
CDBs allow the user to catalogue a type of item of their own choosing (by feeding a classification
scheme to the CDB), whereas faceted search engines catalogue a certain type of item pre-chosen by the
10 Kartoo’s results, not shown for brevity, are similarly haphazard.
11 For now, the reader can assume the results produced are sensible – in §6 EXPERIMENTAL EVALUATION we evaluate the validity of the results produced by the CDB, and find that the CDB produces results that are correlated with United States Fish and Wildlife Service data on fishing popularity in those states.
developer of the catalogue. The CDB presents a new way of creating faceted search engines. By using
the taxonomies themselves in creating the CDBs we introduce a level of flexibility and scope of
application that is materially beyond that of present systems.
As should be evident from the discussion above, the CDB’s results are aggregates for various
categories, and are not record-oriented. For instance, for the term “wind power”, a traditional search
engine would lead the user to the document “Wind power for the home”. In contrast, a CDB
exploration would suggest that “wind power” is a greater concern in Idaho than it is in Mississippi, as it
is more prevalent in documents in the Idaho category than it is in documents in the Mississippi category.
For a thorough comparison of aggregate versus record-oriented document search results, see [31]. In
this paper we aim to show that useful category-comparative aggregate information, heretofore
unobtainable, can be delivered by an appropriately constructed CDB.
Our work can, in some sense, be seen as partially analogous to work in the field of Online Analytic
Processing (OLAP) [33, 42]. In OLAP, structured data is organized by category and aggregated: for
instance, to find total sales by product, customers by city, or complaints by franchise outlet. OLAP
technologies are widely deployed in practice – examples include Microsoft Excel PivotTables and
PivotCharts, Microsoft Access CrossTab reports, SAS JMP Overlay Charts, and similar aggregate charts
in special purpose reporting tools like Crystal Reports, Cognos, Oracle OLAP, Hyperion Analyzer (now
part of Oracle), Microsoft Data Analyzer, and other products. Rather than providing tabular and
graphical aggregates of structured (tabular) data, CDBs produce category-by-category aggregate
statistics from document collections. Producing aggregate statistics from unstructured text data is not
new. For example, a pioneering system by Spangler et al. [6, 53, 121, 122], computes aggregate
statistics for various categories and features of documents in a document collection in order to provide
exploratory insights into the constitution of the document collection, and in order to infer relationships
between categories found in the document collection. Their process, which is implemented in IBM
Unstructured Information Modeler12, involves first finding a subset of documents to analyze (e.g.
problem tickets relating to a particular brand). Summary statistics – word and phrase occurrence counts
– are then computed for each document in the collection. The documents are clustered using a variation
of the k-means algorithm, and one or more user-editable taxonomies are created which categorize the
document set. Finally, the relationships between various categories are analyzed to determine
correlations – for example to determine whether a particular call center representative has substantially
more unresolved tickets than their colleagues, or to determine whether particular product complaints are
associated with a particular reseller. The process used to construct CDBs is markedly different. CDBs
are constructed by populating each category in a user-supplied taxonomy with the n most relevant
documents for that category. The categories can then be ranked based on the number of hits in each
category for a user-specified search phrase. CDBs allow users to rank categories by hit-count or term
frequency for a chosen phrase, using a limited set of only the most highly relevant documents for each
category as source data. Using only the most relevant documents from each category allows the user to
compare categories against each other, using an arbitrary comparison metric chosen by the user (e.g. a
user might compare the relative frequency of a particular word or phrase, across categories).
12 http://www.alphaworks.ibm.com/tech/uimodeler
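The ranking step described above – comparing categories by the relative frequency of a user-chosen phrase across each category's most relevant documents – can be sketched as follows. This is a minimal illustration under the assumption that the CDB is held as a mapping from category name to document texts; it is not the authors' implementation.

```python
import re

def relative_frequency(documents: list[str], phrase: str) -> float:
    """Occurrences of `phrase` per 1,000 words across a category's documents."""
    total_words = sum(len(doc.split()) for doc in documents)
    if total_words == 0:
        return 0.0
    hits = sum(len(re.findall(re.escape(phrase), doc, re.IGNORECASE))
               for doc in documents)
    return 1000.0 * hits / total_words

def rank_categories(cdb: dict[str, list[str]], phrase: str):
    """Rank the categories of a CDB {category: [document text, ...]}
    by relative frequency of the comparison phrase."""
    return sorted(((c, relative_frequency(docs, phrase)) for c, docs in cdb.items()),
                  key=lambda t: t[1], reverse=True)
```

Because each category holds only its few most relevant documents, the metric compares categories on a roughly equal footing rather than rewarding categories that simply have more documents.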
4. PROCESS
The process of constructing and using a Categorized Document Base can be separated into the following
phases:
1. Creating a classification scheme
2. Populating the classification scheme with documents
3. Creating an aggregate search term (comparison metric)
4. Determining the aggregate results for each category
5. Integrating the aggregate results per category with external data for the categories
6. Collaborative annotation of CDB results, if users in a team wish to share with each other their comments on particular categories of interest they have found
We discuss each of these phases in depth in the following subsections:
4.1 Creating a classification scheme
Classification schemes are often referred to as hierarchies, taxonomies, or coding schemes. We
will use these terms interchangeably in this paper. A classification scheme can be created by
importing any existing taxonomy, or defining a new taxonomy13. Table 1 provides examples of
some popular classification schemes. The administrator of the CDB may use one of the
provided classification schemes, or may import or create a proprietary classification scheme,
such as a product hierarchy from a product catalogue. Use of a standard classification scheme is
helpful as other data providers typically provide statistics using codings from such classification
schemes, and these statistics can be easily integrated with category-specific aggregates from the
CDB (see Section 4.5 later). For example, profit margins by industry, coded according to the NAICS classification, are provided by the United States Internal Revenue Service, and could be integrated with aggregate document statistics for each industry from the CDB. Popular taxonomies can be organized by type (e.g. product hierarchies, industry hierarchies, place hierarchies, activity hierarchies, time hierarchies, …) to aid in targeted exploration (see Table 1).
13 Though, as we will see later (§4.2), certain taxonomies are more amenable to analysis by a CDB – specifically, taxonomies with relatively unambiguous category descriptors are better suited to analysis using a CDB – while others are, at the present time, not.
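A classification scheme of the kind described in this phase is simply a tree of named categories. The sketch below, with illustrative (not NAICS-accurate) labels, shows one minimal way such a scheme could be represented and traversed when feeding categories to the CDB.

```python
class Category:
    """A node in a classification scheme (taxonomy): a name plus sub-categories."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def walk(self):
        """Yield every category in the scheme, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

# An illustrative industry hierarchy (labels are examples only):
scheme = Category("Industries", [
    Category("Agriculture, Forestry and Fishing", [
        Category("Fishing"),
        Category("Forestry"),
    ]),
    Category("Manufacturing"),
])
```

Iterating `scheme.walk()` visits every category exactly once, which is the iteration order needed in the population phase (§4.2).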
4.2 Populating the classification scheme with documents
The Categorized Document Base is created by populating the classification scheme with relevant
documents. Our method involves iterating through every category in the classification scheme,
and providing the category name, or other identifying features of the category14, as a search term
to a standard Information Retrieval tool or search engine (such as Yahoo, Google, Lycos,
Medline, Lexis/Nexis, etc.). The search engine returns a list of matching documents, and the
most relevant n documents returned by the chosen search engine are stored under that category
in the CDB. The full text of each document is stored. Note that a document may be assigned to
more than one category, if it is relevant to more than one category. However, search engine
rankings usually ensure that the documents that are most relevant to a given category do not
appear in the top results for any sibling category.

14 As category names may be ambiguous, it is preferable that additional identifying features of the category be specified, or some other
mechanism be employed to obtain documents for the correct category. For example, the place “Reading” in Pennsylvania, in a
location taxonomy, is different from the category “reading” in an activity taxonomy, so documents from these two categories should
not be mixed. Supplying the search term “Reading” to, for instance, Yahoo, would typically return documents from both categories,
whereas the CDB should only store documents relevant to the place “Reading” when populating documents for “Reading,
Pennsylvania” into the CDB. One method for disambiguation is to append the parent category name to the child category name: for
example, “reading” becomes “Reading, Pennsylvania” or “reading activity”. Capitalization may be used to distinguish a proper noun
from a common noun, and using a quoted phrase to ensure co-occurrence of the city and state names reduces ambiguity further. Also,
because modern search engines often consult recent search history for the purposes of results personalization and query
disambiguation (see Google’s United States Patent Application 20050222989), the results returned to the CDB are likely to be further
disambiguated as the CDB repeatedly requests search results for similar categories. A vast number of other disambiguation
techniques exist [3, 62, 116] and can be used.

Figure 5 shows schematically how categories
in a taxonomy are populated with documents to create the CDB: in this example, a taxonomy of
markets (industries) is populated with the most relevant documents for each industry.
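The population step described above can be sketched as follows. The `search_engine` callable and the toy two-document corpus are stand-ins of our own devising, purely for illustration; a real deployment would call an Information Retrieval tool such as Yahoo or Google:

```python
# Sketch: populate each category of a classification scheme with the n most
# relevant documents returned by a search engine. The toy_search function and
# TOY_CORPUS below are hypothetical stand-ins for a real search engine.

def populate_cdb(categories, search_engine, n=10):
    """Store the n most relevant documents (full text) under each category."""
    cdb = {}
    for category in categories:
        hits = search_engine(category, n)   # ranked list of documents
        cdb[category] = hits[:n]            # a document may appear under several categories
    return cdb

def toy_search(query, n, corpus=None):
    """Toy stand-in for a search engine: rank documents by query term frequency."""
    corpus = corpus if corpus is not None else TOY_CORPUS
    ranked = sorted(corpus, key=lambda doc: -doc.lower().count(query.lower()))
    return ranked[:n]

TOY_CORPUS = [
    "Blacksburg Virginia is home to Virginia Tech. Blacksburg hosts many events.",
    "Philadelphia Pennsylvania is the largest city in Pennsylvania.",
]

cdb = populate_cdb(["Blacksburg", "Philadelphia"], toy_search, n=1)
```

Each category ends up holding only the top-ranked documents for that category, which is what allows the aggregate statistics described later to be meaningful per category.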
The reader may be concerned that web-pages are long and information on multiple
categories may co-occur within the same web-page, potentially leading to the page being
multiply classified, and requiring the use of short snippets of text from each page to ensure that
the text in the document set for a category relates only to that category. While this concern is
reasonable, it should be noted that, by nature, our process populates each category only with the
few most relevant documents for that category, from a search engine. Each document in that
category should be wholly (or at least predominantly) relevant to the category provided the
category name is unambiguous and sufficient content exists on the internet for that category to
allow the search engine to easily retrieve highly relevant documents for that category. Consider
an algorithm like Google’s PageRank [7], retrieving documents on the category “Blacksburg,
Virginia”. Documents that have content on Blacksburg, Virginia alone and have inbound
hyperlinks referring to Blacksburg, Virginia alone will have a higher rank than documents that also
discuss Philadelphia, Pennsylvania or have inbound hyperlinks from pages about Philadelphia, Pennsylvania.
Thus, by using only the top-ranked search engine hits for each category, our process typically
ensures that each document for that category predominantly refers to that single category, and
therefore the use of snippets is ordinarily not required.
Having said that, we must qualify our explanation by pointing out that a requirement of a
well-constituted CDB is that categories are unambiguous and sufficient documents exist for even
obscure categories. This is because, when either the category name is ambiguous, or insufficient
documents exist for obscure categories, the documents in the document base for a category could
contain irrelevant information, or information for multiple categories. Consider the United
States Patents and Trademarks Office (USPTO) patent classification scheme, which contains
approximately 155,000 classes and subclasses. Our analysis of the USPTO scheme revealed
approximately 2,650 ambiguous categories: sub-categories of the same name that occurred under
different parent categories. Major offenders were “CLASS-RELATED FOREIGN
DOCUMENTS”, “MISCELLANEOUS”, “PLURAL”, “ADJUSTABLE”, “PROCESSES”, and
dozens of others, each of which appeared under multiple parent categories. Clearly, these
category names would have to be disambiguated in order to intelligently populate the category
only with relevant documents for the correct sub-category. For obscure categories, the content
available on the specific category is sufficiently sparse that only a few documents mention the
category, and those that do also mention other sibling categories. In the USPTO scheme, which
contains dozens of obscure categories (e.g. “Using mechanically actuated vibrators with pick-up
means”), a general-purpose search engine such as Yahoo is unable to find documents highly
relevant to the specific category and yields, instead, pages containing excerpts from the USPTO
scheme itself, mentioning dozens of other USPTO categories, and not particularly relevant to
any one USPTO category. It can be concluded that use of the USPTO scheme with a general-
purpose search engine, such as Yahoo, will not yield a well-constituted CDB, as the USPTO
scheme does not supply unambiguous category descriptors, and, further, a general-purpose
search engine will not find sufficient relevant documents for obscure USPTO categories.
Turning to another classification scheme, the United States General Services Administration
Geographic Locator Codes (US GSA GLC) list of locations, we found the problems encountered
above with the USPTO scheme to be surmountable. While many ambiguous category names
exist in the US GSA GLC list of locations – for example, there are cities named “Philadelphia”
in PA, IL, MO, MS, NY, and TN – this was easily rectified by simply appending the state name
to the location name in order to disambiguate. Furthermore, a small sampling of various obscure
towns indicated that sufficient pages exist on the internet (e.g. local tourism pages for those
towns) to yield highly relevant results when we requested the top few hits for the specific town
and state on a general-purpose search engine such as Yahoo. We therefore proceeded with the
use of the location classification scheme as a viable taxonomy on which to conduct further
experiments to more fully assess CDB usefulness and robustness. As mentioned, we were
careful to always include the state name with the city or town name, when populating each
category, to reduce ambiguity.
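The appending approach just described can be illustrated with a small helper (the function name and the example city/state pairs are ours, for illustration only):

```python
# Sketch: disambiguate a child category by appending its parent category name,
# quoting the phrase so the search engine requires co-occurrence.

def disambiguated_term(child, parent):
    """Quoted phrase combining child and parent category names, so that e.g.
    the city 'Philadelphia' in Mississippi is not confused with the one in PA."""
    return '"%s, %s"' % (child, parent)

terms = [disambiguated_term(city, state)
         for city, state in [("Philadelphia", "Mississippi"),
                             ("Reading", "Pennsylvania")]]
```

Each resulting phrase can then be passed directly to the search engine when populating that category.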
As we have seen, in the absence of sufficient content relevant to a specific category (as in the
USPTO case described above), general purpose search engines cannot populate each category
with wholly relevant documents. Furthermore, it can in certain circumstances be a challenge
(again as in the USPTO case), to provide unambiguous search phrases for each category,
especially when tens of thousands of categories need to be populated. This can result in
irrelevant documents being included in the document set for a category. A human user might
consider removing documents from categories, editing the documents, or reassigning them to
different categories, if they believe the document was incorrectly assigned, is from an unreliable
source, is inaccurate, or inappropriate to that category for some other reason. For example, a
user may remove document 6780.html from the “Chicago” sub-category of “Illinois”, if they
notice the document pertains to the movie “Chicago”, rather than to the place. In another
example, a user may notice that document 13581.html has been assigned to both categories
“Tilapia” and “Trout”. On further investigation, the user realizes that the document is merely an
alphabetic grouping of fish, and therefore contains content relevant to more than one type of
fish. The user could decide to remove the document from the CDB, and replace it with two
edited versions – one that refers only to tilapia, and another that refers only to trout – so that the
trout-related content of the document doesn’t impact the “Tilapia” category, and vice versa.
Similarly, the user may extract advertisements or sponsor links from a web-page document, if
the user believes these items introduce content not relevant to the category.
While human editing of categories is permissible, it has its drawbacks. Firstly, it is
extremely time-intensive, particularly when thousands of categories must be edited. Secondly, if
not done in a systematic and consistent fashion, different biases can be introduced into different
categories in the document set. We therefore do not advise human editing of categories. Rather
we recommend that, prior to any taxonomy being fed to the CDB, the taxonomy first be
assessed, and any taxonomy with ambiguous or obscure categories be shunned, to avoid the need
for manual human editing. In our experiments (Section 6), we avoided the need for human
editing by using a classification scheme that could be unambiguously populated to a satisfactory
extent without any manual intervention. Specifically, by automatically appending the state name
to the location name for every location, we could ensure that, for instance, results pertained to
the location “Chicago, Illinois”, rather than to the movie “Chicago”. Further, for the reasons
described above, we could be confident that the few top ranking documents for each category
were highly relevant to the category alone, did not contain mention of sibling categories, and
therefore did not need to be manually excerpted (‘snippeted’) nor edited for relevance.
4.3 Creating an aggregate search term (comparison metric)
The Categorized Document Base is queried by specifying a search term (which can be thought of
as a ‘comparison metric’), and optionally some additional parameters. The search term typically
consists of one or more words, and aggregate statistics computed for the search term in each
category of documents allow the user to compare categories. In a simple case, the user asks the
CDB to compile aggregate statistics for each category that show the relative frequency of the
search term in each category. Composite search terms can be created so that counts tally hits on
any of the terms, all of the terms, or only the exact phrase (i.e. the terms in that specific
sequence).
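The three composite-term modes can be sketched as follows, assuming plain text and case-insensitive whole-word matching (the mode names are our own):

```python
# Sketch: count hits for a composite search term in one document.
# mode="any"    : total occurrences of any of the terms
# mode="all"    : total occurrences, counted only if every term appears at least once
# mode="phrase" : occurrences of the terms as one exact phrase, in sequence
import re

def count_hits(text, terms, mode="any"):
    text_lower = text.lower()
    counts = [len(re.findall(r"\b%s\b" % re.escape(t.lower()), text_lower))
              for t in terms]
    if mode == "any":
        return sum(counts)
    if mode == "all":
        return sum(counts) if all(c > 0 for c in counts) else 0
    if mode == "phrase":
        phrase = " ".join(t.lower() for t in terms)
        return len(re.findall(r"\b%s\b" % re.escape(phrase), text_lower))
    raise ValueError("unknown mode: %r" % mode)
```

For example, `count_hits(doc, ["foam", "reduction"], "phrase")` counts only occurrences of the exact phrase "foam reduction".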
There are a variety of known ways to augment a search term to be used for purposes of
searching with an Information Retrieval engine [36]. These techniques are often known as query
expansion (or query augmentation). The expansion is typically intended to improve precision
and/or recall, by finding “hits” that do not match the literal search term. For instance, for the
search term “dry” a user may be offered the following expansions:
Synonyms: exsiccate, dehydrate, dry up, desiccate
Antonyms: wet, moisten, wash, dampen
Related words: desiccant, drier, drying agent, siccative
Stems / truncations (e.g. “dry”), and derived / inflected forms (e.g. “dried”, “drier”)
Troponyms / Hypernyms / Hyponyms: a troponym is a word that denotes a manner
of doing something – for example “dehydrate” is a manner of “drying”, so it may be
helpful to search on “dehydrate” when searching on “dry”. A hyponym is a word
which denotes a subclass of a superclass: for example, “freeze drier”, “vacuum drier”,
“spray drier”, and “oven” are all hyponyms of “drier” since they are all types of
drier. The word denoting the superclass (“drier”) is called a hypernym.
Meronyms / Holonyms: a meronym is a word that names a constituent part of a larger
item. The word for the larger item is called a holonym. E.g. “fan” is a meronym of
“oven” as a fan is a constituent part of an oven. Similarly, “oven” is a holonym of
“fan”. As fans may be used for drying, and fans are part of ovens, a user searching
on “drying” may also be interested in “fan” and “oven”.
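The expansion types above can be illustrated with a toy lexicon standing in for a real lexical database such as WordNet (the entries shown are illustrative and far from exhaustive):

```python
# Sketch: query expansion against a small hand-built lexicon. A real system
# would consult WordNet or a similar lexical database instead of LEXICON.

LEXICON = {
    "dry": {
        "synonyms": ["dehydrate", "desiccate", "exsiccate"],
        "related": ["desiccant", "drying agent", "siccative"],
        "inflected": ["dried", "drier", "drying"],
    },
}

def expand(term, kinds=("synonyms", "related", "inflected")):
    """Return the term plus any available expansions of the requested kinds."""
    entry = LEXICON.get(term, {})
    expansions = [term]
    for kind in kinds:
        expansions.extend(entry.get(kind, []))
    return expansions
```

The expanded term list can then be fed to the composite-term counting described in the previous subsection, so that hits on "dehydrate" also count toward a search on "dry".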
It is also desirable that the user select the specific sense of the word that they wish to search.
This is known as query disambiguation. For instance, ‘dry’ has, inter alia, the following
different senses: ‘lacking moisture’ (as in “dry clothes”), ‘ironic or wry’ (as in ‘dry humor’), or
‘having a large proportion of strong liquor’ (as in ‘dry martini’) [38]. In the absence of a sense-
sensitive search engine, a monosemous synonym – i.e. a synonymous word that has only a single
sense – can be chosen, provided the synonym is in sufficiently popular use. For example
‘dehydrate’ is preferable to ‘dry’, as the latter is highly polysemous. Though ‘dehydrate’ is less
commonly used than ‘dry’, it is still in sufficiently popular use that we can expect to regularly
find hits for it in a document collection. Compare to ‘siccative’ which is monosemous but rarely
used (seldom found in a given document collection), and may therefore not be appropriate as an
alternative search term to ‘dry’.
The user is able to perform both query expansion and query disambiguation in our prototype
via a run-time interaction we have provided with the WordNet lexical database [38]. Figure 6
shows a pop-up screen from our prototype software which allows the user to perform query
disambiguation or query expansion for their chosen search term using WordNet. As Figure 6
illustrates, the user is shown synonyms and related terms for their search term, to allow them to
choose a less ambiguous search term in the event that their chosen term is a highly ambiguous
term. For example, a user contemplating the use of the term “dry” (which has many senses –
e.g. “dry skin” vs “dry humor”), might instead choose to use the related term “desiccant” which
is less ambiguous. As mentioned before, a caveat, though, is that “desiccant” is comparatively
rare, and perhaps less likely to produce a significant number of hits if, for instance, document
authors prefer the term “drying agent” to “desiccant”. An ideal search term is one that is both
unambiguous and in common usage.
While we provide run-time, user-driven disambiguation and query expansion facilities via
WordNet, as shown in Figure 6, we do not currently prescribe nor provide any additional
automatic means for semantic-sense-sensitive (context sensitive) search, though many are
available: see, for example, the literature on Word Sense Disambiguation (WSD) [3, 62, 116].
In our experiments, reported in Section 6, we relied on appropriate word choice (i.e. choice of
monosemous and commonly-used words) by the end-user with computer-assistance, using our
integrated WordNet feature as illustrated in Figure 6, where necessary.
It might be suggested that a possible, though labor intensive, means of ensuring that only
documents pertaining to the correct sense of the category name are associated with that category,
is to have human readers, skilled in linguistics, manually remove documents that pertain to
homonyms (i.e. different senses of a word or phrase that share the same spelling). However,
manual intervention quickly becomes impractical, given the enormous number of documents in
the CDB, and use of our computer-assisted word choice facility (Figure 6), or supplemental
automatic word sense disambiguation (WSD) techniques, as suggested above, is preferable.
4.4 Determining the aggregate results for each category
To create aggregate statistics for all categories in the CDB, the search terms from the previous
section are compared to the documents in each category. Figure 7 shows the basic process: a
search term, in this case, “foam reduction”, is run against all documents in each category – in
this case, only the “Inks” sub-category (under the “Printing” category), has hits.
A number of basic statistics can be computed for every category in the classification scheme:
total number of hits (word / phrase hits) in top n documents for category
number of documents with one or more hits, amongst top n documents in that category
hits per thousand words (a.k.a. “relative term frequency”), for top n documents in that
category
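These three statistics might be computed for one category's document set roughly as follows (a sketch assuming plain-text documents, whitespace tokenization, and simple substring counting; a production system would tokenize properly):

```python
# Sketch: basic per-category statistics for a search term over the top-n
# documents stored under one category of the CDB.

def category_stats(docs, term):
    term = term.lower()
    total_hits = sum(d.lower().count(term) for d in docs)          # word/phrase hits
    docs_with_hits = sum(1 for d in docs if term in d.lower())     # documents with >= 1 hit
    total_words = sum(len(d.split()) for d in docs)
    per_thousand = 1000.0 * total_hits / total_words if total_words else 0.0
    return {"hits": total_hits,
            "docs_with_hits": docs_with_hits,
            "hits_per_thousand_words": per_thousand}
```

Running this once per category yields the per-category rows from which tables such as Table 2 can be assembled.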
Table 2 shows the absolute number of hits for the terms “network”, “monitoring”, “devices”,
etc., in the document categories “Minibuses”, “Busses”, “Automobiles and Cars”, etc., for a
sample CDB. Shading is used to indicate where the word occurs with unusually high or
unusually low frequency in the category.
More advanced aggregate statistics can also be created. For example, we can compute the
relative prevalence (a.k.a. lift) of the search term in that subcategory, as compared to similar
categories. A lift of 2 indicates that a word is two times as prevalent in the current category as it
is in other categories chosen for comparison – i.e. it is found two times as often as expected. A
lift of 1 indicates the word is as common as expected: its prevalence is the same in that category
as it is on average in the other categories chosen for comparison. A lift of ½ indicates the word
is half as common as expected. Lift is a useful indicator of interestingness [96]. For example,
common words like “small” may have a high number of absolute hits in a category, but this may
not be interesting, as the relative prevalence, when compared to other categories, may not be
significant or unusual, if the other categories also have a similar number of absolute hits for
“small”.
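Lift, as described, reduces to a simple ratio of the category's prevalence to the average prevalence across the comparison categories (a sketch; prevalences here would typically be hits per thousand words):

```python
# Sketch: relative prevalence (lift) of a search term in one category,
# compared to the average prevalence across a set of comparison categories.

def lift(category_prevalence, comparison_prevalences):
    expected = sum(comparison_prevalences) / len(comparison_prevalences)
    return category_prevalence / expected if expected else float("inf")
```

A result of 2 means the term is twice as prevalent as expected, 1 means as prevalent as expected, and 0.5 means half as prevalent as expected, matching the interpretation given above.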
Formal tests of statistical significance, such as chi-squared tests, can also be conducted to
determine whether the relative prevalence (difference of actual prevalence from expected) is
statistically significant. A category is ‘interesting’ or ‘unusual’ if it has significantly greater
prevalence of the search term than expected or, alternatively, if it has significantly lower
prevalence of the search term than expected. For example, documents in the category
Philadelphia may be interesting as “murder” is mentioned more frequently than in documents
pertaining to other cities in Pennsylvania. Similarly, documents in the category Erie may be
interesting as “murder” is mentioned comparatively less frequently than for other cities in
Pennsylvania.
Figure 8 shows the relative prevalence of the terms “smoothness”, “strength”, and [“wet” or
“damp”], in various segments of the stone quarrying industry. Three bars are shown for each
industry: from left to right, the three bars for that industry are “smoothness” for that industry,
“strength” for that industry, and [“wet” or “damp”] for that industry. As shown by the left-most
bar for each industry, “smoothness” is mentioned almost twice as frequently in Crushed and
Broken Limestone mining, compared to the other segments. “Strength” (the middle-bar for each
industry in the chart) is mentioned almost twice as frequently in Dimension stone mining
compared to the other stone quarrying industry segments. Finally, looking at the right-most bar
for each industry in the chart, we see that [“wet” or “damp”] is mentioned half as frequently in
Crushed and Broken Granite Mining as in the other segments. The baseline (i.e. average
absolute number of hits out of total words in the documents) in each category for each search
term is obviously relevant, since a moderate lift, off a low baseline (e.g. a baseline of one or two
absolute hits out of thousands of words in the documents), would not be statistically significant.
It is therefore important that chi-squared tests, of the relevant degree of freedom, be applied to
ascertain whether the lift is statistically significant given the baseline. In Figure 8, the various
baselines for each term are omitted for readability, but different shading is used to indicate lift
that is statistically significantly higher than expected (at the 95% confidence level), or statistically
significantly lower than expected (at the 95% confidence level).
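For a single term and category, the chi-squared test mentioned above can be set up as a 2x2 contingency table of hit words versus other words, inside versus outside the category. A self-contained sketch (compare the statistic against 3.841, the 95% critical value at one degree of freedom; a production system might instead call a statistics library):

```python
# Sketch: Pearson chi-squared statistic for a 2x2 table comparing a term's
# frequency in one category's documents against the remaining categories.

def chi_squared_2x2(hits_cat, words_cat, hits_rest, words_rest):
    table = [[hits_cat, words_cat - hits_cat],
             [hits_rest, words_rest - hits_rest]]
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    n = float(sum(row_totals))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

This also captures the baseline caveat in the text: a moderate lift off a tiny baseline produces a small statistic and is not flagged as significant.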
Aggregate statistics for any parent category (that is, a category that has sub-categories), can
be obtained either from:
1. the document collection for that parent category (e.g. a document collection obtained
by finding all documents relevant to parent category “Pennsylvania”), or from
2. aggregates of statistics from the document collections of its children (e.g. aggregates
of statistics from all documents relevant to child categories “Philadelphia”,
“Pittsburgh”, “Erie”, etc. which are children of the parent category “Pennsylvania”)
Both types of statistics are interesting, since the former is obtained from documents directly
related to the parent category, and the latter is obtained from documents which relate to
descendants (i.e. children, grandchildren, etc.) of that category.
Figure 10 shows a hierarchical drill-down view of hits per category, for the word
“biodegradable” across various categories in the United Nations Standard Products and Services
Code (UNSPSC)15.
The absolute number of hits can also be normalized or standardized in various ways. For
example, a large number of hits for “dogs” in Los Angeles CA, as compared to, say, Blacksburg
VA, is unsurprising, as Los Angeles CA has a substantially higher population. Normalizing the
hits, by dividing by the population size in this case to produce per-capita popularity, can provide
an alternative statistic for comparison.
15 http://www.unspsc.org/
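Per-capita normalization of hit counts is a one-line transformation (the city populations below are invented round numbers, for illustration only):

```python
# Sketch: normalize absolute hit counts by population, yielding hits per
# 100,000 residents, so large and small places can be compared fairly.

def per_capita_hits(hits_by_city, population_by_city, per=100_000):
    return {city: per * hits / population_by_city[city]
            for city, hits in hits_by_city.items()}

pc = per_capita_hits(
    {"Los Angeles, CA": 3900, "Blacksburg, VA": 45},      # hypothetical hit counts
    {"Los Angeles, CA": 3_900_000, "Blacksburg, VA": 45_000})  # hypothetical populations
```

In this contrived example the raw counts differ by a factor of nearly 100, but the per-capita figures are identical, illustrating why normalization can change the comparison entirely.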
Statistics can also be calculated for combinations of categories taken from different
classification schemes, in much the same way as On-Line Analytical Processing (OLAP)
statistics are calculated on structured data. For example, taking a CDB of documents
categorized by both Place and Time, we could, for instance, find the statistics for “New York,
September 2001” documents (i.e. documents that appear both under “New York” and under
“September 2001”). This set of documents would have high hits for “trade center”, compared to
documents for, say, “Boston, January 1992”, as September 2001 was the time of the World
Trade Center attacks in New York.
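Treating each classification scheme as a dimension, such a cross-scheme slice reduces to a set intersection over document identifiers (a sketch; the CDB stores full text, so identifiers stand in for documents here):

```python
# Sketch: OLAP-style slice over a CDB categorized along two dimensions
# (e.g. Place and Time): documents appearing under both categories.

def combined_category_docs(cdb_by_place, cdb_by_time, place, time_period):
    return set(cdb_by_place.get(place, [])) & set(cdb_by_time.get(time_period, []))

docs = combined_category_docs(
    {"New York": ["d1", "d2", "d3"]},           # hypothetical Place dimension
    {"September 2001": ["d2", "d3", "d4"]},     # hypothetical Time dimension
    "New York", "September 2001")
```

Aggregate statistics (hits, lift, and so on) can then be computed over the intersected document set exactly as for a single-dimension category.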
More sophisticated numerical scores for each category can also be computed, using the data
in the document base alone, or in conjunction with other data sources. In general, a numeric
score for a category is any quantitative measure that can be derived from the contents of (i.e.
documents in) that category, or from, or in combination with, a structured data source that
associates that category with some statistic (e.g. ‘population’ is a statistic for the category “Los
Angeles, CA”, that can be obtained from a structured data source).
For our software implementation, we initially implemented a Microsoft Excel interface,
which would download a comma-separated-value text file from the server farm, and allow the
results to be viewed graphically, using tree views and charts. The Excel interface included
intuitive expandable and collapsible tree-views of the various taxonomies, to allow easy
visualization of the aggregate statistics for each category in the CDB by end users. Figure 9
below shows the Excel-based interface – in this case the user, a molecular engineer from a
chemical company, is viewing the hits for the phrase “foam reduction” amongst a number of
product categories in a taxonomy of different product types, in an attempt to identify relevant
product applications for a new foam reducing surfactant she has developed. Categories which
have more hits than a defined threshold are shown shaded: in this case, the threshold is
arbitrarily defined as 2 hits per category. Our Excel interface was eventually retired, in favor of
a web-based interface (see Figure 10).
4.5 Integrating the aggregate results per category with external data for the categories:
A new kind of ‘mash up’
‘Mash ups’ are web-based services that weave data from different sources together, creating an
interesting and useful report from the synthesized data [14]. For example, data on coal reserves
by state from a public data source – such as the National Mining Association – can be integrated
with geographic data (e.g. from a mapping service like Google Maps) to create a visual map
showing which states have the highest coal reserves. Similarly, oil output for each state – for
example, obtained from the US Energy Information Administration – could be overlaid onto the
map to visually illustrate which states supply the most energy from fossil fuels.
Categorized Document Bases represent a new source of data on categories, and therefore
provide an additional source of data for mash-ups. A simple example of creating a mash-up
from a CDB would be to take the hits for ‘coal’ and for ‘oil’ in the document sets stored for
multiple states and integrate this with geographic data to create a visual map of how frequently
these terms are mentioned in the documents stored for each state.
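Such a mash-up is, at its core, a join on the shared category scheme. A sketch (the hit counts and reserve figures below are invented for illustration; a real mash-up would pull them from the CDB and a public data source respectively):

```python
# Sketch: join per-category CDB hit counts with an external structured data
# source keyed by the same category scheme (here, state names).

def mash_up(cdb_hits, external_stats):
    rows = []
    for category in sorted(set(cdb_hits) & set(external_stats)):
        row = {"category": category, "hits": cdb_hits[category]}
        row.update(external_stats[category])
        rows.append(row)
    return rows

report = mash_up(
    {"Wyoming": 120, "Texas": 80},                       # hypothetical 'coal' hits from the CDB
    {"Wyoming": {"coal_reserves": 42},                   # hypothetical external figures
     "Texas": {"coal_reserves": 12}})
```

The joined rows can then be handed to a charting or mapping component for visualization.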
Let us consider a more sophisticated example, which illustrates the mashing of data from
both an unstructured source (the documents in the CDB) and a structured source (a spreadsheet).
Assume we have a CDB populated with document sets for multiple industries. For market
research purposes, we may want to glean information on those industries from the CDB, and show
it alongside information on those industries from other sources. Figure 15 gives a simple
example of such a ‘mash up’: here, a molecular engineer has constructed a bubble-chart, using
our Excel-based prototype implementation16, to explore possible applications of a new
biodegradable compound her company has developed. In Figure 15, the Y-axis is the relative
prevalence (lift) for the search term “biodegradable” in each of the two industry categories
“Oilseed Processing” and “Plastics Packaging” – this data has been obtained from the CDB. As
is evident, Oilseed Processing has far greater relative prevalence for the search term. The asset
turnover and revenues for each industry were obtained from an external structured data source –
specifically, a spreadsheet obtained from the Internal Revenue Service (IRS) – and plotted as the
X-axis and bubble size respectively. From Figure 15 it appears that Oilseed Processing is a
relatively small industry, by revenue, compared to Plastics Packaging. Thus, while Oilseed
Processors are apparently very interested in biodegradable molecules (as shown by the high
relative prevalence of the term “biodegradable” in the document set for that category), sales into
that industry may not be lucrative, given its relatively small revenues. The ‘mash-up’ of
information from the CDB with information from the IRS has yielded thought-provoking
insights into the industries shown. Note that the CDB serves only as a useful heuristic for more
quickly finding possible solutions – further manual study is typically required to validate or
eliminate suggested solutions. In one of our commercial trials, molecular engineers and business
development managers at a chemical company considered 30 industries identified as most
promising by the CDB: the company was already operating in 7 of the identified industries; 12
industries were previously known, but neither the company nor its competitors operated in them,
as they had already been found to be unviable; 3 industries were previously unknown, but further
investigation showed they were infeasible; 3 industries were previously unknown and feasible,
but deemed not promising; and 5 industries were previously unknown and deemed highly
promising.

16 Excel was used as it allowed us to easily create bubble charts, and it also allowed us to easily integrate financial data for various industries with the aggregate statistics for those industries from the CDB.
The examples above illustrate that CDBs can be used to compose interesting ‘mash-ups’:
profound insights into the relationships between categories can potentially be illustrated by
showing aggregate hit results by category (from the unstructured text documents in the CDB)
alongside structured data (e.g. from databases or tabular text files) that are organized according
to the same coding scheme. Table 3 lists some examples of structured data, from both private
and public sources, that have been coded according to standard taxonomies mentioned earlier in
Table 1, and can therefore be integrated with the results of queries on the CDB. Given that we
are able to use these coding schemes to cross-reference CDB results for multiple categories with
existing structured data for those categories, abundant opportunities to create new mash-ups
exist.
4.6 Collaborative annotation of CDB results
In our trials with commercial organizations, our clients requested collaborative annotation
facilities, to allow business development managers and chemical engineers to share their
observations on categories of interest. We therefore rebuilt our visualization features, this time
using a web-based interface, and added collaborative annotation features to allow the technology
commercialization team to share their comments on interesting categories discovered by the
CDB. Figure 11 illustrates the collaborative annotation interface implemented for the CDB, and
shows users sharing comments on possible applications of a biodegradable molecule with foam
reduction properties. The CDB exploration and annotation software and interface were code-
named Sizatola, meaning “help find” in Zulu.
5. TECHNICAL IMPLEMENTATION
In this section we describe the system architecture and the data structure for our Categorized Document
Base.
5.1 CDB System Architecture
Due to the data volume of the Categorized Document Base, which exceeds the capacity of a
single machine, we implemented a parallel processing architecture for the CDB, allowing it to be
distributed across a number of machines. Partitioning of the CDB across machines greatly
increases the rate of both document gathering and aggregate statistics compilation. A controller-
servant architecture was employed. The controller machine maintained a list of categories,
including the timestamp at which population last started (if it had begun) and ended (if it had
ended) for that category. In order to maintain data freshness, the documents in the category
could be periodically refreshed, for example, every week. Individual servant machines
requested categories from the controller. The controller assigned the servant a list of categories
to populate. If any of these categories did not populate within 24 hours, the controller allowed
them to be reassigned to another servant. If a category had been reassigned to 3 servants and
still had not populated, it was flagged as problematic, so that a programmer could investigate why
the category was not populating.
Two queues of categories were maintained on the controller: a high-priority queue and a
low-priority queue. If there were any categories in the high-priority queue, these were processed
first; otherwise the low-priority queue was processed. A category was added to the high-priority
queue if a user had attempted a search on that category – the high-priority queue was intended to
ensure that documents for categories that users are specifically interested in were imported to the
CDB as soon as possible. The low priority queue contains categories that no users had yet
requested, but that could conceivably be requested in the future, thereby ensuring that we had a
forward cache that could rapidly satisfy new requests.
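The two-queue dispatch policy can be sketched in a few lines; the queue and function names here are illustrative, not taken from the actual implementation.

```python
from collections import deque

# High-priority queue: categories a user has searched for.
# Low-priority queue: the speculative "forward cache" of categories.
high_priority = deque()
low_priority = deque()

def enqueue(category, user_requested=False):
    (high_priority if user_requested else low_priority).append(category)

def next_batch(n):
    """Hand a servant up to n categories, draining high priority first."""
    batch = []
    while len(batch) < n and (high_priority or low_priority):
        queue = high_priority if high_priority else low_priority
        batch.append(queue.popleft())
    return batch
```

A `deque` gives O(1) removal from the front, which suits the first-come-first-served order within each queue.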
The controller machine was always started first, so that it could listen for requests, for
category lists, from the servants. As each servant was started, it requested a list of categories
from the controller, then populated documents into each of those categories, and notified the
controller as each category was completed. The servant requested new categories once it had
completed all of the categories in its current list. To populate an individual category, the servant
requested the top matches for that category from a search engine (e.g. top 10 documents in the
category, by searching Yahoo), and the servant then retrieved each of these high-ranking
documents, and stored the documents in an indexed database (the database structure is shown in
§5.2). The experiments reported in §6 made use of this indexed database structure. During our
experiments, we noticed that the index structure resulted in significant performance impediments
both in populating the CDB and in querying the CDB. Population using this index structure
occurred at approximately 40,000 categories (400,000 documents) per month. A single phrase
query to gather the aggregate statistics for 40,000 categories consumed 2 to 3 days per query.
We have therefore begun experiments with the storage of documents in simple directory folders,
with one directory folder per category, to determine if population and query performance can be
improved in this manner.
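A servant's population step for a single category, as described above, might look like the following sketch, with the search-engine call and HTTP download abstracted behind injected functions (the real system queried Yahoo and wrote to an indexed MySQL database; here storage is a plain dict):

```python
DOCS_PER_CATEGORY = 10   # top 10 documents per category, as in the paper

def populate_category(category, search, fetch, store):
    """Fetch the top-ranked documents for one category and store them.

    `search(category)` stands in for the search-engine query (returns URLs
    in rank order); `fetch(url)` stands in for the HTTP download.
    """
    urls = search(category)[:DOCS_PER_CATEGORY]
    for rank, url in enumerate(urls, start=1):
        html = fetch(url)
        store[(category, rank)] = (url, html)
    return len(urls)
```

In the real system this loop would also update the category's population timestamps on the controller once complete.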
Returning now to the controller-servant architecture implemented, the optimal number of
categories for a servant to request depends on the size of the controller’s queue, and the number
and speed of the servants. If the controller’s queue is long, and there are a small number of fast
servants, each servant should request many categories, to reduce the number of individual
requests for new category names to populate, to the controller. If the controller’s queue is short,
each servant should request only one category at a time, so as to maximize parallel processing of
this short queue amongst the servants. If too many categories were requested by a single
servant, and there were none left in the controller’s queue, other servants would remain idle
while the overloaded servant churned slowly through its queue. While we implemented only a
simple load-balancing scheme, it is clear that CDB population routines would benefit from
sophisticated load-balancing arrangements – the literature is replete with alternatives [87, 119].
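One possible rendition of this batch-size heuristic is given below. The exact formula is our assumption (the paper states only that a simple scheme was implemented); the key property is that no servant claims more than an even share of a short queue.

```python
def batch_size(queue_length, num_servants, max_batch=100):
    """How many categories a servant should request at once."""
    if num_servants == 0 or queue_length == 0:
        return 0
    # Never claim more than an even share of the queue, so no servant
    # sits idle while another churns through an oversized batch.
    fair_share = queue_length // num_servants
    return max(1, min(max_batch, fair_share))
```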
For robustness, the data on each servant could be replicated on other servants, to ensure that
there is no single point of failure – if a servant goes down, another servant either holds the data, or
a team of servants can gather the data. We did not implement a replication scheme, though one
would be advisable for a production quality system.
When a user entered a search term, wishing to explore document categories and compute
aggregate statistics for each category, the search term was received at the controller. The
controller determined which servants held the documents for each document category that the
user requested. The controller then contacted each relevant servant, and requested the aggregate
statistics for that search term, for the categories which the user requested, and which were on
that servant. This was an asynchronous process, meaning the controller did not block while it
awaited results, and instead continued with other tasks. When the servant had completed the
calculations, it contacted the controller with its results. If a servant went down and later
recovered, it would complete the requests in its queue only if they were recent (i.e. if the
controller had not already reassigned them). Again, a replication scheme is advisable, but was
not implemented. For instance, if
the servant did not complete its calculations within a specified time, the controller should contact
an alternative servant that possessed the data. This would ensure that no bottlenecks would arise
when a servant went down.
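The asynchronous fan-out just described can be sketched with a thread pool: the controller submits a request to each servant holding relevant categories and merges replies as they arrive, without blocking on any one servant. The names below are illustrative; `servants` maps a servant id to a callable standing in for the remote call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fan_out(term, servants, timeout=None):
    """Ask every servant for its aggregate statistics and merge replies."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max(1, len(servants))) as pool:
        futures = {pool.submit(rpc, term): sid for sid, rpc in servants.items()}
        for future in as_completed(futures, timeout=timeout):
            merged.update(future.result())   # servant's per-category stats
    return merged
```

A production version would catch per-servant timeouts and re-route the affected categories to a replica, as suggested in the text.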
A cache of query results (aggregate statistics) was maintained, to speed up repeat queries.
For example, if the number of hits on “warm” for the “United States Cities with State Name”
hierarchy had been recently computed, we only needed to re-compute the statistic for sub-
categories whose document collection has changed since the last query. For the US location
hierarchy (constituting approximately 40,000 categories), a complete cache of the number of
word hits and document hits in all categories for a single search term, stored in a Comma-
Separated-Value (CSV) file or Excel spreadsheet, consumes approximately 2MB of storage space.
The CSV file provides near-instantaneous response times to repeat queries involving the same
search term, in cases where the document-base is unchanged since the previous query. The
query response time for repeat queries increases roughly proportionately17 with the number of
documents that have changed since the last query using that same search term.
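The recompute-only-what-changed caching described above can be sketched as follows, with hit counting reduced to a plain substring scan and a per-category version number standing in for "documents changed since the last query". All names are illustrative.

```python
cache = {}            # (term, category) -> (hit_count, version_counted)
doc_versions = {}     # category -> version, bumped when its documents change

def hits(term, category, text):
    """Return the hit count for `term`, recounting only on document change."""
    version = doc_versions.get(category, 0)
    key = (term, category)
    if key in cache and cache[key][1] == version:
        return cache[key][0]               # near-instant repeat query
    count = text.lower().count(term.lower())
    cache[key] = (count, version)
    return count
```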
In our experiments (see §6), we made use of ten Pentium machines, with 2.6 GHz or greater
processors, 1 GB of RAM, and 500 GB of hard drive space each, totaling approximately 2
terabytes of storage. The population and exploration routines were written in Python, and all
data was stored in a MySQL 5.0 database. We imported the classification schemes shown in
Table 1, and then populated the Categorized Document Base by obtaining the top 10 documents
for each category from Yahoo.com. The taxonomy importation process alone took many weeks,
as each classification scheme was in its own format, and needed to be imported into a standard
tabular format. The full text of all HTML documents was imported. Other document formats,
such as occasional PDF and Word documents, were ignored. After taxonomy importation was
17 The increase is not exactly proportionate since the size of each new document varies, and document size also affects the rate at which term occurrence counts are computed.
completed, we began to populate the various categories with documents. A total of
approximately 240,000 categories (2.4 million documents) were populated, over a period of 6
months in the latter half of 2007. For the experiments reported in §6 we made use of only a
subset of these categories – approximately 40,000 categories in the taxonomy “United States
Cities with State Name” which were gathered using 6 machines over a period of a few weeks in
summer 2008. The remaining taxonomies and categories were used in industrial trials with
client organizations (see §7), or for initial experimentation that was later abandoned18.
5.2 CDB Database Structure
The database tables used by the CDB can be divided into three major areas:
1. Tables used by the Population routines (§4.2) to represent and index the documents
gathered by the CDB – see Figure 12.
2. Tables used by the Query routines (§4.4) to store and cache queries and their results
(aggregate statistics) from the CDB – see Figure 13.
3. Tables used to implement the Collaborative Annotation interface (§4.6) – see Figure 14.
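As a rough illustration only, the population-side tables of Figure 12 might be rendered along the following lines. The real system used MySQL 5.0 with richer indexing; the table and column names here are our guesses, and SQLite is used solely to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    category_id    INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    populate_start TEXT,    -- timestamp population last started
    populate_end   TEXT     -- timestamp population last ended
);
CREATE TABLE document (
    document_id  INTEGER PRIMARY KEY,
    category_id  INTEGER NOT NULL REFERENCES category(category_id),
    rank         INTEGER,   -- position in the search engine results
    url          TEXT,
    full_text    TEXT
);
CREATE INDEX idx_document_category ON document(category_id);
""")
```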
As mentioned earlier, we found that use of a relational database structure by the Population
routines for the representation and indexing of documents resulted in significant performance
impediments. As clients found the 2 to 3 day delay in result compilation to be excessive, we
have begun experiments with the storage of documents in simple directory folders and our initial
results indicate that we can obtain aggregate statistics for 20,000 categories within a few minutes
using a traditional file storage scheme.
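A minimal sketch of the directory-per-category alternative, where an aggregate query is simply a sequential scan over each category's folder (paths and names are illustrative):

```python
import os

def aggregate_hits(root, term):
    """Return {category: hit count} for a directory-per-category CDB."""
    results = {}
    term = term.lower()
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        count = 0
        for filename in os.listdir(folder):
            with open(os.path.join(folder, filename), encoding="utf-8") as f:
                count += f.read().lower().count(term)
        results[category] = count
    return results
```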
18 For example, for the reasons reported in §4.2, we observed that the quality of documents in the US Patent taxonomy was poor, and this taxonomy was therefore abandoned as unusable.
6. EXPERIMENTAL EVALUATION
To evaluate our CDB approach, we set up a number of experiments in varied industries, including
some industries that relate to natural phenomena, and some that relate to commercial phenomena. We
observed that our studies could be segregated into:
population-sensitive studies (§6.1) – such as burgers, pizza, and hotels – where the
phenomenon varies by population
versus
non-population-sensitive studies (§6.2) – such as solar, wind and rain – which are governed
by natural phenomena, rather than driven by human population.
In all experiments, we made use of a taxonomy of United States place names, taken from the United
States General Services Administration Geographic Locator Codes (US GSA GLC) / Geographic
Names Service19. This taxonomy was then populated with the ten most relevant documents for each
place, from Yahoo20. A total of 27,547 individual places were investigated, and the ten most relevant
documents acquired for each. In each experiment, we gathered summary statistics for industry-specific
search terms across these top-ranked documents for each location. We then aggregated the data by
state, and made use of a public, structured data source for each state to validate whether the findings
from the exploration of the document-based data in the CDB were sensible. In all cases we used
Pearson correlation (Pearson’s ρ) [102]21 to determine whether the ranking of states as provided by
aggregate statistics from the textual CDB was correlated with the ranking we obtained for those states
19 http://www.gsa.gov/glc
20 http://www.yahoo.com/
21 We chose to use Pearson’s ρ [102] instead of Spearman’s rank correlation coefficient (Spearman’s ρ) [115, 123], since Spearman’s measure does not cater for ties. Pearson’s ρ is equivalent to Spearman’s rank correlation coefficient when computed on ranks. In many cases, we also computed Kendall’s tau rank correlation coefficient (Kendall’s τ) [71, 72]. However, for brevity, we have not shown Kendall’s τ in our results, as Kendall’s metric consistently showed similar statistical significance to Pearson’s ρ, and so does not provide substantial additional information.
from an alternative structured, quantitative data source, and hence whether the aggregate statistics from
the textual CDB provided a reliable proxy for alternative, widely-accepted quantitative data22.
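This validation step reduces to computing Pearson's ρ between two state orderings: one from the CDB's aggregate statistics, one from the external structured data. A pure-Python sketch, with invented rankings in the usage test:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied to rank vectors, this is numerically identical to Spearman's rank correlation, as footnote 21 notes.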
6.1 Population-Sensitive Studies
We regard population-sensitive studies as those where the phenomenon is likely to vary with
population. To further investigate population-sensitive industries, we selected the top 100
American franchise corporations from Entrepreneur Magazine’s Franchise 500 listing23 for the
year 2008. As some of the franchises were in the same industry, we selected only the largest
franchise from each of the 37 unique industries found24. For each industry, we visited the
website for the largest franchise corporation in that industry, viewed the home-page source code,
and selected up to 4 META tag keywords listed by the company, which described the company’s
main product or products25. We then ran each keyword against the above-described CDB
(populated with documents for each United States location from the US GSA GLC), aggregated
the hits for each keyword by state, and ranked the states from states with the most hits per 1,000
words26 for that keyword down to those states with the least. To determine whether the CDB
ranking correlated with independent data about the popularity of each product in each state, we
visited a commercial data provider, InfoUSA27. For each corporation, we submitted the name of
22 The source data used for our experiments – including summary statistics for each keyword in each state obtained from the CDB, and the statistical calculations we performed – is too large for inclusion here, but is available separately for download from the authors, should the reader wish to validate our findings. Due to copyright restrictions on the quantitative data which we obtained for each state from the external data sources, the authors are unable to redistribute the external data sources. However, we have provided hyperlinks in footnotes in all cases, to allow the reader to obtain the data themselves. Also, owing to copyright restrictions on the Yahoo search results we used to populate the CDB, the authors are unable to make the source documents available, but the reader can compile a similarly-arranged data set using the techniques described in this paper, albeit for a different point in time.
23 http://www.entrepreneur.com/franchise500/
24 In the case of Dunkin’ Donuts (coffee and donuts franchise), ranked 3rd in the Franchise 500, we noticed that the franchise had a predominantly East Coast penetration in the United States, attributable to the franchise’s unique and peculiar roll-out strategy, and we therefore substituted it with Starbucks Corporation, which had a more representative national penetration of stores across all states.
25 In the rare event that META keywords were not available, we manually decided appropriate product keywords for the company.
26 We divided the raw hit count for the state by the total number of words in the documents for that state, to remove population biases which result from some states having more locations, and hence more documents and words, than other states.
27 http://www.infousa.com/
the corporation to InfoUSA, and obtained a count of the number of outlets operated by the
corporation in each state. To remove population biases and attempt to ascertain relative demand
for each product in each state, we divided the number of outlets in the state by the population of
that state from the US Census Bureau28, and then ranked the states by number of outlets per-
capita for each corporation. Finally, we compared the ranking of states from the CDB (relative
term frequency for the product keyword in each state29) to the ranking of the states by number of
outlets per-capita operated by the corporation in each state. Our results are shown in Table 430 –
correlations significant at the 90% confidence level are shown in bold. For brevity, and because
these experiments did not generate significant correlations, we have shown only a small
representative selection of industries. As can be seen in column 5 of Table 4, all industries were
indeed strongly population-sensitive as we had expected, with franchise outlets per state being
highly correlated with population of that state. Column 4 of Table 4 shows, however, that
statistically significant positive correlations between term frequency for the search term and per-
capita franchise outlets per state for the industry were seldom found. This indicates that the
CDB is not a credible instrument for discerning differences in per-capita demand for different
products between states in the population-sensitive industries in our experiments.
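The per-state normalization described above (hits per 1,000 words, footnote 26) can be sketched as follows; the counts in the usage test are invented for illustration.

```python
def rank_states(hits_by_state, words_by_state):
    """Rank states by term frequency: hits per 1,000 words, descending."""
    freq = {s: 1000.0 * hits_by_state[s] / words_by_state[s]
            for s in hits_by_state}
    return sorted(freq, key=freq.get, reverse=True)
```

Dividing by total words removes the bias whereby states with more locations, and hence more documents, would dominate on raw hit counts alone.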
28 http://www.census.gov/popest/states/NST-ann-est.html
29 Originally, we computed the raw total hit count for the product keyword in each state. However, we found that the strong correlation between raw total hit count and number of franchise outlets was spurious, since states with more locations had more documents, and hence more words and keywords. The authors are grateful to the reviewers for pointing out this issue. We therefore made use of an alternative metric – term frequency (hits per 1,000 words) – which is not skewed by population.
30 As the table is large, it has been split into 4 parts for readability.
6.2 Non-Population-Sensitive Studies
We regard non-population-sensitive studies as those which are governed by some natural
phenomenon, rather than by human population. Through a process of group brainstorming, we
identified a short-list of non-population-sensitive industries. For each industry, we identified an
independent external data source that provided state-specific metrics for that industry. We used
the CDB system to rank each state by the metric in question, and then we compared these
rankings with rankings established by the independent external data sources. Following are the
external data sources we gathered for each industry:
Wind energy : We obtained data from the Department of Energy (DoE), Energy Efficiency
and Renewable Energy, on current installed wind power capacity, in Megawatts, per state
as at Jan. 31st 200931. We also obtained data from the National Renewable Energy
Laboratory (NREL) on annual average wind resource estimates, in Megawatts, in the
contiguous United States32.
Solar energy : We obtained data on annual average daily solar radiation, in British
Thermal Units (BTUs) per square meter for a 10 tube solar collector, for each US state33.
Rain : We obtained data from the National Oceanic and Atmospheric Administration
(NOAA), on total inches of precipitation for 2008, for each state34. We also obtained data
from the National Atlas on average annual precipitation per square mile for each US
State from 1961 through 199035.
31 http://www.eere.energy.gov/windandhydro/windpoweringamerica/wind_installed_capacity.asp
32 http://rredc.nrel.gov/wind/pubs/atlas/maps/chap2/2-01m.html
33 http://www.thermomax.com/usdata.htm
34 http://cdo.ncdc.noaa.gov/cgi-bin/climaps/climaps.pl?directive=quick_search&subrnum=
35 http://nationalatlas.gov/printable/precipitation.html
Fishing : We obtained data from the United States Fish and Wildlife Service (USFWS),
on the number of non-resident fishing licenses issued per state36.
Coal, Gemstone, and Gold : We obtained data from the National Mining Association
(NMA) State Fact sheets, on the total number of mines, total production, and total
revenue, for coal, gemstones, and gold in each US state37.
Forests : We obtained data from the National Forest Service (NFS) on total forest acres
under administration in each state38.
Oil : We obtained data from the Energy Information Administration (EIA) on oil
production for each state39.
Mountain Climbing : Data on the highest elevations in each state was obtained from the
United States Geological Survey (USGS)40.
Eco-tourism and Gambling : Data on employment in these specialty occupations was
obtained from US Bureau of Labor Statistics Occupational Employment Statistics (OES)
state cross-industry estimates41.
After gathering external data for each industry, to allow us to rank and compare the states for
that industry, we compared the ranking of states using the external data, to a ranking of each
state by term-frequency for a search term for that industry using the CDB. Table 5 (column
3) shows the search term keywords we chose for each industry. In each case we computed
36 http://www.fws.gov/news/newsreleases/R9/A2D9B201-0350-4BD4-A73477A70A25FC69.html?CFID=3980850&CFTOKEN=92935320
37 http://www.nma.org/statistics/states_econ.asp
38 http://www.fs.fed.us/publications/documents/report-of-fs-2002-low-res.pdf
39 http://tonto.eia.doe.gov/dnav/pet/pet_crd_crpdn_adc_mbbl_a.htm Note: this document is no longer accessible on the EIA website but can be found in Google’s cache by searching for the URL using Google.
40 http://erg.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
41 http://www.bls.gov/oes/oes_dl.htm
(see column 4) the correlation – using Pearson’s ρ – between the state ranking using the
external data for that industry versus the CDB ranking of states by search term frequency
(hits for the search term per 1,000 words) for that industry. We also computed (column 5)
the correlation between the external data and the population for that state, to confirm whether
the industry was indeed non-population-sensitive. As before, correlations significant at the
90% confidence level are shown in bold. For instance, for the mountain climbing industry,
we find a strong (0.65) correlation between the ranking of states by USGS Elevation Data
and the ranking of states by term frequency for the term “mountain climbing”. As is evident
from Table 5, for non-population sensitive industries, we regularly find statistically
significant correlations between the ranking of the states by relative term frequency of the
search term using the CDB, and the ranking of the states using the external data. We
conclude that the CDB is an approximate, but viable, means of comparing states for the non-
population sensitive industries in our experiments, as the CDB rankings are plausible proxies
for rankings of the phenomena obtained from external data.
Given the large number of correlations run, the Bonferroni correction [12, 13] needs to
be applied to determine whether each result, when considered alone, is statistically
significant. After dividing the desired statistical confidence level (p = 0.1) by the number of
experiments (17), the actual p-values obtained are, in most cases, sufficiently low to
conclude that the correlation is significant.
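A worked illustration of the Bonferroni adjustment as applied here, dividing the desired level α = 0.1 by the 17 experiments; the p-values in the usage test are invented.

```python
def bonferroni_significant(p_values, alpha=0.1):
    """Flag each p-value against the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)   # e.g. 0.1 / 17 ≈ 0.0059
    return [p <= threshold for p in p_values]
```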
To determine whether the set of correlations, when taken together, is significant, a chi-
squared test (χ² test) can be performed42. For instance, at a 90% confidence level it is likely
that 10% of studies performed would, by chance, indicate correlations. A χ²-test can be
42 The Bonferroni adjustment is notoriously conservative and lacking in power, so a χ-squared test here is dispositive.
performed to reveal whether the actual number of correlated studies is significantly different
from 10%. Of the 17 non-population-sensitive industries investigated, 13 produced
statistically significant positive correlations. Though admittedly a small sample, the p-value
for this χ²-test (13 actual correlations obtained in 17 studies, compared to 1.7 correlations in 17
studies expected) is substantially less than 0.01, indicating strong statistical significance. We
conclude that the CDB appears to perform satisfactorily for non-population sensitive
industries.
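The χ²-test just described can be reproduced directly: 13 of 17 studies showed significant correlations, where chance at the 90% level would predict about 10%, i.e. 1.7 of 17.

```python
def chi_squared(observed, expected):
    """One-degree-of-freedom chi-squared statistic for two cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n, significant, p0 = 17, 13, 0.1
stat = chi_squared([significant, n - significant],
                   [n * p0, n * (1 - p0)])
# stat ≈ 83.5, far above 6.63 (the 1% critical value for df = 1),
# so p << 0.01, matching the conclusion in the text.
```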
6.3 Discussion
The CDB appears to produce comparisons of varying validity across population-sensitive
versus non-population-sensitive industries, with performance seemingly better on non-population-
sensitive industries. We speculate that this is the case because the distinction between states in
non-population-sensitive industries is far more stark than for population-sensitive industries, and
the CDB is a relatively approximate instrument that is only capable of discerning stark
differences. For instance, oil production and mountain climbing vary considerably between
states, whereas hamburger consumption does not, and the CDB is incapable of ascertaining this
more subtle distinction.
The reader may notice from the experiments that a number of important challenges remain
for the CDB, such as disambiguating multiple word senses, and determining whether the
statistics generated are indicative of consumer demand, or of market supply. We leave these
challenges for future work (see Section 8).
Attempting to provide quantitative assessments as to the magnitude of a market phenomenon
from qualitative text is not new. For instance, Romano et al [112] developed a qualitative data
analysis methodology that successfully predicts box office opening success based on pre-release
free-form comments about the movie on the web. Their work showed promising results. Our
findings, above, concur, and affirm Romano et al’s contention that an appropriate methodology
for the analysis of free-form text can reveal meaningful evaluative information of market
phenomena.
7. APPLICATIONS
CDBs have a number of useful applications, including market research, sales lead prospecting,
competitor or substitute identification, or exploring unfamiliar collections of topics or items.
For market research, the experiments shown in the previous section indicate that CDBs can be a
plausible means of assessing industry penetration by state in a number of industries, as the quantitative
data for certain industries has been shown to correlate with the CDB rankings for descriptive terms for
that industry. While the assessments produced by the CDB are certainly flawed, they are nevertheless
demonstrably better than random, and therefore possess some information value. The CDB method
should be generally useful when one wants a ranking of categories in non-population-sensitive
industries, can tolerate some error, and where independent data of good quality does not exist. While we
would not recommend that investments in product roll-out be fashioned directly around the CDB’s
findings, we see the CDB as a useful exploratory tool that is able to suggest locations of interest for
further investigation or trials. We have employed the CDB in an engagement with a growing pet
insurance company, PetPlan USA (http://www.gopetplan.com/). In our engagement, we compared the
locations of current customers of the company, with hit counts for ‘dog’ across all US locations, to
determine promising locations of future interest. This information, in conjunction with other
intelligence gathered by the organization, is used to inform PetPlan’s marketing strategy. However, as
the CDB provides only an incidental contribution to the overall marketing decisions, it is not possible to
attribute specific dollar benefit to the information obtained from the CDB in this case.
In the area of sales lead prospecting, Du Pont corporation, a Fortune 500 chemical company, has
experimented with our CDB for the identification of prospective markets and customers for their
products. In one of their exploratory investigations, Du Pont made use of a taxonomy of industries, the
North American Industry Classification System (NAICS), and searched for hits for various attributes of
a chemical surfactant they manufacture across those industries. To assist the team of business
development managers and engineers with their investigations, we implemented web-based
collaboration features to allow users to capture and share their comments on particular industries that
showed high scores (see §4.6 and Figure 11). For instance a business analyst who comments “this
industry is a large market with few barriers to entry and should be investigated further” may receive a
response from a chemical engineer stating “this industry is unfortunately not feasible as the surfactant is
not food-safe”. Though we cannot attribute any specific new revenue to the CDB, there is anecdotal
evidence that the CDB uncovered industries of interest: trial users at Du Pont reported that the CDB
uncovered unusual industries they had not previously considered as potential markets. One trial user
also reported receiving an unexpected contact from a company in an industry identified by the CDB as
interesting.
For reasons of confidentiality, the following example is fabricated, but illustrative of the process that
can be followed to find new sales prospects. Assume that a salesperson has identified, using a CDB,
that, in comparison to other industries, documents from the plastics packaging industry mention the
attributes of the chemical that she is trying to sell with unusual frequency. The salesperson concludes
that companies in the plastics packaging industry may be interested in her compound. The salesperson
is able to use the NAICS or SIC code for the plastics packaging industry to retrieve a list of potential
clients from a public source, such as the United States Securities and Exchange Commission (SEC)43:
Figure 16 shows a portion of the company listing she obtained in this way. The “Navigate” button in
Figure 16 allows the salesperson to select a prospect from the list and click the button to quickly
navigate to the company’s financial reports in order to further qualify the prospect. The salesperson has
successfully integrated knowledge gleaned from the CDB (unstructured data indicating that a certain
industry mentions her product with unusual frequency) with a structured data source (list of companies
in the identified candidate industry from the SEC), and has been rapidly able to identify a previously
unrealized lucrative target market, and construct a list of specific potential prospects.
In the area of competitor and substitute identification, we speculate that, when used in conjunction
with a taxonomy of industries or companies or products, CDBs can be used to identify particular
industries or companies or products that mention certain attributes with unusual frequency. We have
not, however, yet undertaken any academic or commercial trials in this sphere and are currently seeking
research partners to progress such studies.
In the area of exploring unfamiliar collections of topics or items, we speculate that the CDB may be
useful for uncovering topics or items with particular attributes amongst a large set of unfamiliar topics or
items. For example, when populated with the most relevant pages for a list of hospitals, the CDB could
be helpful in identifying hospitals with particular specialties (e.g. ‘cardiology’). Similarly, when
populated with the top pages for a list of universities or schools, the CDB could conceivably identify
those with particular attributes (e.g. universities with a specialty in ‘chemical engineering’, or schools
that frequently mention students going on to ‘Ivy League colleges’). The CDB would be especially
useful if the aggregate data the CDB produced from text were combined – ‘mashed up’ – with
43 http://www.sec.gov/. Similar data, listing the companies in a given industry, could also have been obtained from commercial sources, such as Hoovers, Dun & Bradstreet, Microsoft Money, Yellow Pages, or other alternatives.
structured data from other sources (see Section 4.5), to allow for multi-criterion decision-making. This
would, for example, allow a student to compare colleges offering ‘chemical engineering’ while
simultaneously looking at the annual fees and geographic locations for those colleges. Similarly, a
middle-school parent who would like to relocate nationally to a better school district for their child, may
be able to use a CDB to identify high schools reporting students going onto Ivy League colleges while
simultaneously looking at the median house price in the school’s neighborhood and the property taxes
for the county to assess affordability. Again, these applications are conjectured, and no exploratory
trials have been performed.
8. LIMITATIONS AND FUTURE WORK
As is evident from our earlier experimental evaluation (Section 6), the aggregate statistics obtained from
categorized text vary in their usefulness. This can be due to a number of factors, including ambiguous
terms, presence of negation and antonyms, alternative word forms, the tone of the text, intermingling of
information from unrelated themes or from multiple taxons in single documents, document duplication,
reporting biases, time and location specificity, relativity of reference points used for comparison (e.g.
40°F may be ‘warm’ for an Alaskan but not for a Floridian), human population biases, precision and
recall of the underlying search engine and relevance of its results, number of documents per taxon,
asymmetry in the number of child categories per parent taxon, disjoint phrases, distinguishing between
expressions of consumer demand and expressions of market supply, and other issues. Further
experiments are required to determine the influence of a number of possible alterations to our technique:
for example, using short snippets of text instead of full documents, only storing certain document types
for each category (e.g. only news articles for that category or only encyclopedia articles for that
category), or using more or fewer documents per category. For reasons of space, we leave it to future
work to comment in more detail on the CDB limitations described above, the mechanisms that can be
used to mitigate their influence, and the observed effects of alterations to our core technique.
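Several of these limitations are visible even in a deliberately naive aggregation routine. The sketch below uses hypothetical taxons and documents; it computes a simple per-taxon hit count, and the comments mark where negation and document duplication distort the statistic:

```python
# Hypothetical documents per taxon; not drawn from our corpus.
docs_by_taxon = {
    "Alaska": [
        "The winters are not warm here.",  # negation: still counted as a hit
        "Cold, cold winters.",
    ],
    "Florida": [
        "Warm weather all year.",
        "Warm weather all year.",          # duplicate document: doubles the count
    ],
}

def hit_count(taxon, phrase):
    """Naive aggregate: total occurrences of `phrase` in a taxon's documents."""
    return sum(doc.lower().count(phrase.lower()) for doc in docs_by_taxon[taxon])

for taxon in docs_by_taxon:
    print(taxon, hit_count(taxon, "warm"))
```

Mitigations such as negation handling and document de-duplication would change both counts here; their systematic treatment is deferred to future work.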
9. SUMMARY
In this paper, we have illustrated an approach for populating and exploring Categorized Document
Bases (CDBs). CDBs represent a helpful middle ground between unstructured and structured data, since
the documents are well-organized (categorized), though not structured. The CDB is a rough tool,
capable of producing plausible comparisons of categories against each other only when the categories
are starkly different (as in the case of many non-population-sensitive industries).
When setting up the CDB, it is important that category descriptors are unambiguous and that sufficient
highly relevant documents exist even for obscure categories in the classification scheme. The aggregate
statistics are independently useful, but can also be integrated with structured data for the categories – for
example, using bubble charts or tables, and using category identifiers to cross-reference from the
aggregate statistics for each document category to the traditional structured numeric data.
We assessed the reasonableness of our CDB approach through a number of experiments that
compared our aggregate results for each category to closely related numeric data, to determine whether
our proxy measures – aggregates derived from textual data – correlate at all with their quantitative
counterparts obtained from well-accepted structured sources. Our experiments seem to indicate that, for
a taxonomy such as the US GSA GLC list of US locations, where the CDB can be mechanically
populated with relevant content for each category, the CDB appears to produce a plausible reflection of
both natural and market phenomena in multiple industries, but only in industries where the locations
under comparison diverge substantially. The results, therefore, appear to partially support the
hypothesis at the start of this paper that our CDBs can allow mountains of text information on locations
to be distilled into sensible comparisons of those locations.
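The degree of agreement between a CDB-derived ranking and an independent numeric ranking can be quantified with a rank-correlation coefficient such as Kendall’s tau. A minimal implementation follows; the two value lists are hypothetical stand-ins for a CDB aggregate and its structured counterpart, not figures from our experiments:

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs.
    Assumes no tied values."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical values over the same five taxons: a text-derived aggregate
# (e.g. phrase hits per location) and an independently obtained quantity.
cdb_values = [120, 85, 60, 30, 10]
independent = [950, 700, 720, 200, 90]

print(kendall_tau(cdb_values, independent))
```

A tau near 1 means the text-derived aggregate orders the taxons much as the independent data does; a tau near 0 means little association.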
We described some applications of our research, including market research, sales lead prospecting,
and rapidly obtaining insights into new collections of topics or items. We also briefly documented a
number of limitations we have found in the CDB population and exploration process; helpful repairs
and alterations that improve the quality of the results are outside the scope of this paper and will
be discussed in detail in future work.
In summary, we have described and evaluated a method for the creation and exploration of
Categorized Document Bases, and shown, through varied experiments, that our method can be useful.
Our experiments indicate that the CDB method should be generally useful when one wants a ranking of
categories in non-population-sensitive industries, can tolerate some error, and lacks independent data of
good quality. It would appear that the CDB approach we have proposed is a promising means of
extracting additional value from textual documents, but much further work is needed to refine the CDB
construction and usage method presented.
10. ACKNOWLEDGMENTS
Our thanks to the following research assistants, who assisted with the implementation and evaluation of the features and
algorithms described in the main text:
taxonomy importation scripts: Jason Gurwin, Debbie Chiou.
population routines: Joseph Leary, Ryan Mark Fleming, Shawn Zhou.
results visualization features: David Gorski, Ryan Namdar, Adam Altman, Ankit Choudari, Michael Pan, Myron
Robinson, Mark Weinberger.
experimental evaluation: Shawn Zhou, Ava Zhiyang Yang, Aditya Mehrotra, Peng Chen, Anjay Kumar, Anjay
Aushij, Erik Malmgren-Samuel.
Thanks are also due to:
John Ranieri and Ray Miller, of Du Pont Corporation ( http://www.dupont.com/ ), for championing CDB
experiments within Du Pont, and providing feedback on the results.
Chris and Natasha Ashton, of PetPlan USA pet insurance ( http://www.gopetplan.com/ ), for the provision of
PetPlan’s dog health insurance sales data.
The reviewers, whose suggestions were very valuable in improving the content of this paper.
11. REFERENCES
1. Apte C.; Damerau F.; and Weiss S. Automated learning of decision rules for text categorization. ACM
Transactions on Information Systems, 12, 3 (July 1994), 233-240.
2. Apte C.; Damerau F.; and Weiss S. Text mining with decision trees and decision rules. In, Conference on
Automated Learning and Discovery, Pittsburgh, PA, June, 1998, pp.1-4.
3. Agirre E., and Edmonds P. (eds.) Word Sense Disambiguation: Algorithms and Applications. Dordrecht: Springer,
2007.
4. Attardi G.; Gulli A.; and Sebastiani F. Automatic web page categorization by link and content analysis. In,
Hutchinson C., and Lanzarone G. (eds.), Proceedings of the European Symposium on Telematics, Hypermedia, and
Artificial Intelligence (THAI-99), 1999, pp.105-119.
5. Allen RB.; Obry P.; and Littman M. An interface for navigating clustered document sets returned by queries. In,
Proceedings of the Conference on Organizational Computing Systems, Milpitas, CA, November 1-4, 1993, pp.166-
171.
6. Behal A.; Chen Y.; Kieliszewski C.; Lelescu A.; He B.; Cui J.; Kreulen J.; Rhodes J.; and Spangler W.S. Business
Insights Workbench – an interactive insights discovery solution. Lecture Notes in Computer Science, 4558, (2007),
834-843.
7. Brin, S. and Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh
International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
8. Borko H., and Bernick M. Automatic document classification. Journal of the ACM, 10, 2, (April 1963), 151-162.
9. Borko H., and Bernick M. Automatic document classification part II: additional experiments. Journal of the
ACM, 11, 2, (April 1964), 138-151.
10. Bhandarkar A.; Chandrasekar R.; Ramani S.; and Bhatnagar A. Intelligent categorization, archival and retrieval of
information. In, Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS ’89),
Bombay, India, December 11-13, 1989, Lecture Notes in Computer Science, 444, Springer, 1990, pp. 309-320.
11. Blair D.C. Searching biases in large interactive document retrieval systems. Journal of the American Society for
Information Science, 31, (July 1980), 271-277.
12. Bonferroni, C. E. “Il calcolo delle assicurazioni su gruppi di teste.” In Studi in Onore del Professore Salvatore
Ortu Carboni. Rome: Italy, pp. 13-60, 1935.
13. Bonferroni, C. E. “Teoria statistica delle classi e calcolo delle probabilità.” Pubblicazioni del R Istituto Superiore
di Scienze Economiche e Commerciali di Firenze 8, 3-62, 1936.
14. Butler D. Mashups mix data into global service. Nature, 439, (4 January 2006), 6-7.
15. Bot R.S.; Wu Y.B.; Chen X.; and Li Q. A hybrid classifier approach for web retrieved documents classification.
In, International Conference on Information Technology: Coding and Computing (ITCC), 2004, pp. 326-330.
16. Bot R.S.; Wu Y.B.; Chen X.; and Li Q. Generating better concept hierarchies using automatic document
classification. In, Conference on Information and Knowledge Management, 2005, pp. 281-282.
17. Chen H., and Dumais S. Bringing order to the web: automatically categorizing search results. In, Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems. The Hague, The Netherlands. April 1-6, 2000,
pp. 145-152.
18. Chim H., and Deng X. A new suffix tree similarity measure for document clustering. In, 16th International World
Wide Web Conference (WWW2007), Banff, Alberta, Canada, May 8-12, 2007, pp. 121-130.
19. Chen H., and Ho T.K. Evaluation of decision forests on text categorization. In, Proceedings of the 7th SPIE
Conference on Document Recognition and Retrieval, 2000, pp. 191-199.
20. Cutting D.R.; Karger D.R.; Pedersen J.O.; and Tukey J.W. Scatter/Gather: a cluster-based approach to browsing
large document collections. In, Proceedings of the 15th annual international ACM SIGIR conference on Research
and development in information retrieval. Copenhagen, Denmark, June 21-24, 1992, pp. 318-329.
21. Calvo R.A.; Lee J.M.; and Li X. Managing content with automatic document classification. Journal of Digital
Information, 5, 2, (2004), 1-15.
22. Chen M.; LaPaugh A.; and Singh J.P. Categorizing information objects from user access patterns. In, Proceedings
of the Eleventh International Conference on Information and Knowledge Management, 4-9 November 2002, pp.
365-372.
23. Croft W.B. Clustering large files of documents using the single link method. Journal of the American Society of
Information Science, 28, (1977), 341-344.
24. Croft W.B. Organizing and Searching Large Files of Documents. PhD thesis, University of Cambridge. (1978).
25. Cohen W.W., and Singer Y. Context-sensitive learning methods for text categorization. ACM Transactions on
Information Systems, 17, 2, (1999), 141-173.
26. Chen H.; Schuffels C.; and Orwig R. Internet categorization and search: a self-organizing approach. Journal of
Visual Communication and Image Representation, Special Issue on Digital Libraries, 7, 1, (1996), 88-102.
27. Chau R.; Yeh C.; and Smith K.A. A neural network model for hierarchical multilingual text categorization.
Advances in Neural Networks – ISNN 2005, Lecture Notes in Computer Science, 3497, (2005), 238-245.
28. Chung W, Chen H. and Nunamaker J. A Visual Framework for Knowledge Discovery on the Web: An Empirical
Study of Business Intelligence Exploration. Journal of Management Information Systems. 21(4). Spring 2005.
Pp. 57 – 84.
29. Dumais S., and Chen H. Hierarchical classification of web content. In, Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval . Athens, Greece,
July 24-28, 2000, pp. 256-263.
30. Dumais S.; Cutrell E.; and Chen H. Optimizing search by showing results in context. In, Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems, Seattle, WA, March 2001, pp. 277-284.
31. Dworman G.O.; Kimbrough S.O.; and Patch C. On pattern-directed search of archives and collections. Journal of
the American Society for Information Science, 51, 1, (2000), 14-23.
32. Dagan I.; Karov Y.; and Roth D. Mistake-driven learning in text categorization. In, Cardie C. and Weischedel R.
(eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997, pp.
55-63.
33. Datta A., and Thomas H. The cube data model: a conceptual model and algebra for on-line analytical processing
in data warehouses. Decision Support Systems, 27, 3, (1999), 289-301.
34. Eklund P.W., and Cole R.J. Information classification and retrieval using concept lattices. United States Patent
20060112108.
35. Eder J.; Krumpholz A.; Biliris A.; and Panagos E. Self-maintained folder hierarchies as document repositories. In,
Proceedings of the 2000 Kyoto International Conference on Digital Libraries: Research and Practice , Kyoto,
Japan, November 13-16, 2000, pp. 356-363.
36. Efthimiadis E.N. Query expansion. In, Martha E. Williams (ed.), Annual Review of Information Systems and
Technology (ARIST), 31, 1996, pp. 121-187.
37. Farkas J. Generating document clusters using thesauri and neural networks. In, Canadian Conference on
Electrical and Computer Engineering, 1994, pp. 710-713.
38. Fellbaum C. (ed.). WordNet: An electronic lexical database. Cambridge, Massachusetts: Bradford Books / MIT
Press, 1998.
39. Ferrari AJ, Gourley DJ. Johnson KA, Knabe FC, Mohta VB, Tunkelang D, and Walter JS. Hierarchical data-
driven search and navigation system and method for information retrieval. US Patent 7062483. June 13, 2006.
40. Fürnkranz J. Exploiting structural information for text classification on the WWW. In, Proceedings of the Third
International Symposium on Advances in Intelligent Data Analysis, August 1, 1997, pp. 487-498.
41. Geffner S.; Agrawal D.; El Abbadi A.; and Smith T. Browsing large digital library collections using classification
hierarchies. In, Proceedings of the Eighth International Conference on Information and Knowledge Management,
Kansas City, MO, November 2-6, 1999, pp. 195-201.
42. Gray J.; Chaudhuri S.; Bosworth A.; Layman A.; Reichart D.; and Venkatrao M. Data cube: a relational
aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1,
1, (1997), 29-53.
43. Gietz P. Report on automatic classification systems for the TERENA activity Portal Coordination. (19 June
2001). Available at: www.daasi.de/reports/Report-automatic-classification.html
44. Goren-Bar D., and Kuflik T. Supporting user-subjective categorization with self-organizing maps and learning
vector quantization. Journal of the American Society for Information Science and Technology, 56, 4, (2005), 345-
355.
45. Goren-Bar D.; Kuflik T.; and Lev D. Supervised learning for automatic classification of documents using self-
organizing maps. In, Proceedings of the First DELOS Network of Excellence Workshop on “Information Seeking,
Searching and Querying in Digital Libraries”, Zurich, Switzerland, December 11-12, 2000.
46. Garfield E.; Malin MV.; and Small H. A system for automatic classification of scientific literature. Journal of the
Indian Institute of Science, 57, 2, (1975), 61-74.
47. Golub K. Automated subject classification of web documents. Journal of Documentation, 62, 3, (2006), 350-371.
48. Godby J., and Stuler J. The Library of Congress Classification as a knowledge base for automatic subject
categorization. In, Subject Retrieval in a Networked Environment (IFLA Preconference), Dublin, Ohio, August
2001.
49. Guo G.; Wang H.; Bell D.; Bi Y.; and Greer K. KNN model-based approach in classification. In, Proceedings of
the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Catania,
Sicily, Italy, 3-7 November 2003. Lecture Notes in Computer Science, 2888, Springer-Verlag, 2003, pp. 986-996
50. Guo G.; Wang H.; Bell D.; Bi Y.; and Greer K. An kNN model-based approach and its application in text
categorization. In, Proceedings of the 5th International Conference on Intelligent Text Processing and
Computational Linguistics (CICLing), 2004, Lecture Notes in Computer Science, 2945, Springer-Verlag, pp. 559-
570.
51. Hearst, M. Clustering versus Faceted Categories for Information Exploration. Communications of the ACM. 49
(4). April 2006.
52. Hearst M, English J, Sinha R, Swearingen K, and Yee P. Finding the Flow in Web Site Search. Communications
of the ACM. 45 (9). September 2002. Pp.42-49.
53. Holland JM, Kreulen JT, and Spangler WS. Method and system for identifying relationships between text
documents and structured variables pertaining to the text documents. US Patent 7155668. December 26, 2006.
54. Huffman S., and Damashek M. Acquaintance: a novel vector-space n-gram technique for document categorization.
In, Proceedings of the 3rd Text Retrieval Conference (TREC 3), 1994, pp. 305-310.
55. Hatzivassiloglou V.; Gravano L.; and Maganti A. An investigation of linguistic features and clustering algorithms
for topical document clustering. In, Proceedings of the 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Athens, Greece, 2000, pp. 224-231.
56. Hussin M.F., and Kamel M. Document clustering using hierarchical SOMART neural network. In, Proceedings of
the International Joint Conference on Neural Networks, 20-24 July 2003, pp. 2238 – 2242.
57. Hayes P.; Knecht L.E.; and Cellio M.J. A news story categorization system. In, Second Conference on Applied
Natural Language Processing (ANLP-88), 1988, pp. 9-17. Reprinted in Sparck-Jones K., and Willett P. (eds.),
Readings in Information Retrieval, San Francisco, CA: Morgan Kaufmann, 1997, pp. 518-526.
58. Han E.H.; Karypis G.; and Kumar V. Text categorization using weight adjusted k-nearest neighbor classification.
In, Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001, pp. 53-65.
59. Hofmann T. The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In,
Proceedings of the International Joint Conference in Artificial Intelligence, 1999, pp. 682 – 687.
60. Huynh D, Mazzochi S, and Karger D. Piggy Bank: Experience the Semantic Web Inside your Web Browser.
Journal of Web Semantics. 5(1). 2007. Pp. 16-27.
61. Iwayama M., and Tokunaga T. Hierarchical Bayesian clustering for automatic text classification. In, Proceedings
of the International Joint Conference in Artificial Intelligence, 1995, pp. 1322-1327.
62. Ide N. and Veronis J. Word sense disambiguation: the state of the art. Computational Linguistics, 24, 1, (1998), 1-
40.
63. Jenkins C.; Jackson M.; Burden P.; and Wallis J. Automatic classification of web resources using Java and Dewey
decimal classification. Computer Networks and ISDN Systems, 30, (1998), 646-648.
64. Joachims T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In,
Proceedings of the Fourteenth International Conference on Machine Learning, July 8-12, 1997, pp. 143-151.
65. Joachims T. Text categorization with Support Vector Machines: Learning with many relevant features. In,
Machine Learning: ECML-98, Tenth European Conference on Machine Learning, 1998, pp. 137-142.
66. Jardine N., and van Rijsbergen C.J. The use of hierarchical clustering in information retrieval. Information
Storage and Retrieval, 7, (1971), 217-240.
67. Joachims T., and Sebastiani F. (eds.). Automated text categorization (special issue), Journal of Intelligent
Information Systems, 18, (March-May 2002), 2-3.
68. Käki M. Findex: search result categories help users when document ranking fails. In, Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, Portland, OR, April 2-7, 2005, pp. 131-140.
69. Ko S.J.; Choi J.H.; and Lee J.H. Bayesian web document classification through optimizing association word. In,
Proceedings of the 15th International Conference in Applied Artificial Intelligence, Laughborough, UK, Lecture
Notes in Computer Science, 2718, (2003), 565-574.
70. Koch T.; Day M.; Brümmer A.; Hiom D.; Peereboom M.; Poulter A.; and Worsfold E. The role of classification
schemes in internet resource description and discovery. In, Work Package 3 of Telematics for Research project
Development of a European Service for Information on Research and Education (DESIRE) (RE 1004), 1999.
71. Kendall M. A New Measure of Rank Correlation. Biometrika, 30, (1938), 81-89.
72. Kendall M. Rank Correlation Methods. London: Charles Griffin & Company Limited, 1948.
73. Knepper M.M.; Fox K.L.; and Frieder O. Method for domain identification of documents in a document database.
United States Patent 20060206483.
74. Kules B.; Kustanowitz J.; and Shneiderman B. Categorizing web search results into meaningful and stable
categories using fast-feature techniques. In, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
Libraries, Chapel Hill, NC, June 11-15, 2006, pp. 210-219.
75. Koller D., and Sahami M. Rule-based hierarchical document categorization for the World Wide Web. In,
Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 170-178.
76. Kita K.; Sasaki M.; and Ying T.X. Rule-based hierarchical document categorization for the World Wide Web. In,
Asia Pacific Web Conference, 1998, pp. 269-273.
77. Kummamuru K.; Lotlikar R.; Roy S.; Singal K.; and Krishnapuram R. A hierarchical monothetic document
clustering algorithm for summarization and browsing search results. In, Proceedings of the 13th international
conference on World Wide Web, New York, NY, May 17-20, 2004, pp. 658-665.
78. Li X., and Calvo R.A. Hierarchical document classification using I bayes. In, 8th Australasian Document
Computing Symposium, CSIRO, Canberra, December 2003.
79. Leouski A.V., and Croft W.B. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-
76, Amherst: Department of Computer Science, University of Massachusetts, 1996.
80. Larkey L., and Croft W.B. Combining classifiers in text categorization. In, Proceedings of the 19th International
Conference on Research and Development in Information Retrieval, 1996, pp. 289-297.
81. Li Q.; Chen X.; Bot R.S.; and Wu Y.B. Improving concept hierarchy development for web returned documents
using automatic classification. In, International Conference on Internet Computing, 2005, pp. 99-105.
82. Labrou Y., and Finin T. Yahoo! As an ontology: using Yahoo! Categories to describe documents. In, Proceedings
of the 8th International Conference on Information and Knowledge Management (CIKM-99), Kansas City, MO,
November 2-6, 1999, pp. 180-187.
83. Lewis D.D., and Gale W.A. A sequential algorithm for training text classifiers. In, Croft W.B. and van Rijsbergen
C.J. (eds.), Proceedings of the 17th ACM International Conference on Research and Development in Information
Retrieval (SIGIR ’94), Dublin, Ireland, 1994, pp. 3-12.
84. Lam W., and Ho C.Y. Using a generalized instance set for automatic text categorization. In, Proceedings of the
21st International Conference on Research and Development in Information Retrieval (SIGIR ’98), Melbourne,
Australia, 1998, pp. 81-89.
85. Liang J.Z. SVM multi-classifier and Web document classification. In, Proceedings of the 2004 International
Conference on Machine Learning and Cybernetics, Volume 3, August 26-29, 2004, pp. 1347-1351.
86. Li Y.H., and Jain A.K. Classification of text documents. The Computer Journal, 41, 8, (1998), 537-546.
87. Li Y., and Lan Z. A survey of load balancing in grid computing. Lecture Notes in Computer Science, 3314,
(2005), 280-285.
88. Li W.; Lee B.; Krausz F.; and Sahin K. Text classification by a neural network. In, Proceedings of the. 23rd
Annual Summer Computer Simulation Conference, Baltimore, MD July 22-24, 1991, pp. 313-318.
89. Lewis D.D., and Ringuette M. A comparison of two learning algorithms for text categorization. In, Third Annual
Symposium on Document Analysis and Information Retrieval, 1994, pp. 81-93.
90. Lehnert W.S.; Soderland S.; Aronow D.; Feng F.; and Shmueli A. Inductive text classification for medical
applications. Journal for Experimental and Theoretical Artificial Intelligence, 7, 1, (1995), 49–80.
91. Lewis D.D.; Schapire R.E.; Callan J.P.; and Papka R. Training algorithms for linear text classifiers. In,
Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, Zurich, Switzerland, August 18-22, 1996, pp. 298-306.
92. Lin S.H.; Shih C.S.; Chen M.C.; Ho J.M.; Ko M.T.; and Huang Y.M. Extracting classification knowledge of
internet documents with mining term associations: a semantic approach. In, Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval , Melbourne,
Australia, August 24-28, 1998, pp. 241-249.
93. Lewis D.D.; Yang Y.; Rose T.G.; and Li F. RCV1: A new benchmark collection for text categorization research.
Journal of Machine Learning Research, 5, (2004), 361-397.
94. MacLeod K. An application specific neural model for document clustering. In, Proceedings of the Fourth Annual
Parallel Processing Symposium, 1, 1990, pp. 5-16.
95. Marchionini G. Exploratory search: from finding to understanding. Communications of the ACM. 49 (4). April
2006. Pp 41-46.
96. McGarry K. A survey of interestingness measures for knowledge discovery. The Knowledge Engineering Review,
20, 1, (March 2005), 39-61.
97. Möller G.; Carstensen K.U.; Diekman B.; and Watjen H. Automatic classification of the World Wide Web using
Universal Decimal Classification. In, McKenna B. (ed.), 23rd International Online Information Meeting (London,
England), Oxford: Learned Information Europe, 1999, pp. 231-237.
98. Markov A., and Last M. A simple, structure-sensitive approach for web document classification. Advances in
Web Intelligence. Lecture Notes in Computer Science, 3528, (2005), 293-298.
99. Maarek Y.S., and Wecker A.J. The Librarian’s Assistant: automatically organizing on-line books into dynamic
bookshelves. In, Proceedings of Intelligent Multimedia Information Retrieval Systems and Management (RIAO
’94), New York, NY, October 11-13, 1994.
100. Nigam K.; McCallum A.; Thrun S.; and Mitchell T. Text classification from labeled and unlabeled documents
using EM. Machine Learning, 39, 2/3, (2000), 103-134.
101. Papka, R., and Allan J. Document classification using multiword features. In, Proceedings of the 7th International
Conference on Information and Knowledge Management, Bethesda, MD, 1998, pp. 124-131.
102. Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia.
Philosophical Transactions of the Royal Society of London, 187, (1896), 253-318.
103. Pierre J.M. On the automated classification of web sites. Linkoping Electronic Articles in Computer and
Information Science, 6, 1, (2001), 1-12.
104. Pollitt S. The key role of classification and indexing in view-based searching. 63rd IFLA General Conference and
Council. International Federation of Library Associations and Institutions (IFLA). Copenhagen, Denmark. 31
August – 5 September 1997.
105. van Rijsbergen C.J. Information Retrieval, 2nd Edition. London: Butterworths, 1979.
106. Riloff E. Little words can make a big difference for text classification. In, Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval , Seattle, WA, July
9-13, 1995, pp. 130-136.
107. Riloff E. Using learned extraction patterns for text classification. In, Wermter S., Riloff E., and Scheler G. (eds),
Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Berlin:
Springer-Verlag, 1996, pp. 275-289.
108. Riloff E., and Lehnert W. Classifying texts using relevancy signatures. In, Proceedings of the Tenth National
Conference on Artificial Intelligence, 1992, pp. 329-334.
109. Riloff E., and Lehnert W. Information extraction as a basis for high-precision text classification. ACM
Transactions on Information Systems, 12, 3, (July 1994), 296-333.
110. Rocchio J.J. Relevance feedback in information retrieval. In, Salton G. (ed.), The SMART Retrieval System:
Experiments in Automatic Document Processing, Englewood Cliffs, NJ: Prentice-Hall, 1971, pp. 313-323.
111. Rodden K. About 23 million documents match your query… In, ACM Conference on Human Factors in
Computing Systems (ACM CHI’98), Los Angeles, CA, April 1998, pp. 64-65.
112. Romano NC, Donovan C, Chen H, and Nunamaker J. A Methodology for Analyzing Web-Based Qualitative Data.
Journal of Management Information Systems. 19(4). Spring 2003. Pp. 213 – 246.
113. Ruiz M., and Srinivasan P. Hierarchical text categorization using neural networks. Information Retrieval, 5, 1,
(2002), 87-118.
114. Schraefel MMC, Wilson M, Russell A, and Smith DA: mSpace: improving information access to multimedia
domains with multimodal exploratory search. Communications of the ACM. 49(4). April 2006. Pp. 47-49.
115. Siegel S., and Castellan N.J. Nonparametric Statistics for the Behavioral Sciences, 2nd edition. London: McGraw-
Hill, 1988.
116. Schutze H. Automatic word sense discrimination. Computational Linguistics, 24, 1, (1998), 97-123.
117. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 34, 1, (2002), 1-47.
118. Sebastiani F. Text categorization. In, Alessandro Zanasi (ed.), Text Mining and its Applications to Intelligence,
CRM and Knowledge Management, Southampton, UK: WIT Press, 2005, pp. 109-129.
119. Shirazi B.A.; Kavi K.M.; and Hurson A. Scheduling and Load Balancing in Parallel and Distributed Systems.
Los Alamitos, CA: IEEE Computer Society Press, 1995.
120. Shafer K.E. Scorpion helps catalog the Web. Bulletin of the American Society for Information Science, 24, 1,
(October/November 1997), 28-29.
121. Spangler S and Kreulen J. Mining the Talk: Unlocking the Business Value in Unstructured Information. IBM
Press. 2008.
122. Spangler S, Kreulen JT, and Lessler J. Generating and Browsing Multiple Taxonomies Over a Document
Collection. Journal of Management Information Systems. 19(4). Spring 2003. Pp. 191 – 212
123. Spearman C. The proof and measurement of association between two things. American Journal of Psychology, 15,
(1904), 72–101. Reprinted in: The American Journal of Psychology, 100, 3/4, Special Centennial Issue, (Autumn –
Winter, 1987), 441-471.
124. Slonim N., and Tishby N. The power of word clusters for text classification. In, Proceedings of ECIR-01, 23rd
European Colloquium on Information Retrieval Research, Darmstadt, Germany, 2001, pp. 1-12.
125. Sun A.; Lim E.P.; and Ng W.K. Hierarchical text classification and evaluation. In, IEEE International
Conference on Data Mining (ICDM), San Jose, CA, Nov 29-Dec 2, 2001, pp. 521-528.
126. Svingen B. Using genetic programming for document classification. In, Proceedings of the 11th International
Florida Artificial Intelligence Research Society Conference (FLAIRS98), 1998, pp. 63-67.
127. Toth E. Innovative solutions in automatic classification: a brief summary. Libri, 52, 1, (2002), 48-53.
128. Thompson R.; Shafer K.E.; and Vizine-Goetz D. Evaluating Dewey Concepts as a Knowledge Base for Automatic
Subject Assignment. In, 2nd ACM International Conference on Digital Libraries, Philadelphia, PA, 1997, pp. 37-
46.
129. Vlajic N., and Card H.C. Categorizing Web pages using modified ART. In, Canadian Conference on Electrical
and Computer Engineering, Volume 1, 1998, pp. 313-316.
130. Wang Y.; Hodges J.; and Tang B. Classification of Web documents using a naïve Bayes method. In, Proceedings
of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2003, p. 560.
131. Wei C-P, Chiang RHL, and Wu CC. Accommodating Individual Preferences in the Categorization of Documents:
A Personalized Clustering Approach. Journal of Management Information Systems. 23(2). Fall 2006. pp. 173 –
201.
132. Wei C-P, Hu PJ, and Le Y-H. Preserving User Preferences in Automated Document-Category Management: An
Evolution-Based Approach. Journal of Management Information Systems. 25(4). Spring 2009. pp. 109 – 143.
133. Willett P. Recent trends in hierarchic document clustering: A critical review. Information Processing and
Management, 24, 5, (1988), 577-597.
134. Worsfold E. Subject gateways – fulfilling the DESIRE for knowledge. Computer Networks and ISDN Systems, 30,
16, (30 September 1998), 1479-1489.
135. Wiener E.D.; Pedersen J.O.; and Weigend A.S. A neural network approach to topic spotting. In, Proceedings of
4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR-95), Las Vegas, NV, 1995, pp.
317-332.
136. Wu Y.B.; Shankar L.; and Chen X. Finding more useful information faster from web search results. In,
Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, LA,
2003, pp. 568-571.
137. Yang Y. An evaluation of statistical approaches for text categorization. Journal of Information Retrieval, 1, 1-2,
(1999) 67-88.
138. Yang Y., and Liu X. A re-examination of text categorization methods. In, Hearst M.A., Gey F., and Tong R. (eds.),
Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information
Retrieval, Berkeley, CA: ACM Press, 1999, pp. 42-49.
139. Yang Y.; Slattery S.; and Ghani R. A study of approaches to hypertext categorization. Journal of Intelligent
Information Systems, 18, 2-3, (March, 2002), 219-241.
140. Zamir O., and Etzioni O. Web document clustering: a feasibility demonstration. In, Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne,
Australia, August 24-28, 1998, pp. 46-54.
141. Zamir O., and Etzioni O. Grouper: a dynamic clustering interface to Web search results. Proceeding of the Eighth
International Conference on World Wide Web, Toronto, Canada, May 1999, pp. 1361-1374.
142. Zamir O.; Korn J.; Fikes A.; and Lawrence S. Personalization of placed content ordering in search results. United
States Patent Application 0050250580. Patent ID EP 1782286A1. Issued May 9, 2007.
143. Zeng H.J.; He Q.C.; Chen Z.; Ma W.Y.; and Ma J. Learning to cluster web search results. In, Proceedings of the
27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,
Sheffield, UK, July 25-29, 2004, pp. 210-217.
FIGURES
Figure 1: Results produced by exploratory search on ‘fishing’ using clusty.com
Figure 2: Results produced by exploratory search on ‘fishing’ using kartoo.com
Figure 3: Results produced by exploratory search on ‘US locations fishing’ using
clusty.com
Figure 4: Hits for ‘fishing’ by state, from a CDB populated with American cities
Figure 5: Population of categories with relevant documents
Figure 6: Screen from our software prototype, showing word suggestion tool to allow user
to select an expansion or disambiguation of the current search term
Figure 7: Computing hits for each category
Figure 8: Relative prevalence of the terms “smoothness”, “strength”, and [“wet” OR “damp”] in
various stone quarrying industry segments. Three bars are shown for each industry segment: from
left to right, they give the relative prevalence of “smoothness”, “strength”, and [“wet” OR
“damp”] in that segment.
Figure 9: Excel-based collapsible tree view provided for the exploration of
aggregate statistics (e.g. hits) per category in various taxonomies
Figure 10: Web-based collapsible tree view provided for the exploration of
aggregate statistics (e.g. hits) per category, in the UNSPSC taxonomy
Figure 11: Collaborative annotation interface for sharing of human observations on interesting
categories amongst a team: illustration of users sharing comments on possible applications of a
biodegradable molecule with foam reduction properties.
Categories
    Column Name            Data Type
    ParentID               Int(11) (Primary Key)
    CategoryID             Int(11)
    CategoryName           Char(255)
    Flag                   Int(11)
    DateCompleted          DateTime

CategoryAssignment
    Column Name            Data Type
    CategoryID             Int(11) (Primary Key)
    DateAndTimeAssigned    Timestamp
    IPAddressOfServant     Char(255)
    ServantComputerName    Char(255)

CategoryCompleted
    Column Name            Data Type
    CategoryID             Int(11) (Primary Key)
    DateAndTimeCompleted   Timestamp
    IPAddressOfServant     Char(255)
    ServantComputerName    Char(255)

Documents
    Column Name            Data Type
    DocumentID             Int(11) (Primary Key)
    DocumentURL            Text
    CategoryID             Int(11)
    DateCompleted          DateTime

Lexicon
    Column Name            Data Type
    WordSenseID            Int(11) (Primary Key)
    WordText               Char(255)

Words
    Column Name            Data Type
    DocumentID             Int(11) (Primary Key)
    WordSenseID            Int(11)
    WordPosition           Int(11)

Figure 12: Database tables used to represent and index documents in the CDB
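To make the role of these tables concrete, the sketch below builds the Documents, Lexicon, and Words tables from Figure 12 and computes a per-category hit count with a join. It is illustrative only: SQLite stands in for the MySQL-style Int(11)/Char(255) types shown in the figure, and the sample rows and query are assumptions rather than the prototype's actual code.

```python
import sqlite3

# In-memory sketch of three of the Figure 12 tables (names follow the figure;
# column types are simplified for SQLite).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Documents (DocumentID INTEGER PRIMARY KEY, DocumentURL TEXT,
                        CategoryID INTEGER, DateCompleted TEXT);
CREATE TABLE Lexicon   (WordSenseID INTEGER PRIMARY KEY, WordText TEXT);
CREATE TABLE Words     (DocumentID INTEGER, WordSenseID INTEGER,
                        WordPosition INTEGER);
""")
# Two toy documents in category 7, indexed word by word through the Lexicon.
cur.executemany("INSERT INTO Lexicon VALUES (?, ?)",
                [(1, "fishing"), (2, "boat")])
cur.executemany("INSERT INTO Documents VALUES (?, ?, ?, NULL)",
                [(10, "http://example.com/a", 7),
                 (11, "http://example.com/b", 7)])
cur.executemany("INSERT INTO Words VALUES (?, ?, ?)",
                [(10, 1, 0), (10, 2, 1), (11, 1, 0), (11, 1, 1)])
# Hit count for 'fishing' in category 7: join Words -> Documents -> Lexicon.
# This is the kind of per-category aggregate the CDB reports.
cur.execute("""
    SELECT COUNT(*) FROM Words w
    JOIN Documents d ON d.DocumentID = w.DocumentID
    JOIN Lexicon   l ON l.WordSenseID = w.WordSenseID
    WHERE d.CategoryID = 7 AND l.WordText = 'fishing'
""")
hits = cur.fetchone()[0]
```

Storing one row per word occurrence (the Words table) is what makes both hit counts and per-category word totals cheap to aggregate with a single query.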
Request
    Column Name             Data Type
    RequestID               Int(11) (Primary Key)
    UserID                  Char(255)
    TopCategoryID           Int(11)
    SearchPhrase            Char(255)
    DateAndTimeRequested    TimeStamp
    DateAndTimeAssigned     DateTime
    DateAndTimePopulated    DateTime

RequestedData
    Column Name             Data Type
    RequestID               Int(11) (Primary Key)
    CategoryID              Int(11)
    SearchPhrase            Char(255)
    WordCount               Int(11)
    DocumentCount           Int(11)
    Servant                 Char(255)
    DateAndTime             DateTime

Figure 13: Database tables used to store and cache query results from the CDB
User
    Column Name         Data Type
    UserID              Int(11) (Primary Key)
    Username            Char(255)
    Company             Char(255)
    Email               Char(255)

Project
    Column Name         Data Type
    ProjectID           Int(11) (Primary Key)
    Project Name        Char(255)
    CreatedByUserID     Int(11)

Annotation
    Column Name         Data Type
    Annotation_ID       Int(11) (Primary Key)
    Category_ID         Int(11)
    Project_ID          Int(11)
    CreatedByUserID     Int(11)
    Application         Varchar(500)
    Notes               Varchar(600)
    Review              Varchar(60)
    Rank                Int(11)
    DateAndTime         DateTime

SharedAnnotation
    Column Name         Data Type
    AnnotationID        Int(11) (Primary Key)
    SharedWithUserID    Int(11)

Figure 14: Database tables used for collaborative annotation features of the CDB
Figure 15: Bubble chart showing the integration of aggregate data gleaned from text with
structured data. The X-axis represents the asset turnover for the industry (i.e. category);
the Y-axis is the relative prevalence of the search term “biodegradable” in each category.
Figure 16: Sample of publicly traded companies in the plastics packaging industry,
obtained from the SEC, after a CDB exploration revealed that documents in the plastics
packaging industry frequently mention a compound being marketed by a chemical
industry salesperson.
TABLES
Type of Taxonomy              Examples

Product hierarchies           United Nations Standard Products and Services Code (UNSPSC) *
                              United Nations Central Product Classification (CPC)
                              United States Patent Classification (USPTO USPC) *
                              International Patent Classification (IPC)
                              Proprietary corporate product catalogues (e.g. Amazon, Wal-Mart, Sears, or any
                              other catalogue defined by any large or small company)

Industry taxonomies           North American Industry Classification Scheme (NAICS) †
                              United States Standard Industrial Classifications (SIC) †
                              International Standard Industrial Classification (ISIC)
                              SITC3 (Standard International Trade Classification)

Company classifications       Fortune 500 † and Fortune 1000
                              S&P 500 †
                              Inc. 500 † and Inc. 5000
                              Entrepreneur Magazine’s Franchise 500
                              Internet Retailer 500

Activity taxonomies           WordNet (verb relationships) †
                              United States Bureau of Labor Statistics Standard Occupational Classification
                              System (SOC) †

Place (Location) taxonomies   United States General Services Administration Geographic Locator Codes
                              (US GSA GLC) / Geographic Names Service *
                              United States Direct Marketing Areas (DMA) †
                              Getty Thesaurus of Geographic Names (TGN)

Time taxonomies               ISO †

Topic taxonomies              Library of Congress Classification system (LoC) †
                              UK Joint Academic Coding System (JACS) †
                              UK Higher Education Standard Authority Coding (HESACODE) †

Medical taxonomies            International Classification of Diseases (e.g. ICD10) †
                              International Classification of Primary Care (ICPC)
                              Current Procedural Terminology (CPT)
                              US FDA Classification of Medical Devices †

Table 1: Popular Classification Schemes
† indicates that the taxonomy (category names and relations) was imported into our prototype system,
and a random selection of approximately 10% of categories was populated with documents.
* indicates that the taxonomy was imported into our prototype system and all categories were populated
with documents.
Table 2: Absolute Hits for a Number of Search Terms, by Document Category
Type of Taxonomy                 Examples of data indexed using standard taxonomies

Product hierarchies              Sales data for each product category, from an internal company database,
                                 indexed by product category (e.g. UNSPSC, or UCC Stock Keeping Unit [SKU]).

Industry taxonomies              Industry size figures from the Bureau of Economic Analysis (BEA.gov), or
                                 from the Internal Revenue Service (IRS.gov), indexed by NAICS code.

Company classifications          Company profit figures, from the Securities and Exchange Commission
                                 (SEC.gov), indexed by NAICS code.

Activity / Employee taxonomies   Salary data for each profession, from the Bureau of Labor Statistics
                                 (BLS.gov), indexed by SOC occupation classification.

Place (Location) taxonomies      Population, land area, and other geographic data from the United States
                                 Geological Survey (USGS.gov), indexed by United States General Services
                                 Administration Geographic Locator Code (US GSA GLC).

Time taxonomies                  Sales data for each date, from an internal company database, indexed by
                                 time.

Topic taxonomies                 Enrollment data for each academic subject, from the National Center for
                                 Education Statistics (NCES.ed.gov), indexed by educational field.

Medical taxonomies               Infection rate, for each illness, in each area, indexed by ICD9 or ICD10
                                 disease code.

Table 3: Structured Data Sources for Various Taxonomies
Corporate          Total US            CDB Search      Pearson’s r: per-capita franchise    Population
Franchise          Franchise Outlets   Terms Used      outlets per state vs CDB search      Correlation 45
                                                       term frequency for state 44

McDonalds          11,318              “burger”        -0.25                                0.98
                                       “hamburgers”    -0.15
Pizza Hut          5,676               “pizza”          0.04                                0.34
KFC                4,378               “chicken”       -0.17                                0.95
Intercontinental   3,023               “hotels”        -0.34                                0.93
Starbucks          9,869               “coffee”        -0.21                                0.90
RE/MAX             4,628               “property”       0.15                                0.95
Supercuts          1,644               “hair”          -0.35                                0.79
Jackson Hewitt     2,475               “tax”            0.10                                0.87
Carlson Wagonlit   340                 “travel”         0.26                                0.87
                                       “flight”         0.13
Jiffy Lube         1,923               “car”           -0.10                                0.82
Miracle Ear        1,349               “hearing”        0.02                                0.89

Table 4: Summary of Experimental Results – Selected Population-Sensitive Industries
44 For each study shown in Table 4, fifty-one (51) pairs of data points – one pair of rankings for each U.S. state, plus the District of Columbia – were compared. For 51 data points, the Pearson r required for statistical significance at the weaker 90% confidence level is > 0.231, and at the stronger 99% confidence level the threshold required is > 0.354. Therefore, in all cases where r > 0.354 we can conclude that it is highly unlikely that the correlation between the two rankings being compared occurred by chance.
45 ‘Population Correlation’ is the Pearson r-value found when correlating the ranking of the states by the US franchise outlets in that industry with the ranking of the states by their population. Population figures were obtained from the US Census Bureau.
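The significance check described in footnote 44 can be sketched in plain Python. The `pearson_r` helper and the toy rankings below are illustrative assumptions; only the thresholds (0.231 at the 90% confidence level, 0.354 at the 99% level, for 51 paired observations) come from the footnote.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Thresholds quoted in footnote 44 for n = 51 (50 states plus DC).
WEAK_90, STRONG_99 = 0.231, 0.354

# Toy rankings only: two rankings that agree perfectly give r = 1.0.
state_rank_external = list(range(1, 52))   # ranking from independent data
state_rank_cdb = list(range(1, 52))        # ranking from CDB term frequency
r = pearson_r(state_rank_external, state_rank_cdb)
significant_99 = r > STRONG_99             # correlation unlikely to be chance
```

In practice one would substitute the two real 51-element state rankings being compared; any r above 0.354 clears the stronger threshold in the footnote.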
Industry            External Data Used                           CDB Search Term Used   Pearson’s r 46   Population Correlation 47

Wind energy         DoE wind generating capacity                 “windy”                 0.07             0.10
                    NREL wind resource availability              “windy”                 0.25            -0.36
Solar energy        Thermomax solar energy (BTUs)                “warm”                 -0.11            -0.09
                                                                 “sunny”                -0.28
                                                                 “sunshine”              0.22
Rain                NOAA precipitation per square mile 2008      “rain”                  0.29            -0.02
                    NationalAtlas.gov 1961-1990                  “rain”                  0.27             0.01
Fishing             USFWS Non-resident fishing licenses sold     “fishing”               0.46             0.19
Coal                NMA Number of coal mines                     “coal”                  0.75             0.18
                    NMA Coal production                          “coal”                  0.74             0.08
Gemstone            NMA Gemstone production                      “gemstone”              0.30             0.09
Gold                NMA Gold revenues                            “gold”                  0.29            -0.06
Forests             NFS Forest area                              “forest”                0.30             0.18
Oil                 EIA Oil production                           “oil”                   0.39            -0.04
Mountain climbing   USGS Elevation Data                          “mountain climbing”     0.65            -0.10
Eco-tourism         USBLS Number of eco-tourism employees        “ecotourism”            0.39             0.35
Gaming              USBLS Number of game dealers                 “gambling”              0.29             0.32

Table 5: Summary of Experimental Results – Non-Population-Sensitive Industries
46 Pearson’s r here is the correlation obtained by comparing the state ranking using external data to the CDB ranking of states by search term frequency (search term hits per 1,000 words).
47 ‘Population Correlation’ is the Pearson r-value found when correlating the ranking of the states from the external data with the ranking of the states by their population. Population figures were obtained from the US Census Bureau. Note that the eco-tourism and gaming industries display some population sensitivity, though much milder than the population-sensitive industries studied in Table 4.
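The metric defined in footnote 46 (search term hits per 1,000 words of category text) is simple to compute once per-category hit and word totals are available, for example from the WordCount field of the RequestedData table in Figure 13. The per-state counts below are hypothetical numbers chosen purely for illustration.

```python
def hits_per_thousand_words(term_hits, total_words):
    """Relative term frequency as in footnote 46: hits per 1,000 words.
    Categories with no indexed text are given a frequency of zero."""
    if total_words == 0:
        return 0.0
    return 1000.0 * term_hits / total_words

# Hypothetical (hits, total words) per state for the term 'coal'.
state_counts = {"WY": (180, 250_000), "WV": (140, 300_000), "FL": (12, 400_000)}
freq = {s: hits_per_thousand_words(h, w) for s, (h, w) in state_counts.items()}
# States are then ranked by frequency, and that ranking is compared against
# the ranking from the external data, as in Table 5.
ranked = sorted(freq, key=freq.get, reverse=True)
```

Normalizing by total words (rather than using absolute hits) is what keeps the CDB ranking from simply tracking how much text each category happens to contain.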