Creating and exploiting aggregate information from text using a novel Categorized Document Base
Page 1 of 93
ABSTRACT
Document classification has long been a popular field of research in information retrieval.
Classification of documents is typically used to aid in faster location of relevant documents by end
users. In this paper, we present a method for constructing a Categorized Document Base (CDB) and
assess whether the categorization of a document collection in this manner can be helpful for another
purpose: understanding of, and comparison of, the categories in the document collection, through the
use of aggregate statistics from the documents in each category. Our experimental results indicate that,
for a convenience sample of properties for which we were able to obtain independent assessments of their values, there is a relatively clear association – in non-population-sensitive industries – between what the CDB-based methods produce and what the independent data indicates. This is evidence that CDB-based methods could be useful for gauging non-population-sensitive properties for which independent data does not exist.
Key words
Text Mining, Information Retrieval, Categorization
1. INTRODUCTION
Modern search engines have demonstrated their ability to retrieve and rank documents relevant to a
given search term and are well suited to finding documents relating to different topics. However, what
if an end-user would like to compare topics instead of merely retrieving a document? It is relatively straightforward to take a categorization scheme, and pass each category (topic) in the scheme as a search term to a search engine, in order to gather a certain number of relevant documents for hundreds or even
thousands of categories. Once the few most relevant documents for each category have been obtained,
categories (topics) can be compared by computing aggregates on the documents for each category –
such as hit counts or relative frequencies for a specific word or phrase in the documents for each
category. But, is this information sensible? Will it correlate with what is known about those categories
(topics) from other data sources? Let us take a specific example: Can we reliably determine, from text
documents for multiple locations gathered and analyzed in the fashion described above, how those
locations compare with regard to, for instance, environmental characteristics (sunshine, warmth, natural
resources like oil and coal, existence of mountains, forests, or fishing) or market characteristics (demand
for different products and services across those locations)?
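The comparison described above can be sketched in a few lines of code. This is an illustrative sketch only, not the authors' implementation: `fetch_top_documents` is a hypothetical placeholder for whatever search engine or API is used to retrieve the most relevant documents per category.

```python
def fetch_top_documents(category: str, n: int = 10) -> list[str]:
    """Hypothetical stand-in for a real search API: return the full text
    of the n most relevant documents for `category`."""
    raise NotImplementedError("wire this up to a search engine of your choice")

def phrase_hits(documents: list[str], phrase: str) -> int:
    """Count total case-insensitive occurrences of `phrase` across documents."""
    p = phrase.lower()
    return sum(doc.lower().count(p) for doc in documents)

def compare_categories(categories: list[str], phrase: str, n: int = 10):
    """Rank categories (topics) by hit count for `phrase` in their top-n documents."""
    scores = [(c, phrase_hits(fetch_top_documents(c, n), phrase)) for c in categories]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

The essential point is that each category is scored by an aggregate over its own small set of highly relevant documents, so categories can be compared directly.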
Our goal in this paper is to determine if the analysis obtained from unstructured text documents
using the above-described means is comparable in quality to data for those locations obtained from
conventional structured sources, such as traditional environmental surveys or market statistics. This
paper posits that it is indeed possible to obtain comparative information on locations by employing
search engines in the simple but unusual manner described above. Specifically, our objective is to test
the following hypothesis:
Hypothesis H1: aggregate information on various natural and market phenomena,
extracted from text documents for United States locations in the fashion described above,
can provide better-than-random rankings of those locations based on the environmental
or market characteristics of those locations.
As shown in the Experimental Results (Section 6) we find some support for this hypothesis in
industries which are not sensitive to population – in these industries the Categorized Document Base
(CDB) is able to fruitfully compare locations that diverge widely on some environmental or market
characteristic. However, we find the hypothesis is not supported for population-sensitive industries, where the CDB is unable to discern subtle per-capita distinctions in market characteristics by location.
This paper, then, describes a novel method for creating and exploiting aggregate information from
CDBs, and assesses the quality of the information so produced. In a CDB, document sets are organized
by topic. CDBs occupy a useful middle ground between highly structured (e.g. tabular) data, and
completely unstructured (textual) data, by introducing some order and organization into the document
set. We aim to show that aggregate information can be derived from categorized document sets, and
that this information is valuable.
We begin with a motivation for this research, and a discussion of related work. We show how our
CDB approach differs from prior work in document classification: in particular, the prior art focuses on
the use of categorization to narrow down search results, whereas our CDB approach is targeted at
aggregate analyses. Next, we describe the process for creating and querying a CDB: from creation and
population of the classification scheme, and construction of the search term, to tabulation of the results
for the sub- and super-categories, and integration with other external data for the categories. We
document the technical implementation of our CDB prototype, including the system architecture and
data structure. To validate our approach for the creation and analysis of categorized document
collections, we conduct experiments in a variety of industries. In these experiments, aggregate
information is generated from the CDB, categorized by location. Our experiments reveal that the CDB
can act as a rough instrument for discerning differences between locations: for industries where the
differences between locations are substantial, the aggregates computed by the CDB for each location are
statistically correlated with quantitative measures that we found for related natural and commercial
phenomena in those industries, from traditional structured data sources. In contrast, for industries with
subtle differences between locations, the CDB does not appear to be able to discern differences between
locations. Following our experiments, we discuss a number of useful applications of the CDB, and
conclude with limitations of the approach, areas for future work, and a summary of our process and
results.
2. MOTIVATION
Modern search engines produce a large number of relevant results for a given search term [111].
Unfortunately, comparing the results for different search terms – for example, comparing the results for
different categories in a taxonomy – is difficult. To compare the categories, users must consult hundreds of documents and soon suffer information overload: they quickly reach a futility point
[11] beyond which they will not search. Current research aimed at categorizing documents or search
results typically has as its goal narrowing down from a large collection of documents, to progressively
smaller collections, until a single relevant document or set of documents can be found [5, 15,
34, 47, 73, 74, 79, 80, 81, 136, 141, 143]. In contrast, the motivation for the current research is to
provide aggregations of search results across multiple categories, to facilitate comparison of the
categories themselves. Our goal is to allow the user to evaluate and compare categories using
aggregate statistics, rather than the conventional goal of allowing users to find a relevant document
more rapidly. Effectively, we are looking to create new summary information for categories by
categorizing, ranking, and filtering millions of documents, rather than simply trying to find facts already
explicitly encoded in single documents. The summary information allows a management user to
determine the prevalence of a search phrase in millions of relevant documents for thousands of
industries, products, places, time periods, or other categories. This allows the manager to rank
industries, products, places, or topics, which, we suggest, could be helpful, for instance, for devising a
market rollout strategy. Of course, the analysis of documents from multiple categories is only useful if
the comparison it yields of the categories is a fair reflection of reality. Our goal is to test the validity of
the comparison generated from unstructured text, and to determine, using external evidence, whether the
comparison is a sensible and viable proxy for other available comparative data on the categories1.
3. RELATED WORK
Manual and automatic categorization of documents using classification schemes is a well studied field
of research [70]. In this section, we provide a thorough survey of the existing literature on document
classification, and we explain how our work relates to the document classification field.
Basic approaches to document classification involve manual tagging by humans: for example
Yahoo! [82] and domain-specific subject gateways [134], both of which are quality-controlled document collections, organized by subject.
1 The reader may be curious as to why compilation of aggregate data from text is useful if similar data is already available in structured sources. There are numerous reasons why the ability to glean the information from text is helpful:
a) Cost: Aggregate information from text may be cheaper to generate than using alternative sources for that information (e.g. gleaning data from text on locations may be cheaper than conducting geological surveys on those locations – for natural data – or market or sociological research on those locations – for commercial or social data).
b) Time: Aggregate information from text may be more current if the text is current (e.g. alternative sources, typically constructed by manual human market research, may be years out of date).
c) Breadth: Aggregate information from text may provide data on a broad range of phenomena that have not yet been the subject of manual surveys. E.g. while the economic effects of federal stimulus money on various locations can be readily assessed from tax return data, social effects are more difficult to rapidly and cheaply observe and would traditionally require manual surveys that may be prohibitively costly. If it can be shown that aggregate information from text is plausible, textual sources may more readily be consulted to assess such social effects, particularly in cases where manual surveys of a certain social phenomenon do not exist.
Numerous document classification approaches use automatic
classification by machine. Automatic text categorization research is many decades old [8, 9], and
dozens of text categorization and topic identification approaches have been suggested in the literature.
Automatic text categorization approaches can broadly be broken down into unsupervised and supervised
approaches:
3.1 Unsupervised automatic document classification
Unsupervised approaches for document classification include, inter alia, the use of Kohonen
self-organizing feature maps or neural nets to cluster documents [26, 44, 45, 56], hierarchical
clustering [23, 66, 133], Cluster Abstraction Models [59], and Suffix Tree Clustering based on
shared phrases [18, 140]. These approaches are termed ‘unsupervised’ as the algorithms are not
trained using a historic set of correctly classified documents but, rather, learn by finding
similarities within documents, and clustering documents by similarity.
3.2 Supervised automatic document classification
In supervised approaches a small number of documents are manually tagged with a limited set of
category identifiers. A machine learning approach is then used to infer general patterns so that
unseen documents can automatically be assigned to the categories. In Neural Network
approaches [27, 32, 88, 94, 113, 129, 135] a training set of documents and their categories is
used to adjust the arc weights in a graph of nodes composed of mathematical functions, until a
desired classification accuracy is obtained. Bayesian approaches [9, 21, 61, 69, 78, 80, 83, 100,
124, 130] make use of conditional probabilities to determine the likelihood of a document being
in a class given certain features of that document. In K-Nearest-Neighbor approaches [19, 58,
49, 50, 80, 84, 98] a document is compared to a set of pre-classified documents, and its class is
taken as the majority class of the few most similar documents. With rule induction approaches
[1, 25, 76, 89] a set of rules is induced from a historic set of documents pre-tagged with their
categories, and the set of rules is then used to classify unseen documents. The induced rules
have the form “if (x AND y AND z AND …) then (document is in category P)”. Decision trees
[2, 19, 90] proceed similarly, except individual rule clauses – that is, simple condition
evaluations – are applied in turn, starting with a root clause, and proceeding with different
clauses in a branching fashion depending on the outcome of each condition evaluation. With
Support Vector Machines approaches [65, 85], vectors of n document features are defined, and
the hyperplane that provides maximal separation in the n-dimensional space is used to divide the
documents into separate categories. Genetic programming approaches [28, 126] start with
preliminary candidate classification rules – the best candidate rules are identified, and combined,
to produce successive generations of improved rules. Some automatic document classification
approaches employ Rocchio Relevance Feedback [64, 80] – these are based on Rocchio’s
algorithm [110] which attempts to find a query vector that maximizes the similarity of
documents in the same class, while minimizing their similarity to (i.e. maximizing the distance
from) documents in the alternate class. Miscellaneous other automatic document classification
approaches also exist: see, for example [67, 75, 91, 92, 131]. For comparisons and surveys of
text categorization approaches, see [47, 86, 117, 118, 137, 138, 139], and for a list of some
software implementations see [43].
Automatic document classification approaches typically make their inferences using internal
document content, external information, or user behaviors. Mechanisms that use document content [26]
may make use of key words, word strings / n-grams [54], linguistic phrases [106, 107, 108, 109], word-
clusters [124], or multi-word features [101] within the document. In contrast, some schemes use
external information in hyperlinked documents that point to the target document [40]. Finally, some
approaches make inferences from search logs and/or behavioral patterns of users searching and
accessing the document set [22] or retain user preferences [131, 132] when categorizing documents.
Modern search engines, such as Google, employ a variety of means – including internal document
content [7], external information [7], and user behaviors [142] – to determine whether a document
should be classed in a particular result set, and how it should be ranked.
By employing modern search engines to construct our CDBs (see Section 4.2), the CDB
construction approach presented in this paper benefits from the state-of-the-art composite classification
algorithms that are effectively available by sending search terms for multiple categories to Google,
Yahoo, or any other search engine.
Many practical software implementations of automatic classification systems exist. GERHARD
(German Harvest Automated Retrieval and Directory) automatically classifies web content according to
the Universal Decimal Classification (UDC) scheme [97]. WebDoc [130] uses a naïve Bayes approach
to assign documents to the Library of Congress classification scheme. Scorpion automatically assigns
Dewey Decimal Classifications or Library of Congress Classifications to documents [48, 120, 128].
Commercial text mining software, such as SAS Text Miner2, clusters documents into categories by
extracting key terms from each document. Commercial document content mining tools – often known
as customer experience intelligence software products – such as Aubice3, Clarabridge4, IslandData5,
QL2 MarketVoice6, and similar products, are targeted at monitoring market trends, customer feedback,
or potential fraud, by scouring, categorizing, and summarizing web documents. Larkey and Croft’s
system [80] uses a variety of techniques to automatically assign ICD9 codes (i.e. category of diagnosis)
2 http://www.sas.com/technologies/analytics/datamining/textminer/
3 http://www.aubice.com/
4 http://www.clarabridge.com/
5 http://www.islanddata.com/
6 http://www.ql2.com/
to dictated inpatient discharge summaries. Multiple systems for automatically categorizing news stories
exist [10, 57]. Though the foregoing list is not exhaustive, it demonstrates that automatic document
classification systems have enjoyed broad application.
A number of authors [5, 15, 28, 34, 47, 73, 74, 79, 80, 81, 131, 136, 141, 143] have proposed
techniques for clustering search results by category. The intention is typically to disambiguate different
senses or uses of the search term, or to simply visualize common topics in the search results, to facilitate
narrowing in on the correct document subset. For instance, “wind” results may be separated into
categories like:
“Weather” (containing documents like “A guide to wind speeds”)
“Movies” (containing documents like “Gone with the Wind”)
“Energy” (containing documents like “Wind Energy”)
Some approaches involve the post-processing of search results, so as to organize them by category,
to support more rapid navigation to the desired search results [5, 24, 140]. For example, commercial
search engines such as Northern Light7 and Vivisimo.com / Clusty.com, use post-processing to
categorize search results. In other cases, the document base is organized by category before conducting
the search (e.g. [105]).
In the field of Faceted Search [51, 52, 60] (also known as View-Based Search [104], Exploratory
Search [95, 114] or Hierarchical Search [39]) objects are catalogued, by annotating them with attribute
values or classifications, to facilitate navigation through or exploration of the catalogue. Objects in the
catalogue may be documents, web pages, products (e.g. consumer products), books, artworks, images,
people, houses, or other items. For example, the Flamenco faceted search engine demonstrates use of
7 nlresearch.com, US Patent No. 5,924,090
faceted search to browse catalogues of recipes [51], architectural images [52], Nobel prizewinners8, and
other domain-specific catalogues, with the ability to drill down by category (e.g. recipes may be divided
by ingredient, such as celery, onion, or potato). In a faceted search system, annotations may be
manually assigned or automatically assigned using information extraction techniques. The annotations,
or tags, allow the objects in the catalogue to be progressively filtered, to assist the user in finding a
particular type of item, and also allow the user to easily determine how many items of a certain type
exist in the catalogue. For example, a catalogue of products may be annotated to allow the user to find
all products of a certain brand, and within a certain price range. Faceted search techniques are very
useful for navigating a catalogue of a certain type of item through progressive filtering.
The mechanism we propose in this paper differs in a number of ways from the prior art. Firstly, we
gather documents into categories by obtaining the top results, by relevance, for each category, from any
existing search tool (e.g. Google, Yahoo, etc.). The authors currently have a patent pending on this
CDB construction and usage approach9. Secondly, our method is focused on exploration of the
aggregate statistics obtained from documents organized in a classification scheme, and not on searching
for a particular document. Unlike faceted search techniques, the CDB described in this paper is focused
on comparing categories using different metrics, rather than focusing on finding a particular type of
item by category, or on counting the number of items in each category. In contrast to faceted search
systems, where every category is populated with a certain type of item, CDBs populate categories with
the most relevant documents for those categories.
The contrast between faceted search techniques and the CDBs we describe in this paper is best
illustrated with an example. Consider a user wishing to compare US locations to see where fishing is
popular. Faceted search engines that are used to catalogue artworks, people, houses, or books are
8 http://flamenco.berkeley.edu/demos.html (Accessed on 13 March 2009).
9 United States Patent Application 20070106662 "Categorized Document Bases".
obviously not relevant, so the user looks for a faceted search engine that provides a catalogue of web
pages or images. The user types ‘fishing’ as the search term, but the results are categorized according to
classification schemes mandated by the faceted search engine. Figure 1 shows the results produced, for
example, by searching for ‘fishing’ on clusty.com. Figure 2 shows the results of the same search
performed using an alternative search results visualization engine, Kartoo.com. As can be seen from
both figures – which are indicative of the nature of output produced by search engines able to cluster or
categorize search results – information about fishing locations is haphazard and locations cannot be
compared. The user attempts another search ‘US locations fishing’. Figure 3 shows the results
produced by clusty.com10. Again, the result categories do not allow US locations to be compared to see
where fishing is popular. What the user really needs is, first, a dynamically constructed catalogue of US
locations, and then the ability to compare these locations for fishing. This is exactly what a CDB
provides: the ability to first generate an appropriate catalogue, and then allow comparison of categories
using an additional search term. In our approach, the user begins by obtaining a classification scheme
of US locations from the United States Geographic Names Service, and then feeds the classification
scheme to the CDB. The CDB populates each category (US location) with data, and the user is then
able to run a search term (‘fishing’) against the categories. Figure 4 shows the results produced by the
CDB. The US locations have been grouped by state, and the states have then been ranked by the
number of hits per state. The user can drill-down to see the number of hits per US location by clicking
on a US state. The report allows for easy comparison of US locations by fishing popularity11. In short,
CDBs allow the user to catalogue a type of item of their own choosing (by feeding a classification
scheme to the CDB), whereas faceted search engines catalogue a certain type of item pre-chosen by the
10 Kartoo’s results, not shown for brevity, are similarly haphazard.
11 For now, the reader can assume the results produced are sensible – in §6 EXPERIMENTAL EVALUATION we evaluate the validity of the results produced by the CDB, and find that the CDB produces results that are correlated with United States Fish and Wildlife Service data on fishing popularity in those states.
developer of the catalogue. The CDB presents a new way of creating faceted search engines. By using
the taxonomies themselves in creating the CDBs we introduce a level of flexibility and scope of
application that is materially beyond that of present systems.
As should be evident from the discussion above, the CDB’s results are aggregates for various
categories, and are not record-oriented. For instance, for the term “wind power”, a traditional search
engine would lead the user to the document “Wind power for the home”. In contrast, a CDB
exploration would suggest that “wind power” is a greater concern in Idaho than it is in Mississippi, as it
is more prevalent in documents in the Idaho category than it is in documents in the Mississippi category.
For a thorough comparison of aggregate versus record-oriented document search results, see [31]. In
this paper we aim to show that useful category-comparative aggregate information, heretofore
unobtainable, can be delivered by an appropriately constructed CDB.
Our work can, in some sense, be seen as partially analogous to work in the field of Online Analytic
Processing (OLAP) [33, 42]. In OLAP, structured data is organized by category and aggregated: for
instance, to find total sales by product, customers by city, or complaints by franchise outlet. OLAP
technologies are widely deployed in practice – examples include Microsoft Excel PivotTables and
PivotCharts, Microsoft Access CrossTab reports, SAS JMP Overlay Charts, and similar aggregate charts
in special purpose reporting tools like Crystal Reports, Cognos, Oracle OLAP, Hyperion Analyzer (now
part of Oracle), Microsoft Data Analyzer, and other products. Rather than providing tabular and
graphical aggregates of structured (tabular) data, CDBs produce category-by-category aggregate
statistics from document collections. Producing aggregate statistics from unstructured text data is not
new. For example, a pioneering system by Spangler et al. [6, 53, 121, 122], computes aggregate
statistics for various categories and features of documents in a document collection in order to provide
exploratory insights into the constitution of the document collection, and in order to infer relationships
between categories found in the document collection. Their process, which is implemented in IBM
Unstructured Information Modeler12, involves first finding a subset of documents to analyze (e.g.
problem tickets relating to a particular brand). Summary statistics – word and phrase occurrence counts
– are then computed for each document in the collection. The documents are clustered using a variation
of the k-means algorithm, and one or more user-editable taxonomies are created which categorize the
document set. Finally, the relationships between various categories are analyzed to determine
correlations – for example to determine whether a particular call center representative has substantially
more unresolved tickets than their colleagues, or to determine whether particular product complaints are
associated with a particular reseller. The process used to construct CDBs is markedly different. CDBs
are constructed by populating each category in a user-supplied taxonomy with the n most relevant
documents for that category. The categories can then be ranked based on the number of hits in each
category for a user-specified search phrase. CDBs allow users to rank categories by hit-count or term
frequency for a chosen phrase, using a limited set of only the most highly relevant documents for each
category as source data. Using only the most relevant documents from each category allows the user to
compare categories against each other, using an arbitrary comparison metric chosen by the user (e.g. a
user might compare the relative frequency of a particular word or phrase, across categories).
12 http://www.alphaworks.ibm.com/tech/uimodeler
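The ranking step described above – comparing categories by the relative frequency of a user-chosen phrase across each category's most relevant documents – can be sketched as follows. This is a minimal illustration under the assumption that the CDB is held as a mapping from category name to document texts; it is not the authors' implementation.

```python
import re

def relative_frequency(documents: list[str], phrase: str) -> float:
    """Occurrences of `phrase` per 1,000 words across a category's documents."""
    total_words = sum(len(doc.split()) for doc in documents)
    if total_words == 0:
        return 0.0
    hits = sum(len(re.findall(re.escape(phrase), doc, re.IGNORECASE))
               for doc in documents)
    return 1000.0 * hits / total_words

def rank_categories(cdb: dict[str, list[str]], phrase: str):
    """Rank the categories of a CDB {category: [document text, ...]}
    by relative frequency of the comparison phrase."""
    return sorted(((c, relative_frequency(docs, phrase)) for c, docs in cdb.items()),
                  key=lambda t: t[1], reverse=True)
```

Because each category holds only its few most relevant documents, the metric compares categories on a roughly equal footing rather than rewarding categories that simply have more documents.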
4. PROCESS
The process of constructing and using a Categorized Document Base can be separated into the following
phases:
1. Creating a classification scheme
2. Populating the classification scheme with documents
3. Creating an aggregate search term (comparison metric)
4. Determining the aggregate results for each category
5. Integrating the aggregate results per category with external data for the categories
6. Collaborative annotation of CDB results, if users in a team wish to share with each other their comments on particular categories of interest they have found
We discuss each of these phases in depth in the following subsections:
4.1 Creating a classification scheme
Classification schemes are often referred to as hierarchies, taxonomies, or coding schemes. We
will use these terms interchangeably in this paper. A classification scheme can be created by
importing any existing taxonomy, or defining a new taxonomy13. Table 1 provides examples of
some popular classification schemes. The administrator of the CDB may use one of the
provided classification schemes, or may import or create a proprietary classification scheme,
such as a product hierarchy from a product catalogue. Use of a standard classification scheme is
helpful as other data providers typically provide statistics using codings from such classification
schemes, and these statistics can be easily integrated with category-specific aggregates from the
CDB (see Section 4.5 later). For example, profit margins by industry, coded according to the NAICS classification, are provided by the United States Internal Revenue Service, and could be integrated with aggregate document statistics for each industry from the CDB. Popular taxonomies can be organized by type (e.g. product hierarchies, industry hierarchies, place hierarchies, activity hierarchies, time hierarchies, …) to aid in targeted exploration (see Table 1).
13 Though, as we will see later (§4.2), certain taxonomies are more amenable to analysis by a CDB – specifically, taxonomies with relatively unambiguous category descriptors are better suited to analysis using a CDB – while others are, at the present time, not.
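A classification scheme of the kind described in this phase is simply a tree of named categories. The sketch below, with illustrative (not NAICS-accurate) labels, shows one minimal way such a scheme could be represented and traversed when feeding categories to the CDB.

```python
class Category:
    """A node in a classification scheme (taxonomy): a name plus sub-categories."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def walk(self):
        """Yield every category in the scheme, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

# An illustrative industry hierarchy (labels are examples only):
scheme = Category("Industries", [
    Category("Agriculture, Forestry and Fishing", [
        Category("Fishing"),
        Category("Forestry"),
    ]),
    Category("Manufacturing"),
])
```

Iterating `scheme.walk()` visits every category exactly once, which is the iteration order needed in the population phase (§4.2).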
4.2 Populating the classification scheme with documents
The Categorized Document Base is created by populating the classification scheme with relevant
documents. Our method involves iterating through every category in the classification scheme,
and providing the category name, or other identifying features of the category14, as a search term
to a standard Information Retrieval tool or search engine (such as Yahoo, Google, Lycos,
Medline, Lexis/Nexis, etc.). The search engine returns a list of matching documents, and the
most relevant n documents returned by the chosen search engine are stored under that category
in the CDB. The full text of each document is stored. Note that a document may be assigned to
more than one category, if it is relevant to more than one category. However, search engine
rankings usually ensure that the documents that are most relevant to a given category do not
appear in the top results for any sibling category.

14 As category names may be ambiguous, it is preferable that additional identifying features of the category be specified, or some other
mechanism be employed to obtain documents for the correct category. For example, the place “Reading” in Pennsylvania, in a
location taxonomy, is different from the category “reading” in an activity taxonomy, so documents from these two categories should
not be mixed. Supplying the search term “Reading” to, for instance, Yahoo, would typically return documents from both categories,
whereas the CDB should only store documents relevant to the place “Reading” when populating documents for “Reading,
Pennsylvania” into the CDB. One method for disambiguation is to append the parent category name to the child category name: for
example, “reading” becomes “Reading, Pennsylvania” or “reading activity”. Capitalization may be used to distinguish a proper noun
from a common noun, and using a quoted phrase to ensure co-occurrence of the city and state names reduces ambiguity further. Also,
because modern search engines often consult recent search history for the purposes of results personalization and query
disambiguation (see Google’s United States Patent Application 20050222989), the results returned to the CDB are likely to be further
disambiguated as the CDB repeatedly requests search results for similar categories. A vast number of other disambiguation
techniques exist [3, 62, 116] and can be used.

Figure 5 shows schematically how categories
in a taxonomy are populated with documents to create the CDB: in this example, a taxonomy of
markets (industries) is populated with the most relevant documents for each industry.
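The population step described above can be sketched as follows. The `search_engine` callable and the toy two-document corpus are stand-ins of our own devising, purely for illustration; a real deployment would call an Information Retrieval tool such as Yahoo or Google:

```python
# Sketch: populate each category of a classification scheme with the n most
# relevant documents returned by a search engine. The toy_search function and
# TOY_CORPUS below are hypothetical stand-ins for a real search engine.

def populate_cdb(categories, search_engine, n=10):
    """Store the n most relevant documents (full text) under each category."""
    cdb = {}
    for category in categories:
        hits = search_engine(category, n)   # ranked list of documents
        cdb[category] = hits[:n]            # a document may appear under several categories
    return cdb

def toy_search(query, n, corpus=None):
    """Toy stand-in for a search engine: rank documents by query term frequency."""
    corpus = corpus if corpus is not None else TOY_CORPUS
    ranked = sorted(corpus, key=lambda doc: -doc.lower().count(query.lower()))
    return ranked[:n]

TOY_CORPUS = [
    "Blacksburg Virginia is home to Virginia Tech. Blacksburg hosts many events.",
    "Philadelphia Pennsylvania is the largest city in Pennsylvania.",
]

cdb = populate_cdb(["Blacksburg", "Philadelphia"], toy_search, n=1)
```

Each category ends up holding only the top-ranked documents for that category, which is what allows the aggregate statistics described later to be meaningful per category.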
The reader may be concerned that web-pages are long and information on multiple
categories may co-occur within the same web-page, potentially leading to the page being
multiply classified, and requiring the use of short snippets of text from each page to ensure that
the text in the document set for a category relates only to that category. While this concern is
reasonable, it should be noted that, by nature, our process populates each category only with the
few most relevant documents for that category, from a search engine. Each document in that
category should be wholly (or at least predominantly) relevant to the category provided the
category name is unambiguous and sufficient content exists on the internet for that category to
allow the search engine to easily retrieve highly relevant documents for that category. Consider
an algorithm like Google’s PageRank [7], retrieving documents on the category “Blacksburg,
Virginia”. Documents that have content on Blacksburg, Virginia alone and have inbound
hyperlinks referring to Blacksburg, Virginia alone will have a higher rank than documents that also
discuss Philadelphia, Pennsylvania or have inbound hyperlinks from pages about Philadelphia, Pennsylvania.
Thus, by using only the top-ranked search engine hits for each category, our process typically
ensures that each document for that category predominantly refers to that single category, and
therefore the use of snippets is ordinarily not required.
Having said that, we must qualify our explanation by pointing out that a requirement of a
well-constituted CDB is that categories are unambiguous and sufficient documents exist for even
obscure categories. This is because, when either the category name is ambiguous, or insufficient
documents exist for obscure categories, the documents in the document base for a category could
contain irrelevant information, or information for multiple categories. Consider the United
States Patents and Trademarks Office (USPTO) patent classification scheme, which contains
approximately 155,000 classes and subclasses. Our analysis of the USPTO scheme revealed
approximately 2,650 ambiguous categories: sub-categories of the same name that occurred under
different parent categories. Major offenders were “CLASS-RELATED FOREIGN
DOCUMENTS”, “MISCELLANEOUS”, “PLURAL”, “ADJUSTABLE”, “PROCESSES”, and
dozens of others, each of which appeared under multiple parent categories. Clearly, these
category names would have to be disambiguated in order to intelligently populate the category
only with relevant documents for the correct sub-category. For obscure categories, the content
available on the specific category is sufficiently sparse that only a few documents mention the
category, and those that do also mention other sibling categories. In the USPTO scheme, which
contains dozens of obscure categories (e.g. “Using mechanically actuated vibrators with pick-up
means”), a general-purpose search engine such as Yahoo is unable to find documents highly
relevant to the specific category and yields, instead, pages containing excerpts from the USPTO
scheme itself, mentioning dozens of other USPTO categories, and not particularly relevant to
any one USPTO category. It can be concluded that use of the USPTO scheme with a general-
purpose search engine, such as Yahoo, will not yield a well-constituted CDB, as the USPTO
scheme does not supply unambiguous category descriptors, and, further, a general-purpose
search engine will not find sufficient relevant documents for obscure USPTO categories.
Turning to another classification scheme, the United States General Services Administration
Geographic Locator Codes (US GSA GLC) list of locations, we found the problems encountered
above with the USPTO scheme to be surmountable. While many ambiguous category names
exist in the US GSA GLC list of locations – for example, there are cities named “Philadelphia”
in PA, IL, MO, MS, NY, and TN – this was easily rectified by simply appending the state name
to the location name in order to disambiguate. Furthermore, a small sampling of various obscure
towns indicated that sufficient pages exist on the internet (e.g. local tourism pages for those
towns) to yield highly relevant results when we requested the top few hits for the specific town
and state on a general-purpose search engine such as Yahoo. We therefore proceeded with the
use of the location classification scheme as a viable taxonomy on which to conduct further
experiments to more fully assess CDB usefulness and robustness. As mentioned, we were
careful to always include the state name with the city or town name, when populating each
category, to reduce ambiguity.
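The appending approach just described can be illustrated with a small helper (the function name and the example city/state pairs are ours, for illustration only):

```python
# Sketch: disambiguate a child category by appending its parent category name,
# quoting the phrase so the search engine requires co-occurrence.

def disambiguated_term(child, parent):
    """Quoted phrase combining child and parent category names, so that e.g.
    the city 'Philadelphia' in Mississippi is not confused with the one in PA."""
    return '"%s, %s"' % (child, parent)

terms = [disambiguated_term(city, state)
         for city, state in [("Philadelphia", "Mississippi"),
                             ("Reading", "Pennsylvania")]]
```

Each resulting phrase can then be passed directly to the search engine when populating that category.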
As we have seen, in the absence of sufficient content relevant to a specific category (as in the
USPTO case described above), general purpose search engines cannot populate each category
with wholly relevant documents. Furthermore, it can in certain circumstances be a challenge
(again as in the USPTO case), to provide unambiguous search phrases for each category,
especially when tens of thousands of categories need to be populated. This can result in
irrelevant documents being included in the document set for a category. A human user might
consider removing documents from categories, editing the documents, or reassigning them to
different categories, if they believe the document was incorrectly assigned, is from an unreliable
source, is inaccurate, or inappropriate to that category for some other reason. For example, a
user may remove document 6780.html from the “Chicago” sub-category of “Illinois”, if they
notice the document pertains to the movie “Chicago”, rather than to the place. In another
example, a user may notice that document 13581.html has been assigned to both categories
“Tilapia” and “Trout”. On further investigation, the user realizes that the document is merely an
alphabetic grouping of fish, and therefore contains content relevant to more than one type of
fish. The user could decide to remove the document from the CDB, and replace it with two
edited versions – one that refers only to tilapia, and another that refers only to trout – so that the
trout-related content of the document doesn’t impact the “Tilapia” category, and vice versa.
Similarly, the user may extract advertisements or sponsor links from a web-page document, if
the user believes these items introduce content not relevant to the category.
While human editing of categories is permissible, it has its drawbacks. Firstly, it is
extremely time-intensive, particularly when thousands of categories must be edited. Secondly, if
not done in a systematic and consistent fashion, different biases can be introduced into different
categories in the document set. We therefore do not advise human editing of categories. Rather
we recommend that, prior to any taxonomy being fed to the CDB, the taxonomy first be
assessed, and any taxonomy with ambiguous or obscure categories be shunned, to avoid the need
for manual human editing. In our experiments (Section 6), we avoided the need for human
editing by using a classification scheme that could be unambiguously populated to a satisfactory
extent without any manual intervention. Specifically, by automatically appending the state name
to the location name for every location, we could ensure that, for instance, results pertained to
the location “Chicago, Illinois”, rather than to the movie “Chicago”. Further, for the reasons
described above, we could be confident that the few top ranking documents for each category
were highly relevant to the category alone, did not contain mention of sibling categories, and
therefore did not need to be manually excerpted (‘snippeted’) nor edited for relevance.
4.3 Creating an aggregate search term (comparison metric)
The Categorized Document Base is queried by specifying a search term (which can be thought of
as a ‘comparison metric’), and optionally some additional parameters. The search term typically
consists of one or more words, and aggregate statistics computed for the search term in each
category of documents allow the user to compare categories. In a simple case, the user asks the
CDB to compile aggregate statistics for each category that show the relative frequency of the
search term in each category. Composite search terms can be created so that counts tally hits on
any of the terms, all of the terms, or only the exact phrase (i.e. the terms in that specific
sequence).
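The three composite-term modes can be sketched as follows, assuming plain text and case-insensitive whole-word matching (the mode names are our own):

```python
# Sketch: count hits for a composite search term in one document.
# mode="any"    : total occurrences of any of the terms
# mode="all"    : total occurrences, counted only if every term appears at least once
# mode="phrase" : occurrences of the terms as one exact phrase, in sequence
import re

def count_hits(text, terms, mode="any"):
    text_lower = text.lower()
    counts = [len(re.findall(r"\b%s\b" % re.escape(t.lower()), text_lower))
              for t in terms]
    if mode == "any":
        return sum(counts)
    if mode == "all":
        return sum(counts) if all(c > 0 for c in counts) else 0
    if mode == "phrase":
        phrase = " ".join(t.lower() for t in terms)
        return len(re.findall(r"\b%s\b" % re.escape(phrase), text_lower))
    raise ValueError("unknown mode: %r" % mode)
```

For example, `count_hits(doc, ["foam", "reduction"], "phrase")` counts only occurrences of the exact phrase "foam reduction".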
There are a variety of known ways to augment a search term to be used for purposes of
searching with an Information Retrieval engine [36]. These techniques are often known as query
expansion (or query augmentation). The expansion is typically intended to improve precision
and/or recall, by finding “hits” that do not match the literal search term. For instance, for the
search term “dry” a user may be offered the following expansions:
Synonyms: exsiccate, dehydrate, dry up, desiccate
Antonyms: wet, moisten, wash, dampen
Related words: desiccant, drier, drying agent, siccative
Stems / truncations (e.g. “dry”), and derived / inflected forms (e.g. “dried”, “drier”)
Troponyms / Hypernyms / Hyponyms: a troponym is a word that denotes a manner
of doing something – for example “dehydrate” is a manner of “drying”, so it may be
helpful to search on “dehydrate” when searching on “dry”. A hyponym is a word
which denotes a subclass of a superclass: for example, “freeze drier”, “vacuum drier”,
“spray drier”, and “oven” are all hyponyms of “drier” since they are all types of
drier. The word denoting the superclass (“drier”) is called a hypernym.
Meronyms / Holonyms: a meronym is a word that names a constituent part of a larger
item. The word for the larger item is called a holonym. E.g. “fan” is a meronym of
“oven” as a fan is a constituent part of an oven. Similarly, “oven” is a holonym of
“fan”. As fans may be used for drying, and fans are part of ovens, a user searching
on “drying” may also be interested in “fan” and “oven”.
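The expansion types above can be illustrated with a toy lexicon standing in for a real lexical database such as WordNet (the entries shown are illustrative and far from exhaustive):

```python
# Sketch: query expansion against a small hand-built lexicon. A real system
# would consult WordNet or a similar lexical database instead of LEXICON.

LEXICON = {
    "dry": {
        "synonyms": ["dehydrate", "desiccate", "exsiccate"],
        "related": ["desiccant", "drying agent", "siccative"],
        "inflected": ["dried", "drier", "drying"],
    },
}

def expand(term, kinds=("synonyms", "related", "inflected")):
    """Return the term plus any available expansions of the requested kinds."""
    entry = LEXICON.get(term, {})
    expansions = [term]
    for kind in kinds:
        expansions.extend(entry.get(kind, []))
    return expansions
```

The expanded term list can then be fed to the composite-term counting described in the previous subsection, so that hits on "dehydrate" also count toward a search on "dry".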
It is also desirable that the user select the specific sense of the word that they wish to search.
This is known as query disambiguation. For instance, ‘dry’ has, inter alia, the following
different senses: ‘lacking moisture’ (as in “dry clothes”), ‘ironic or wry’ (as in ‘dry humor’), or
‘having a large proportion of strong liquor’ (as in ‘dry martini’) [38]. In the absence of a sense-
sensitive search engine, a monosemous synonym – i.e. a synonymous word that has only a single
sense – can be chosen, provided the synonym is in sufficiently popular use. For example
‘dehydrate’ is preferable to ‘dry’, as the latter is highly polysemous. Though ‘dehydrate’ is less
commonly used than ‘dry’, it is still in sufficiently popular use that we can expect to regularly
find hits for it in a document collection. Compare to ‘siccative’ which is monosemous but rarely
used (seldom found in a given document collection), and may therefore not be appropriate as an
alternative search term to ‘dry’.
The user is able to perform both query expansion and query disambiguation in our prototype
via a run-time interaction we have provided with the WordNet lexical database [38]. Figure 6
shows a pop-up screen from our prototype software which allows the user to perform query
disambiguation or query expansion for their chosen search term using WordNet. As Figure 6
illustrates, the user is shown synonyms and related terms for their search term, to allow them to
choose a less ambiguous search term in the event that their chosen term is a highly ambiguous
term. For example, a user contemplating the use of the term “dry” (which has many senses –
e.g. “dry skin” vs “dry humor”), might instead choose to use the related term “desiccant” which
is less ambiguous. As mentioned before, a caveat, though, is that “desiccant” is comparatively
rare, and perhaps less likely to produce a significant number of hits if, for instance, document
authors prefer the term “drying agent” to “desiccant”. An ideal search term is one that is both
unambiguous and in common usage.
While we provide run-time, user-driven disambiguation and query expansion facilities via
WordNet, as shown in Figure 6, we do not currently prescribe nor provide any additional
automatic means for semantic-sense-sensitive (context sensitive) search, though many are
available: see, for example, the literature on Word Sense Disambiguation (WSD) [3, 62, 116].
In our experiments, reported in Section 6, we relied on appropriate word choice (i.e. choice of
monosemous and commonly-used words) by the end-user with computer-assistance, using our
integrated WordNet feature as illustrated in Figure 6, where necessary.
It might be suggested that a possible, though labor intensive, means of ensuring that only
documents pertaining to the correct sense of the category name are associated with that category,
is to have human readers, skilled in linguistics, manually remove documents that pertain to
homonyms (i.e. different senses of a word or phrase that share the same spelling). However,
manual intervention quickly becomes impractical, given the enormous number of documents in
the CDB, and use of our computer-assisted word choice facility (Figure 6), or supplemental
automatic word sense disambiguation (WSD) techniques, as suggested above, is preferable.
4.4 Determining the aggregate results for each category
To create aggregate statistics for all categories in the CDB, the search terms from the previous
section are compared to the documents in each category. Figure 7 shows the basic process: a
search term, in this case, “foam reduction”, is run against all documents in each category – in
this case, only the “Inks” sub-category (under the “Printing” category), has hits.
A number of basic statistics can be computed for every category in the classification scheme:
total number of hits (word / phrase hits) in top n documents for category
number of documents with one or more hits, amongst top n documents in that category
hits per thousand words (a.k.a. “relative term frequency”), for top n documents in that
category
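These three statistics might be computed for one category's document set roughly as follows (a sketch assuming plain-text documents, whitespace tokenization, and simple substring counting; a production system would tokenize properly):

```python
# Sketch: basic per-category statistics for a search term over the top-n
# documents stored under one category of the CDB.

def category_stats(docs, term):
    term = term.lower()
    total_hits = sum(d.lower().count(term) for d in docs)          # word/phrase hits
    docs_with_hits = sum(1 for d in docs if term in d.lower())     # documents with >= 1 hit
    total_words = sum(len(d.split()) for d in docs)
    per_thousand = 1000.0 * total_hits / total_words if total_words else 0.0
    return {"hits": total_hits,
            "docs_with_hits": docs_with_hits,
            "hits_per_thousand_words": per_thousand}
```

Running this once per category yields the per-category rows from which tables such as Table 2 can be assembled.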
Table 2 shows the absolute number of hits for the terms “network”, “monitoring”, “devices”,
etc., in the document categories “Minibuses”, “Busses”, “Automobiles and Cars”, etc., for a
sample CDB. Shading is used to indicate where the word occurs with unusually high or
unusually low frequency in the category.
More advanced aggregate statistics can also be created. For example, we can compute the
relative prevalence (a.k.a. lift) of the search term in that subcategory, as compared to similar
categories. A lift of 2 indicates that a word is two times as prevalent in the current category as it
is in other categories chosen for comparison – i.e. it is found two times as often as expected. A
lift of 1 indicates the word is as common as expected: its prevalence is the same in that category
as it is on average in the other categories chosen for comparison. A lift of ½ indicates the word
is half as common as expected. Lift is a useful indicator of interestingness [96]. For example,
common words like “small” may have a high number of absolute hits in a category, but this may
not be interesting, as the relative prevalence, when compared to other categories, may not be
significant or unusual, if the other categories also have a similar number of absolute hits for
“small”.
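Lift, as described, reduces to a simple ratio of the category's prevalence to the average prevalence across the comparison categories (a sketch; prevalences here would typically be hits per thousand words):

```python
# Sketch: relative prevalence (lift) of a search term in one category,
# compared to the average prevalence across a set of comparison categories.

def lift(category_prevalence, comparison_prevalences):
    expected = sum(comparison_prevalences) / len(comparison_prevalences)
    return category_prevalence / expected if expected else float("inf")
```

A result of 2 means the term is twice as prevalent as expected, 1 means as prevalent as expected, and 0.5 means half as prevalent as expected, matching the interpretation given above.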
Formal tests of statistical significance, such as chi-squared tests, can also be conducted to
determine whether the relative prevalence (difference of actual prevalence from expected) is
statistically significant. A category is ‘interesting’ or ‘unusual’ if it has significantly greater
prevalence of the search term than expected or, alternatively, if it has significantly lower
prevalence of the search term than expected. For example, documents in the category
Philadelphia may be interesting as “murder” is mentioned more frequently than in documents
pertaining to other cities in Pennsylvania. Similarly, documents in the category Erie may be
interesting as “murder” is mentioned comparatively less frequently than for other cities in
Pennsylvania.
Figure 8 shows the relative prevalence of the terms “smoothness”, “strength”, and [“wet” or
“damp”], in various segments of the stone quarrying industry. Three bars are shown for each
industry: from left to right, the three bars for that industry are “smoothness” for that industry,
“strength” for that industry, and [“wet” or “damp”] for that industry. As shown by the left-most
bar for each industry, “smoothness” is mentioned almost twice as frequently in Crushed and
Broken Limestone mining, compared to the other segments. “Strength” (the middle-bar for each
industry in the chart) is mentioned almost twice as frequently in Dimension stone mining
compared to the other stone quarrying industry segments. Finally, looking at the right-most bar
for each industry in the chart, we see that [“wet” or “damp”] is mentioned half as frequently in
Crushed and Broken Granite Mining as in the other segments. The baseline (i.e. average
absolute number of hits out of total words in the documents) in each category for each search
term is obviously relevant, since a moderate lift, off a low baseline (e.g. a baseline of one or two
absolute hits out of thousands of words in the documents), would not be statistically significant.
It is therefore important that chi-squared tests, of the relevant degree of freedom, be applied to
ascertain whether the lift is statistically significant given the baseline. In Figure 8, the various
baselines for each term are omitted for readability, but different shading is used to indicate lift
that is statistically significantly higher than expected (at the 95% confidence level), or statistically
significantly lower than expected (at the 95% confidence level).
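For a single term and category, the chi-squared test mentioned above can be set up as a 2x2 contingency table of hit words versus other words, inside versus outside the category. A self-contained sketch (compare the statistic against 3.841, the 95% critical value at one degree of freedom; a production system might instead call a statistics library):

```python
# Sketch: Pearson chi-squared statistic for a 2x2 table comparing a term's
# frequency in one category's documents against the remaining categories.

def chi_squared_2x2(hits_cat, words_cat, hits_rest, words_rest):
    table = [[hits_cat, words_cat - hits_cat],
             [hits_rest, words_rest - hits_rest]]
    row_totals = [sum(r) for r in table]
    col_totals = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    n = float(sum(row_totals))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

This also captures the baseline caveat in the text: a moderate lift off a tiny baseline produces a small statistic and is not flagged as significant.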
Aggregate statistics for any parent category (that is, a category that has sub-categories), can
be obtained either from:
1. the document collection for that parent category (e.g. a document collection obtained
by finding all documents relevant to parent category “Pennsylvania”), or from
2. aggregates of statistics from the document collections of its children (e.g. aggregates
of statistics from all documents relevant to child categories “Philadelphia”,
“Pittsburgh”, “Erie”, etc. which are children of the parent category “Pennsylvania”)
Both types of statistics are interesting, since the former is obtained from documents directly
related to the parent category, and the latter is obtained from documents which relate to
descendants (i.e. children, grandchildren, etc.) of that category.
Figure 10 shows a hierarchical drill-down view of hits per category, for the word
“biodegradable” across various categories in the United Nations Standard Products and Services
Code (UNSPSC)15.
The absolute number of hits can also be normalized or standardized in various ways. For
example, a large number of hits for “dogs” in Los Angeles CA, as compared to, say, Blacksburg
VA, is unsurprising, as Los Angeles CA has a substantially higher population. Normalizing the
hits, by dividing by the population size in this case to produce per-capita popularity, can provide
an alternative statistic for comparison.
15 http://www.unspsc.org/
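Per-capita normalization of hit counts is a one-line transformation (the city populations below are invented round numbers, for illustration only):

```python
# Sketch: normalize absolute hit counts by population, yielding hits per
# 100,000 residents, so large and small places can be compared fairly.

def per_capita_hits(hits_by_city, population_by_city, per=100_000):
    return {city: per * hits / population_by_city[city]
            for city, hits in hits_by_city.items()}

pc = per_capita_hits(
    {"Los Angeles, CA": 3900, "Blacksburg, VA": 45},      # hypothetical hit counts
    {"Los Angeles, CA": 3_900_000, "Blacksburg, VA": 45_000})  # hypothetical populations
```

In this contrived example the raw counts differ by a factor of nearly 100, but the per-capita figures are identical, illustrating why normalization can change the comparison entirely.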
Statistics can also be calculated for combinations of categories taken from different
classification schemes, in much the same way as On-Line Analytical Processing (OLAP)
statistics are calculated on structured data. For example, taking a CDB of documents
categorized by both Place and Time, we could, for instance, find the statistics for “New York,
September 2001” documents (i.e. documents that appear both under “New York” and under
“September 2001”). This set of documents would have high hits for “trade center”, compared to
documents for, say, “Boston, January 1992”, as September 2001 was the time of the World
Trade Center attacks in New York.
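Treating each classification scheme as a dimension, such a cross-scheme slice reduces to a set intersection over document identifiers (a sketch; the CDB stores full text, so identifiers stand in for documents here):

```python
# Sketch: OLAP-style slice over a CDB categorized along two dimensions
# (e.g. Place and Time): documents appearing under both categories.

def combined_category_docs(cdb_by_place, cdb_by_time, place, time_period):
    return set(cdb_by_place.get(place, [])) & set(cdb_by_time.get(time_period, []))

docs = combined_category_docs(
    {"New York": ["d1", "d2", "d3"]},           # hypothetical Place dimension
    {"September 2001": ["d2", "d3", "d4"]},     # hypothetical Time dimension
    "New York", "September 2001")
```

Aggregate statistics (hits, lift, and so on) can then be computed over the intersected document set exactly as for a single-dimension category.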
More sophisticated numerical scores for each category can also be computed, using the data
in the document base alone, or in conjunction with other data sources. In general, a numeric
score for a category is any quantitative measure that can be derived from the contents of (i.e.
documents in) that category, or from, or in combination with, a structured data source that
associates that category with some statistic (e.g. ‘population’ is a statistic for the category “Los
Angeles, CA”, that can be obtained from a structured data source).
For our software implementation, we initially implemented a Microsoft Excel interface,
which would download a comma-separated-value text file from the server farm, and allow the
results to be viewed graphically, using tree views and charts. The Excel interface included
intuitive expandable and collapsible tree-views of the various taxonomies, to allow easy
visualization of the aggregate statistics for each category in the CDB by end users. Figure 9
below shows the Excel-based interface – in this case the user, a molecular engineer from a
chemical company, is viewing the hits for the phrase “foam reduction” amongst a number of
product categories in a taxonomy of different product types, in an attempt to identify relevant
product applications for a new foam reducing surfactant she has developed. Categories which
have more hits than a defined threshold are shown shaded: in this case, the threshold is
arbitrarily defined as 2 hits per category. Our Excel interface was eventually retired, in favor of
a web-based interface (see Figure 10).
4.5 Integrating the aggregate results per category with external data for the categories:
A new kind of ‘mash up’
‘Mash ups’ are web-based services that weave data from different sources together, creating an
interesting and useful report from the synthesized data [14]. For example, data on coal reserves
by state from a public data source – such as the National Mining Association – can be integrated
with geographic data (e.g. from a mapping service like Google Maps) to create a visual map
showing which states have the highest coal reserves. Similarly, oil output for each state – for
example, obtained from the US Energy Information Administration – could be overlaid onto the
map to visually illustrate which states supply the most energy from fossil fuels.
Categorized Document Bases represent a new source of data on categories, and therefore
provide an additional source of data for mash-ups. A simple example of creating a mash-up
from a CDB would be to take the hits for ‘coal’ and for ‘oil’ in the document sets stored for
multiple states and integrate this with geographic data to create a visual map of how frequently
these terms are mentioned in the documents stored for each state.
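Such a mash-up is, at its core, a join on the shared category scheme. A sketch (the hit counts and reserve figures below are invented for illustration; a real mash-up would pull them from the CDB and a public data source respectively):

```python
# Sketch: join per-category CDB hit counts with an external structured data
# source keyed by the same category scheme (here, state names).

def mash_up(cdb_hits, external_stats):
    rows = []
    for category in sorted(set(cdb_hits) & set(external_stats)):
        row = {"category": category, "hits": cdb_hits[category]}
        row.update(external_stats[category])
        rows.append(row)
    return rows

report = mash_up(
    {"Wyoming": 120, "Texas": 80},                       # hypothetical 'coal' hits from the CDB
    {"Wyoming": {"coal_reserves": 42},                   # hypothetical external figures
     "Texas": {"coal_reserves": 12}})
```

The joined rows can then be handed to a charting or mapping component for visualization.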
Let us consider a more sophisticated example, which illustrates the mashing of data from
both an unstructured source (the documents in the CDB) and a structured source (a spreadsheet).
Assume we have a CDB populated with document sets for multiple industries. For market
research purposes, we may want to glean information on those industries from the CDB, and show
it alongside information on those industries from other sources. Figure 15 gives a simple
example of such a ‘mash up’: here, a molecular engineer has constructed a bubble-chart, using
our Excel-based prototype implementation16, to explore possible applications of a new
biodegradable compound her company has developed. In Figure 15, the Y-axis is the relative
prevalence (lift) for the search term “biodegradable” in each of the two industry categories
“Oilseed Processing” and “Plastics Packaging” – this data has been obtained from the CDB. As
is evident, Oilseed Processing has far greater relative prevalence for the search term. The asset
turnover and revenues for each industry were obtained from an external structured data source –
specifically, a spreadsheet obtained from the Internal Revenue Service (IRS) – and plotted as the
X-axis and bubble size respectively. From Figure 15 it appears that Oilseed Processing is a
relatively small industry, by revenue, compared to Plastics Packaging. Thus, while Oilseed
Processors are apparently very interested in biodegradable molecules (as shown by the high
relative prevalence of the term “biodegradable” in the document set for that category), sales into
that industry may not be lucrative, given its relatively small revenues. The ‘mash-up’ of
information from the CDB with information from the IRS has yielded thought-provoking
insights into the industries shown. Note that the CDB serves only as a useful heuristic for more
quickly finding possible solutions – further manual study is typically required to validate or
eliminate suggested solutions. In one of our commercial trials, molecular engineers and business
development managers at a chemical company considered 30 industries identified as most
promising by the CDB: the company was already operating in 7 of the identified industries; 12
industries were previously known, but neither the company nor its competitors operated in them,
as they had already been found to be unviable; 3 industries were previously unknown, but further
investigation showed they were infeasible; 3 industries were previously unknown and feasible,
but deemed not promising; and 5 industries were previously unknown and deemed highly
promising.

16 Excel was used as it allowed us to easily create bubble charts, and it also allowed us to easily integrate financial data for various industries with the aggregate statistics for those industries from the CDB.
The examples above illustrate that CDBs can be used to compose interesting ‘mash-ups’:
profound insights into the relationships between categories can potentially be illustrated by
showing aggregate hit results by category (from the unstructured text documents in the CDB)
alongside structured data (e.g. from databases or tabular text files) that are organized according
to the same coding scheme. Table 3 lists some examples of structured data, from both private
and public sources, that have been coded according to standard taxonomies mentioned earlier in
Table 1, and can therefore be integrated with the results of queries on the CDB. Given that we
are able to use these coding schemes to cross-reference CDB results for multiple categories with
existing structured data for those categories, abundant opportunities to create new mash-ups
exist.
4.6 Collaborative annotation of CDB results
In our trials with commercial organizations, our clients requested collaborative annotation
facilities, to allow business development managers and chemical engineers to share their
observations on categories of interest. We therefore rebuilt our visualization features, this time
using a web-based interface, and added collaborative annotation features to allow the technology
commercialization team to share their comments on interesting categories discovered by the
CDB. Figure 11 illustrates the collaborative annotation interface implemented for the CDB, and
shows users sharing comments on possible applications of a biodegradable molecule with foam
reduction properties. The CDB exploration and annotation software and interface were code-
named Sizatola, meaning “help find” in Zulu.
5. TECHNICAL IMPLEMENTATION
In this section we describe the system architecture and the data structure for our Categorized Document
Base.
5.1 CDB System Architecture
Due to the data volume of the Categorized Document Base, which exceeds the capacity of a
single machine, we implemented a parallel processing architecture for the CDB, allowing it to be
distributed across a number of machines. Partitioning of the CDB across machines greatly
increases the rate of both document gathering and aggregate statistics compilation. A controller-
servant architecture was employed. The controller machine maintained a list of categories,
including the timestamp at which population last started (if it had begun) and ended (if it had
ended) for that category. In order to maintain data freshness, the documents in the category
could be periodically refreshed, for example, every week. Individual servant machines
requested categories from the controller. The controller assigned the servant a list of categories
to populate. If any of these categories did not populate within 24 hours, the controller allowed
them to be reassigned to another servant. If a category had been reassigned to 3 servants and
still had not populated, it was flagged as problematic, so that a programmer could investigate why
the category was not populating.
Two queues of categories were maintained on the controller: a high-priority queue and a
low-priority queue. If there were any categories in the high-priority queue, these were processed
first; otherwise the low-priority queue was processed. A category was added to the high-priority
queue if a user had attempted a search on that category – the high-priority queue was intended to
ensure that documents for categories that users are specifically interested in were imported to the
CDB as soon as possible. The low priority queue contains categories that no users had yet
requested, but that could conceivably be requested in the future, thereby ensuring that we had a
forward cache that could rapidly satisfy new requests.
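The two-queue dispatch policy can be sketched in a few lines; the queue and function names here are illustrative, not taken from the actual implementation.

```python
from collections import deque

# High-priority queue: categories a user has searched for.
# Low-priority queue: the speculative "forward cache" of categories.
high_priority = deque()
low_priority = deque()

def enqueue(category, user_requested=False):
    (high_priority if user_requested else low_priority).append(category)

def next_batch(n):
    """Hand a servant up to n categories, draining high priority first."""
    batch = []
    while len(batch) < n and (high_priority or low_priority):
        queue = high_priority if high_priority else low_priority
        batch.append(queue.popleft())
    return batch
```

A `deque` gives O(1) removal from the front, which suits the first-come-first-served order within each queue.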
The controller machine was always started first, so that it could listen for requests, for
category lists, from the servants. As each servant was started, it requested a list of categories
from the controller, then populated documents into each of those categories, and notified the
controller as each category was completed. The servant requested new categories once it had
completed all of the categories in its current list. To populate an individual category, the servant
requested the top matches for that category from a search engine (e.g. top 10 documents in the
category, by searching Yahoo), and the servant then retrieved each of these high-ranking
documents, and stored the documents in an indexed database (the database structure is shown in
§5.2). The experiments reported in §6 made use of this indexed database structure. During our
experiments, we noticed that the index structure resulted in significant performance impediments
both in populating the CDB and in querying the CDB. Population using this index structure
occurred at approximately 40,000 categories (400,000 documents) per month. A single phrase
query to gather the aggregate statistics for 40,000 categories consumed 2 to 3 days per query.
We have therefore begun experiments with the storage of documents in simple directory folders,
with one directory folder per category, to determine if population and query performance can be
improved in this manner.
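A servant's population step for a single category, as described above, might look like the following sketch, with the search-engine call and HTTP download abstracted behind injected functions (the real system queried Yahoo and wrote to an indexed MySQL database; here storage is a plain dict):

```python
DOCS_PER_CATEGORY = 10   # top 10 documents per category, as in the paper

def populate_category(category, search, fetch, store):
    """Fetch the top-ranked documents for one category and store them.

    `search(category)` stands in for the search-engine query (returns URLs
    in rank order); `fetch(url)` stands in for the HTTP download.
    """
    urls = search(category)[:DOCS_PER_CATEGORY]
    for rank, url in enumerate(urls, start=1):
        html = fetch(url)
        store[(category, rank)] = (url, html)
    return len(urls)
```

In the real system this loop would also update the category's population timestamps on the controller once complete.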
Returning now to the controller-servant architecture implemented, the optimal number of
categories for a servant to request depends on the size of the controller’s queue, and the number
and speed of the servants. If the controller’s queue is long, and there are a small number of fast
servants, each servant should request many categories, to reduce the number of individual
requests for new category names to populate, to the controller. If the controller’s queue is short,
each servant should request only one category at a time, so as to maximize parallel processing of
this short queue amongst the servants. If too many categories were requested by a single
servant, and there were none left in the controller’s queue, other servants would remain idle
while the overloaded servant churned slowly through its queue. While we implemented only a
simple load-balancing scheme, it is clear that CDB population routines would benefit from
sophisticated load-balancing arrangements – the literature is replete with alternatives [87, 119].
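One possible rendition of this batch-size heuristic is given below. The exact formula is our assumption (the paper states only that a simple scheme was implemented); the key property is that no servant claims more than an even share of a short queue.

```python
def batch_size(queue_length, num_servants, max_batch=100):
    """How many categories a servant should request at once."""
    if num_servants == 0 or queue_length == 0:
        return 0
    # Never claim more than an even share of the queue, so no servant
    # sits idle while another churns through an oversized batch.
    fair_share = queue_length // num_servants
    return max(1, min(max_batch, fair_share))
```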
For robustness, the data on each servant could be replicated on other servants, to ensure that
there is no single point of failure – if a servant goes down, another servant either holds the data, or
a team of servants can gather the data. We did not implement a replication scheme, though one
would be advisable for a production quality system.
When a user entered a search term, wishing to explore document categories and compute
aggregate statistics for each category, the search term was received at the controller. The
controller determined which servants held the documents for each document category that the
user requested. The controller then contacted each relevant servant, and requested the aggregate
statistics for that search term, for the categories which the user requested, and which were on
that servant. This was an asynchronous process, meaning the controller did not block while it
awaited results, and instead continued with other tasks. When the servant had completed the
calculations, it contacted the controller with its results. If a servant went down and later
recovered, it would complete the requests in its queue only if they were recent (i.e. if the
controller had not already reassigned them). Again, a replication scheme is advisable, but was
not implemented. For instance, if
the servant did not complete its calculations within a specified time, the controller should contact
an alternative servant that possessed the data. This would ensure that no bottlenecks would arise
when a servant went down.
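The asynchronous fan-out just described can be sketched with a thread pool: the controller submits a request to each servant holding relevant categories and merges replies as they arrive, without blocking on any one servant. The names below are illustrative; `servants` maps a servant id to a callable standing in for the remote call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fan_out(term, servants, timeout=None):
    """Ask every servant for its aggregate statistics and merge replies."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max(1, len(servants))) as pool:
        futures = {pool.submit(rpc, term): sid for sid, rpc in servants.items()}
        for future in as_completed(futures, timeout=timeout):
            merged.update(future.result())   # servant's per-category stats
    return merged
```

A production version would catch per-servant timeouts and re-route the affected categories to a replica, as suggested in the text.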
A cache of query results (aggregate statistics) was maintained, to speed up repeat queries.
For example, if the number of hits on “warm” for the “United States Cities with State Name”
hierarchy had been recently computed, we only needed to re-compute the statistic for sub-
categories whose document collection has changed since the last query. For the US location
hierarchy (constituting approximately 40,000 categories), a complete cache of the number of
word hits and document hits in all categories for a single search term, stored in a Comma-
Separated-Value (CSV) file or Excel spreadsheet, consumes approximately 2MB of storage space.
The CSV file provides near-instantaneous response times to repeat queries involving the same
search term, in cases where the document-base is unchanged since the previous query. The
query response time for repeat queries increases roughly proportionately17 with the number of
documents that have changed since the last query using that same search term.
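The recompute-only-what-changed caching described above can be sketched as follows, with hit counting reduced to a plain substring scan and a per-category version number standing in for "documents changed since the last query". All names are illustrative.

```python
cache = {}            # (term, category) -> (hit_count, version_counted)
doc_versions = {}     # category -> version, bumped when its documents change

def hits(term, category, text):
    """Return the hit count for `term`, recounting only on document change."""
    version = doc_versions.get(category, 0)
    key = (term, category)
    if key in cache and cache[key][1] == version:
        return cache[key][0]               # near-instant repeat query
    count = text.lower().count(term.lower())
    cache[key] = (count, version)
    return count
```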
In our experiments (see §6), we made use of ten Pentium machines, with 2.6 GHz or greater
processors, 1 GB of RAM, and 500 GB of hard drive space each, totaling approximately 2
terabytes of storage. The population and exploration routines were written in Python, and all
data was stored in a MySQL 5.0 database. We imported the classification schemes shown in
Table 1, and then populated the Categorized Document Base by obtaining the top 10 documents
for each category from Yahoo.com. The taxonomy importation process alone took many weeks,
as each classification scheme was in its own format, and needed to be imported into a standard
tabular format. The full text of all HTML documents was imported. Other document formats,
such as occasional PDF and Word documents, were ignored. After taxonomy importation was
17 The increase is not exactly proportionate since the size of each new document varies, and document size also affects the rate at which term occurrence counts are computed.
completed, we began to populate the various categories with documents. A total of
approximately 240,000 categories (2.4 million documents) were populated, over a period of 6
months in the latter half of 2007. For the experiments reported in §6 we made use of only a
subset of these categories – approximately 40,000 categories in the taxonomy “United States
Cities with State Name” which were gathered using 6 machines over a period of a few weeks in
summer 2008. The remaining taxonomies and categories were used in industrial trials with
client organizations (see §7), or for initial experimentation that was later abandoned18.
5.2 CDB Database Structure
The database tables used by the CDB can be divided into three major areas:
1. Tables used by the Population routines (§4.2) to represent and index the documents
gathered by the CDB – see Figure 12.
2. Tables used by the Query routines (§4.4) to store and cache queries and their results
(aggregate statistics) from the CDB – see Figure 13.
3. Tables used to implement the Collaborative Annotation interface (§4.6) – see Figure 14.
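As a rough illustration only, the population-side tables of Figure 12 might be rendered along the following lines. The real system used MySQL 5.0 with richer indexing; the table and column names here are our guesses, and SQLite is used solely to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    category_id    INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    populate_start TEXT,    -- timestamp population last started
    populate_end   TEXT     -- timestamp population last ended
);
CREATE TABLE document (
    document_id  INTEGER PRIMARY KEY,
    category_id  INTEGER NOT NULL REFERENCES category(category_id),
    rank         INTEGER,   -- position in the search engine results
    url          TEXT,
    full_text    TEXT
);
CREATE INDEX idx_document_category ON document(category_id);
""")
```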
As mentioned earlier, we found that use of a relational database structure by the Population
routines for the representation and indexing of documents resulted in significant performance
impediments. As clients found the 2 to 3 day delay in result compilation to be excessive, we
have begun experiments with the storage of documents in simple directory folders and our initial
results indicate that we can obtain aggregate statistics for 20,000 categories within a few minutes
using a traditional file storage scheme.
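A minimal sketch of the directory-per-category alternative, where an aggregate query is simply a sequential scan over each category's folder (paths and names are illustrative):

```python
import os

def aggregate_hits(root, term):
    """Return {category: hit count} for a directory-per-category CDB."""
    results = {}
    term = term.lower()
    for category in sorted(os.listdir(root)):
        folder = os.path.join(root, category)
        if not os.path.isdir(folder):
            continue
        count = 0
        for filename in os.listdir(folder):
            with open(os.path.join(folder, filename), encoding="utf-8") as f:
                count += f.read().lower().count(term)
        results[category] = count
    return results
```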
18 For example, for the reasons reported in §4.2, we observed that the quality of documents in the US Patent taxonomy was poor, and this taxonomy was therefore abandoned as unusable.
6. EXPERIMENTAL EVALUATION
To evaluate our CDB approach, we set up a number of experiments in varied industries, including
some industries that relate to natural phenomena, and some that relate to commercial phenomena. We
observed that our studies could be segregated into:
population-sensitive studies (§6.1) – such as burgers, pizza, and hotels – where the
phenomenon varies by population
versus
non-population-sensitive studies (§6.2) – such as solar, wind and rain – which are governed
by natural phenomena, rather than driven by human population.
In all experiments, we made use of a taxonomy of United States place names, taken from the United
States General Services Administration Geographic Locator Codes (US GSA GLC) / Geographic
Names Service19. This taxonomy was then populated with the ten most relevant documents for each
place, from Yahoo20. A total of 27,547 individual places were investigated, and the ten most relevant
documents acquired for each. In each experiment, we gathered summary statistics for industry-specific
search terms across these top-ranked documents for each location. We then aggregated the data by
state, and made use of a public, structured data source for each state to validate whether the findings
from the exploration of the document-based data in the CDB were sensible. In all cases we used
Pearson correlation (Pearson’s ρ) [102]21 to determine whether the ranking of states as provided by
aggregate statistics from the textual CDB was correlated with the ranking we obtained for those states
19 http://www.gsa.gov/glc
20 http://www.yahoo.com/
21 We chose to use Pearson’s ρ [102] instead of Spearman’s rank correlation coefficient (Spearman’s ρ) [115, 123], since Spearman’s measure does not cater for ties. Pearson’s ρ is equivalent to Spearman’s rank correlation coefficient when computed on ranks. In many cases, we also computed Kendall’s tau rank correlation coefficient (Kendall’s τ) [71, 72]. However, for brevity, we have not shown Kendall’s τ in our results, as Kendall’s metric consistently showed similar statistical significance to Pearson’s ρ, and so does not provide substantial additional information.
from an alternative structured, quantitative data source, and hence whether the aggregate statistics from
the textual CDB provided a reliable proxy for alternative, widely-accepted quantitative data22.
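This validation step reduces to computing Pearson's ρ between two state orderings: one from the CDB's aggregate statistics, one from the external structured data. A pure-Python sketch, with invented rankings in the usage test:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Applied to rank vectors, this is numerically identical to Spearman's rank correlation, as footnote 21 notes.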
6.1 Population-Sensitive Studies
We regard population-sensitive studies as those where the phenomenon is likely to vary with
population. To further investigate population-sensitive industries, we selected the top 100
American franchise corporations from Entrepreneur Magazine’s Franchise 500 listing23 for the
year 2008. As some of the franchises were in the same industry, we selected only the largest
franchise from each of the 37 unique industries found24. For each industry, we visited the
website for the largest franchise corporation in that industry, viewed the home-page source code,
and selected up to 4 META tag keywords listed by the company, which described the company’s
main product or products25. We then ran each keyword against the above-described CDB
(populated with documents for each United States location from the US GSA GLC), aggregated
the hits for each keyword by state, and ranked the states from states with the most hits per 1,000
words26 for that keyword down to those states with the least. To determine whether the CDB
ranking correlated with independent data about the popularity of each product in each state, we
visited a commercial data provider, InfoUSA27. For each corporation, we submitted the name of
22 The source data used for our experiments – including summary statistics for each keyword in each state obtained from the CDB, and the statistical calculations we performed – is too large for inclusion here, but is available separately for download from the authors, should the reader wish to validate our findings. Due to copyright restrictions on the quantitative data which we obtained for each state from the external data sources, the authors are unable to redistribute the external data sources. However, we have provided hyperlinks in footnotes in all cases, to allow the reader to obtain the data themselves. Also, owing to copyright restrictions on the Yahoo search results we used to populate the CDB, the authors are unable to make the source documents available, but the reader can compile a similarly-arranged data set using the techniques described in this paper, albeit for a different point in time.
23 http://www.entrepreneur.com/franchise500/
24 In the case of Dunkin’ Donuts (coffee and donuts franchise), ranked 3rd in the Franchise 500, we noticed that the franchise had a predominantly East Coast penetration in the United States, attributable to the franchise’s unique and peculiar roll-out strategy, and we therefore substituted it with Starbucks Corporation, which had a more representative national penetration of stores across all states.
25 In the rare event that META keywords were not available, we manually decided appropriate product keywords for the company.
26 We divided the raw hit count for the state by the total number of words in the documents for that state, to remove population biases which result from some states having more locations, and hence more documents and words, than other states.
27 http://www.infousa.com/
the corporation to InfoUSA, and obtained a count of the number of outlets operated by the
corporation in each state. To remove population biases and attempt to ascertain relative demand
for each product in each state, we divided the number of outlets in the state by the population of
that state from the US Census Bureau28, and then ranked the states by number of outlets per-
capita for each corporation. Finally, we compared the ranking of states from the CDB (relative
term frequency for the product keyword in each state29) to the ranking of the states by number of
outlets per-capita operated by the corporation in each state. Our results are shown in Table 430 –
correlations significant at the 90% confidence level are shown in bold. For brevity, and because
these experiments did not generate significant correlations, we have shown only a small
representative selection of industries. As can be seen in column 5 of Table 4, all industries were
indeed strongly population-sensitive as we had expected, with franchise outlets per state being
highly correlated with population of that state. Column 4 of Table 4 shows, however, that
statistically significant positive correlations between term frequency for the search term and per-
capita franchise outlets per state for the industry were seldom found. This indicates that the
CDB is not a credible instrument for discerning differences in per-capita demand for different
products between states in the population-sensitive industries in our experiments.
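The per-state normalization described above (hits per 1,000 words, footnote 26) can be sketched as follows; the counts in the usage test are invented for illustration.

```python
def rank_states(hits_by_state, words_by_state):
    """Rank states by term frequency: hits per 1,000 words, descending."""
    freq = {s: 1000.0 * hits_by_state[s] / words_by_state[s]
            for s in hits_by_state}
    return sorted(freq, key=freq.get, reverse=True)
```

Dividing by total words removes the bias whereby states with more locations, and hence more documents, would dominate on raw hit counts alone.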
28 http://www.census.gov/popest/states/NST-ann-est.html
29 Originally, we computed the raw total hit count for the product keyword in each state. However, we found that the strong correlation between raw total hit count and number of franchise outlets was spurious, since states with more locations had more documents, and hence more words and keywords. The authors are grateful to the reviewers for pointing out this issue. We therefore made use of an alternative metric – term frequency (hits per 1,000 words) – which is not skewed by population.
30 As the table is large, it has been split into 4 parts for readability.
6.2 Non-Population-Sensitive Studies
We regard non-population-sensitive studies as those which are governed by some natural
phenomenon, rather than by human population. Through a process of group brainstorming, we
identified a short-list of non-population-sensitive industries. For each industry, we identified an
independent external data source that provided state-specific metrics for that industry. We used
the CDB system to rank each state by the metric in question, and then we compared these
rankings with rankings established by the independent external data sources. Following are the
external data sources we gathered for each industry:
Wind energy : We obtained data from the Department of Energy (DoE), Energy Efficiency
and Renewable Energy, on current installed wind power capacity, in Megawatts, per state
as at Jan. 31st 200931. We also obtained data from the National Renewable Energy
Laboratory (NREL) on annual average wind resource estimates, in Megawatts, in the
contiguous United States32.
Solar energy : We obtained data on annual average daily solar radiation, in British
Thermal Units (BTUs) per square meter for a 10 tube solar collector, for each US state33.
Rain : We obtained data from the National Oceanic and Atmospheric Administration
(NOAA), on total inches of precipitation for 2008, for each state34. We also obtained data
from the National Atlas on average annual precipitation per square mile for each US
State from 1961 through 199035.
31 http://www.eere.energy.gov/windandhydro/windpoweringamerica/wind_installed_capacity.asp
32 http://rredc.nrel.gov/wind/pubs/atlas/maps/chap2/2-01m.html
33 http://www.thermomax.com/usdata.htm
34 http://cdo.ncdc.noaa.gov/cgi-bin/climaps/climaps.pl?directive=quick_search&subrnum=
35 http://nationalatlas.gov/printable/precipitation.html
Fishing : We obtained data from the United States Fish and Wildlife Service (USFWS),
on the number of non-resident fishing licenses issued per state36.
Coal, Gemstone, and Gold : We obtained data from the National Mining Association
(NMA) State Fact sheets, on the total number of mines, total production, and total
revenue, for coal, gemstones, and gold in each US state37.
Forests : We obtained data from the National Forest Service (NFS) on total forest acres
under administration in each state38.
Oil : We obtained data from the Energy Information Administration (EIA) on oil
production for each state39.
Mountain Climbing : Data on the highest elevations in each state was obtained from the
United States Geological Survey (USGS)40.
Eco-tourism and Gambling : Data on employment in these specialty occupations was
obtained from US Bureau of Labor Statistics Occupational Employment Statistics (OES)
state cross-industry estimates41.
After gathering external data for each industry, to allow us to rank and compare the states for
that industry, we compared the ranking of states using the external data, to a ranking of each
state by term-frequency for a search term for that industry using the CDB. Table 5 (column
3) shows the search term keywords we chose for each industry. In each case we computed
36 http://www.fws.gov/news/newsreleases/R9/A2D9B201-0350-4BD4-A73477A70A25FC69.html?CFID=3980850&CFTOKEN=92935320
37 http://www.nma.org/statistics/states_econ.asp
38 http://www.fs.fed.us/publications/documents/report-of-fs-2002-low-res.pdf
39 http://tonto.eia.doe.gov/dnav/pet/pet_crd_crpdn_adc_mbbl_a.htm Note: this document is no longer accessible on the EIA website but can be found in Google’s cache by searching for the URL using Google.
40 http://erg.usgs.gov/isb/pubs/booklets/elvadist/elvadist.html
41 http://www.bls.gov/oes/oes_dl.htm
(see column 4) the correlation – using Pearson’s ρ – between the state ranking using the
external data for that industry versus the CDB ranking of states by search term frequency
(hits for the search term per 1,000 words) for that industry. We also computed (column 5)
the correlation between the external data and the population for that state, to confirm whether
the industry was indeed non-population-sensitive. As before, correlations significant at the
90% confidence level are shown in bold. For instance, for the mountain climbing industry,
we find a strong (0.65) correlation between the ranking of states by USGS Elevation Data
and the ranking of states by term frequency for the term “mountain climbing”. As is evident
from Table 5, for non-population sensitive industries, we regularly find statistically
significant correlations between the ranking of the states by relative term frequency of the
search term using the CDB, and the ranking of the states using the external data. We
conclude that the CDB is an approximate, but viable, means of comparing states for the non-
population sensitive industries in our experiments, as the CDB rankings are plausible proxies
for rankings of the phenomena obtained from external data.
Given the large number of correlations run, the Bonferroni correction [12, 13] needs to
be applied to determine whether each result, when considered alone, is statistically
significant. After dividing the desired statistical confidence level (p = 0.1) by the number of
experiments (17), the actual p-values obtained are, in most cases, sufficiently low to
conclude that the correlation is significant.
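A worked illustration of the Bonferroni adjustment as applied here, dividing the desired level α = 0.1 by the 17 experiments; the p-values in the usage test are invented.

```python
def bonferroni_significant(p_values, alpha=0.1):
    """Flag each p-value against the Bonferroni-adjusted threshold."""
    threshold = alpha / len(p_values)   # e.g. 0.1 / 17 ≈ 0.0059
    return [p <= threshold for p in p_values]
```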
To determine whether the set of correlations, when taken together, is significant, a chi-
squared test (χ² test) can be performed42. For instance, at a 90% confidence level it is likely
that 10% of studies performed would, by chance, indicate correlations. A χ²-test can be
42 The Bonferroni adjustment is notoriously conservative and lacking in power, so a χ-squared test here is dispositive.
performed to reveal whether the actual number of correlated studies is significantly different
from 10%. Of the 17 non-population-sensitive industries investigated, 13 produced
statistically significant positive correlations. Though admittedly a small sample, the p-value
for this χ²-test (13 actual correlations obtained in 17 studies, compared to 1.7 correlations in 17
studies expected) is substantially less than 0.01, indicating strong statistical significance. We
conclude that the CDB appears to perform satisfactorily for non-population sensitive
industries.
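The χ²-test just described can be reproduced directly: 13 of 17 studies showed significant correlations, where chance at the 90% level would predict about 10%, i.e. 1.7 of 17.

```python
def chi_squared(observed, expected):
    """One-degree-of-freedom chi-squared statistic for two cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

n, significant, p0 = 17, 13, 0.1
stat = chi_squared([significant, n - significant],
                   [n * p0, n * (1 - p0)])
# stat ≈ 83.5, far above 6.63 (the 1% critical value for df = 1),
# so p << 0.01, matching the conclusion in the text.
```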
6.3 Discussion
The CDB appears to produce comparisons of varying validity across population-sensitive
versus non-population-sensitive industries, with performance seemingly better on non-population-
sensitive industries. We speculate that this is the case because the distinction between states in
non-population-sensitive industries is far more stark than for population-sensitive industries, and
the CDB is a relatively approximate instrument that is only capable of discerning stark
differences. For instance, oil production and mountain climbing vary considerably between
states, whereas hamburger consumption does not, and the CDB is incapable of ascertaining this
more subtle distinction.
The reader may notice from the experiments that a number of important challenges remain
for the CDB, such as disambiguating multiple word senses, and determining whether the
statistics generated are indicative of consumer demand, or of market supply. We leave these
challenges for future work (see Section 8).
Attempting to provide quantitative assessments as to the magnitude of a market phenomenon
from qualitative text is not new. For instance, Romano et al [112] developed a qualitative data
analysis methodology that successfully predicts box office opening success based on pre-release
free-form comments about the movie on the web. Their work showed promising results. Our
findings, above, concur, and affirm Romano et al’s contention that an appropriate methodology
for the analysis of free-form text can reveal meaningful evaluative information of market
phenomena.
7. APPLICATIONS
CDBs have a number of useful applications, including market research, sales lead prospecting,
competitor or substitute identification, or exploring unfamiliar collections of topics or items.
For market research, the experiments shown in the previous section indicate that CDBs can be a
plausible means of assessing industry penetration by state in a number of industries, as the quantitative
data for certain industries has been shown to correlate with the CDB rankings for descriptive terms for
that industry. While the assessments produced by the CDB are certainly flawed, they are nevertheless
demonstrably better than random, and therefore possess some information value. The CDB method
should be generally useful when one wants a ranking of categories in non-population-sensitive
industries, can tolerate some error, and where independent data of good quality does not exist. While we
would not recommend that investments in product roll-out be fashioned directly around the CDB’s
findings, we see the CDB as a useful exploratory tool that is able to suggest locations of interest for
further investigation or trials. We have employed the CDB in an engagement with a growing pet
insurance company, PetPlan USA (http://www.gopetplan.com/). In our engagement, we compared the
locations of current customers of the company, with hit counts for ‘dog’ across all US locations, to
determine promising locations of future interest. This information, in conjunction with other
intelligence gathered by the organization, is used to inform PetPlan’s marketing strategy. However, as
the CDB provides only an incidental contribution to the overall marketing decisions, it is not possible to
attribute specific dollar benefit to the information obtained from the CDB in this case.
In the area of sales lead prospecting, Du Pont corporation, a Fortune 500 chemical company, has
experimented with our CDB for the identification of prospective markets and customers for their
products. In one of their exploratory investigations, Du Pont made use of a taxonomy of industries, the
North American Industry Classification System (NAICS), and searched for hits for various attributes of
a chemical surfactant they manufacture across those industries. To assist the team of business
development managers and engineers with their investigations, we implemented web-based
collaboration features to allow users to capture and share their comments on particular industries that
showed high scores (see §4.6 and Figure 11). For instance a business analyst who comments “this
industry is a large market with few barriers to entry and should be investigated further” may receive a
response from a chemical engineer stating “this industry is unfortunately not feasible as the surfactant is
not food-safe”. Though we cannot attribute any specific new revenue to the CDB, there is anecdotal
evidence that the CDB uncovered industries of interest: trial users at Du Pont reported that the CDB
uncovered unusual industries they had not previously considered as potential markets. One trial user
also reported receiving an unexpected contact from a company in an industry identified by the CDB as
interesting.
For reasons of confidentiality, the following example is fabricated, but illustrative of the process that
can be followed to find new sales prospects. Assume that a salesperson has identified, using a CDB,
that, in comparison to other industries, documents from the plastics packaging industry mention the
attributes of the chemical that she is trying to sell with unusual frequency. The salesperson concludes
that companies in the plastics packaging industry may be interested in her compound. The salesperson
is able to use the NAICS or SIC code for the plastics packaging industry to retrieve a list of potential
clients from a public source, such as the United States Securities and Exchange Commission (SEC)43:
Figure 16 shows a portion of the company listing she obtained in this way. The “Navigate” button in
Figure 16 allows the salesperson to select a prospect from the list and click the button to quickly
navigate to the company’s financial reports in order to further qualify the prospect. The salesperson has
successfully integrated knowledge gleaned from the CDB (unstructured data indicating that a certain
industry mentions her product with unusual frequency) with a structured data source (list of companies
in the identified candidate industry from the SEC), and has been rapidly able to identify a previously
unrealized lucrative target market, and construct a list of specific potential prospects.
In the area of competitor and substitute identification, we speculate that, when used in conjunction
with a taxonomy of industries or companies or products, CDBs can be used to identify particular
industries or companies or products that mention certain attributes with unusual frequency. We have
not, however, yet undertaken any academic or commercial trials in this sphere and are currently seeking
research partners to progress such studies.
In the area of exploring unfamiliar collections of topics or items, we speculate that the CDB may be
useful for uncovering topics or items with particular attributes amongst a large set of unfamiliar topics or
items. For example, when populated with the most relevant pages for a list of hospitals, the CDB could
be helpful in identifying hospitals with particular specialties (e.g. ‘cardiology’). Similarly, when
populated with the top pages for a list of universities or schools, the CDB could conceivably identify
those with particular attributes (e.g. universities with a specialty in ‘chemical engineering’, or schools
that frequently mention students going on to ‘Ivy League colleges’). The CDB would be especially
useful if the aggregate data the CDB produced from text were combined – ‘mashed up’ – with
43 http://www.sec.gov/. Similar data, listing the companies in a given industry, could also have been obtained from commercial sources, such as Hoovers, Dun & Bradstreet, Microsoft Money, Yellow Pages, or other alternatives.
structured data from other sources (see Section 4.5), to allow for multi-criterion decision-making. This
would, for example, allow a student to compare colleges offering ‘chemical engineering’ while
simultaneously looking at the annual fees and geographic locations for those colleges. Similarly, a
middle-school parent who would like to relocate nationally to a better school district for their child, may
be able to use a CDB to identify high schools reporting students going onto Ivy League colleges while
simultaneously looking at the median house price in the school’s neighborhood and the property taxes
for the county to assess affordability. Again, these applications are conjectured, and no exploratory
trials have been performed.
8. LIMITATIONS AND FUTURE WORK
As is evident from our earlier experimental evaluation (Section 6), the aggregate statistics obtained from
categorized text vary in their usefulness. This can be due to a number of factors, including ambiguous
terms, presence of negation and antonyms, alternative word forms, the tone of the text, intermingling of
information from unrelated themes or from multiple taxons in single documents, document duplication,
reporting biases, time and location specificity, relativity of reference points used for comparison (e.g.
40°F may be ‘warm’ for an Alaskan but not for a Floridian), human population biases, precision and
recall of the underlying search engine and relevance of its results, number of documents per taxon,
asymmetry in the number of child categories per parent taxon, disjoint phrases, distinguishing between
expressions of consumer demand and expressions of market supply, and other issues. Further
experiments are required to determine the influence of a number of possible alterations to our technique:
for example, using short snippets of text instead of full documents, only storing certain document types
for each category (e.g. only news articles for that category or only encyclopedia articles for that
category), or using more or fewer documents per category. For reasons of space, we leave it to future
work to comment in more detail on the CDB limitations described above, the mechanisms that can be
used to mitigate their influence, and the observed effects of alterations to our core technique.
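Several of these limitations are visible even in a deliberately naive aggregation routine. The sketch below uses hypothetical taxons and documents; it computes a simple per-taxon hit count, and the comments mark where negation and document duplication distort the statistic:

```python
# Hypothetical documents per taxon; not drawn from our corpus.
docs_by_taxon = {
    "Alaska": [
        "The winters are not warm here.",  # negation: still counted as a hit
        "Cold, cold winters.",
    ],
    "Florida": [
        "Warm weather all year.",
        "Warm weather all year.",          # duplicate document: doubles the count
    ],
}

def hit_count(taxon, phrase):
    """Naive aggregate: total occurrences of `phrase` in a taxon's documents."""
    return sum(doc.lower().count(phrase.lower()) for doc in docs_by_taxon[taxon])

for taxon in docs_by_taxon:
    print(taxon, hit_count(taxon, "warm"))
```

Mitigations such as negation handling and document de-duplication would change both counts here; their systematic treatment is deferred to future work.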
9. SUMMARY
In this paper, we have illustrated an approach for populating and exploring Categorized Document
Bases (CDBs). CDBs represent a helpful middle ground between unstructured and structured data, since
the documents are well-organized (categorized), though not structured. The CDB is a rough tool,
capable of producing plausible comparisons of categories against each other only when the categories
are starkly different (as in the case of many non-population-sensitive industries).
When setting up the CDB, it is important that category descriptors are unambiguous and that sufficient
highly relevant documents exist even for obscure categories in the classification scheme. The aggregate
statistics are independently useful, but can also be integrated with structured data for the categories – for
example, using bubble charts or tables, and using category identifiers to cross-reference from the
aggregate statistics for each document category to the traditional structured numeric data.
We assessed the reasonableness of our CDB approach through a number of experiments that
compared our aggregate results for each category to closely related numeric data, to determine whether
our proxy measures – aggregates derived from textual data – correlate at all with their quantitative
counterparts obtained from well-accepted structured sources. Our experiments seem to indicate that, for
a taxonomy such as the US GSA GLC list of US locations, where the CDB can be mechanically
populated with relevant content for each category, the CDB appears to produce a plausible reflection of
both natural and market phenomena in multiple industries, but only in industries where the locations
under comparison diverge substantially. The results, therefore, appear to partially support the
hypothesis at the start of this paper that our CDBs can allow mountains of text information on locations
to be distilled into sensible comparisons of those locations.
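The degree of agreement between a CDB-derived ranking and an independent numeric ranking can be quantified with a rank-correlation coefficient such as Kendall’s tau. A minimal implementation follows; the two value lists are hypothetical stand-ins for a CDB aggregate and its structured counterpart, not figures from our experiments:

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs.
    Assumes no tied values."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical values over the same five taxons: a text-derived aggregate
# (e.g. phrase hits per location) and an independently obtained quantity.
cdb_values = [120, 85, 60, 30, 10]
independent = [950, 700, 720, 200, 90]

print(kendall_tau(cdb_values, independent))
```

A tau near 1 means the text-derived aggregate orders the taxons much as the independent data does; a tau near 0 means little association.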
We described some applications of our research, including market research, sales lead prospecting,
and rapidly obtaining insights into new collections of topics or items. We also briefly documented a
number of limitations we have found in the CDB population and exploration process; helpful repairs
and alterations that improve the quality of the results are outside the scope of this paper and will
be discussed in detail in future work.
In summary, we have described and evaluated a method for the creation and exploration of
Categorized Document Bases, and shown, through varied experiments, that our method can be useful.
Our experiments indicate that the CDB method should be generally useful when one wants a ranking of
categories in non-population-sensitive industries, can tolerate some error, and lacks independent data of
good quality. It would appear that the CDB approach we have proposed is a promising means of
extracting additional value from textual documents, but much further work is needed to refine the CDB
construction and usage method presented.
10. ACKNOWLEDGMENTS
Our thanks to the following research assistants, who assisted with the implementation and evaluation of the features and
algorithms described in the main text:
taxonomy importation scripts: Jason Gurwin, Debbie Chiou.
population routines: Joseph Leary, Ryan Mark Fleming, Shawn Zhou.
results visualization features: David Gorski, Ryan Namdar, Adam Altman, Ankit Choudari, Michael Pan, Myron
Robinson, Mark Weinberger.
experimental evaluation: Shawn Zhou, Ava Zhiyang Yang, Aditya Mehrotra, Peng Chen, Anjay Kumar, Anjay
Aushij, Erik Malmgren-Samuel.
Thanks are also due to:
John Ranieri and Ray Miller, of Du Pont Corporation ( http://www.dupont.com/ ), for championing CDB
experiments within Du Pont, and providing feedback on the results.
Chris and Natasha Ashton, of PetPlan USA pet insurance ( http://www.gopetplan.com/ ), for the provision of
PetPlan’s dog health insurance sales data.
The reviewers, whose suggestions were very valuable in improving the content of this paper.
11. REFERENCES
1. Apte C.; Damerau F.; and Weiss S. Automated learning of decision rules for text categorization. ACM
Transactions on Information Systems, 12, 3 (July 1994), 233-240.
2. Apte C.; Damerau F.; and Weiss S. Text mining with decision trees and decision rules. In, Conference on
Automated Learning and Discovery, Pittsburgh, PA, June, 1998, pp.1-4.
3. Agirre E., and Edmonds P. (eds.) Word Sense Disambiguation: Algorithms and Applications. Dordrecht: Springer,
2007.
4. Attardi G.; Gulli A.; and Sebastiani F. Automatic web page categorization by link and content analysis. In,
Hutchinson C., and Lanzarone G. (eds.), Proceedings of the European Symposium on Telematics, Hypermedia, and
Artificial Intelligence (THAI-99), 1999, pp.105-119.
5. Allen RB.; Obry P.; and Littman M. An interface for navigating clustered document sets returned by queries. In,
Proceedings of the Conference on Organizational Computing Systems, Milpitas, CA, November 1-4, 1993, pp.166-
171.
6. Behal A.; Chen Y.; Kieliszewski C.; Lelescu A.; He B.; Cui J.; Kreulen J.; Rhodes J.; and Spangler W.S. Business
Insights Workbench – an interactive insights discovery solution. Lecture Notes in Computer Science, 4558, (2007),
834-843.
7. Brin, S. and Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh
International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.
8. Borko H., and Bernick M. Automatic document classification. Journal of the ACM, 10, 2, (April 1963), 151-162.
9. Borko H., and Bernick M. Automatic document classification part II: additional experiments. Journal of the
ACM, 11, 2, (April 1964), 138-151.
10. Bhandarkar A.; Chandrasekar R.; Ramani S.; and Bhatnagar A. Intelligent categorization, archival and retrieval of
information. In, Proceedings of the International Conference on Knowledge Based Computer Systems (KBCS ’89),
Bombay, India, December 11-13, 1989, Lecture Notes in Computer Science, 444, Springer, 1990, pp. 309-320.
11. Blair D.C. Searching biases in large interactive document retrieval systems. Journal of the American Society for
Information Science, 31, (July 1980), 271-277.
12. Bonferroni, C. E. “Il calcolo delle assicurazioni su gruppi di teste.” In Studi in Onore del Professore Salvatore
Ortu Carboni. Rome: Italy, pp. 13-60, 1935.
13. Bonferroni, C. E. “Teoria statistica delle classi e calcolo delle probabilità.” Pubblicazioni del R Istituto Superiore
di Scienze Economiche e Commerciali di Firenze 8, 3-62, 1936.
14. Butler D. Mashups mix data into global service. Nature, 439, (4 January 2006), 6-7.
15. Bot R.S.; Wu Y.B.; Chen X.; and Li Q. A hybrid classifier approach for web retrieved documents classification.
In, International Conference on Information Technology: Coding and Computing (ITCC), 2004, pp. 326-330.
16. Bot R.S.; Wu Y.B.; Chen X.; and Li Q. Generating better concept hierarchies using automatic document
classification. In, Conference on Information and Knowledge Management, 2005, pp. 281-282.
17. Chen H., and Dumais S. Bringing order to the web: automatically categorizing search results. In, Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems. The Hague, The Netherlands. April 1-6, 2000,
pp. 145-152.
18. Chim H., and Deng X. A new suffix tree similarity measure for document clustering. In, 16th International World
Wide Web Conference (WWW2007), Banff, Alberta, Canada, May 8-12, 2007, pp. 121-130.
19. Chen H., and Ho T.K. Evaluation of decision forests on text categorization. In, Proceedings of the 7th SPIE
Conference on Document Recognition and Retrieval, 2000, pp. 191-199.
20. Cutting D.R.; Karger D.R.; Pedersen J.O.; and Tukey J.W. Scatter/Gather: a cluster-based approach to browsing
large document collections. In, Proceedings of the 15th annual international ACM SIGIR conference on Research
and development in information retrieval. Copenhagen, Denmark, June 21-24, 1992, pp. 318-329.
21. Calvo R.A.; Lee J.M.; and Li X. Managing content with automatic document classification. Journal of Digital
Information, 5, 2, (2004), 1-15.
22. Chen M.; LaPaugh A.; and Singh J.P. Categorizing information objects from user access patterns. In, Proceedings
of the Eleventh International Conference on Information and Knowledge Management, 4-9 November 2002, pp.
365-372.
23. Croft W.B. Clustering large files of documents using the single link method. Journal of the American Society of
Information Science, 28, (1977), 341-344.
24. Croft W.B. Organizing and Searching Large Files of Documents. PhD thesis, University of Cambridge. (1978).
25. Cohen W.W., and Singer Y. Context-sensitive learning methods for text categorization. ACM Transactions on
Information Systems, 17, 2, (1999), 141-173.
26. Chen H.; Schuffels C.; and Orwig R. Internet categorization and search: a self-organizing approach. Journal of
Visual Communication and Image Representation, Special Issue on Digital Libraries, 7, 1, (1996), 88-102.
27. Chau R.; Yeh C.; and Smith K.A. A neural network model for hierarchical multilingual text categorization.
Advances in Neural Networks – ISNN 2005, Lecture Notes in Computer Science, 3497, (2005), 238-245.
28. Chung W, Chen H. and Nunamaker J. A Visual Framework for Knowledge Discovery on the Web: An Empirical
Study of Business Intelligence Exploration. Journal of Management Information Systems. 21(4). Spring 2005.
Pp. 57 – 84.
29. Dumais S., and Chen H. Hierarchical classification of web content. In, Proceedings of the 23rd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval . Athens, Greece,
July 24-28, 2000, pp. 256-263.
30. Dumais S.; Cutrell E.; and Chen H. Optimizing search by showing results in context. In, Proceedings of the
SIGCHI Conference on Human Factors in Computing Systems, Seattle, WA, March 2001, pp. 277-284.
31. Dworman G.O.; Kimbrough S.O.; and Patch C. On pattern-directed search of archives and collections. Journal of
the American Society for Information Science, 51, 1, (2000), 14-23.
32. Dagan I.; Karov Y.; and Roth D. Mistake-driven learning in text categorization. In, Cardie C. and Weischedel R.
(eds.), Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 1997, pp.
55-63.
33. Datta A., and Thomas H. The cube data model: a conceptual model and algebra for on-line analytical processing
in data warehouses. Decision Support Systems, 27, 3, (1999), 289-301.
34. Eklund P.W., and Cole R.J. Information classification and retrieval using concept lattices. United States Patent
20060112108.
35. Eder J.; Krumpholz A.; Biliris A.; and Panagos E. Self-maintained folder hierarchies as document repositories. In,
Proceedings of the 2000 Kyoto International Conference on Digital Libraries: Research and Practice , Kyoto,
Japan, November 13-16, 2000, pp. 356-363.
36. Efthimiadis E.N. Query expansion. In, Martha E. Williams (ed.), Annual Review of Information Systems and
Technology (ARIST), 31, 1996, pp. 121-187.
37. Farkas J. Generating document clusters using thesauri and neural networks. In, Canadian Conference on
Electrical and Computer Engineering, 1994, pp. 710-713.
38. Fellbaum C. (ed.). WordNet: An electronic lexical database. Cambridge, Massachusetts: Bradford Books / MIT
Press, 1998.
39. Ferrari AJ, Gourley DJ. Johnson KA, Knabe FC, Mohta VB, Tunkelang D, and Walter JS. Hierarchical data-
driven search and navigation system and method for information retrieval. US Patent 7062483. June 13, 2006.
40. Fürnkranz J. Exploiting structural information for text classification on the WWW. In, Proceedings of the Third
International Symposium on Advances in Intelligent Data Analysis, August 1, 1997, pp. 487-498.
41. Geffner S.; Agrawal D.; El Abbadi A.; and Smith T. Browsing large digital library collections using classification
hierarchies. In, Proceedings of the Eighth International Conference on Information and Knowledge Management,
Kansas City, MO, November 2-6, 1999, pp. 195-201.
42. Gray J.; Chaudhuri S.; Bosworth A.; Layman A.; Reichart D.; and Venkatrao M. Data cube: a relational
aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, 1,
1, (1997), 29-53.
43. Gietz P. Report on automatic classification systems for the TERENA activity Portal Coordination. (19 June
2001). Available at: www.daasi.de/reports/Report-automatic-classification.html
44. Goren-Bar D., and Kuflik T. Supporting user-subjective categorization with self-organizing maps and learning
vector quantization. Journal of the American Society for Information Science and Technology, 56, 4, (2005), 345-
355.
45. Goren-Bar D.; Kuflik T.; and Lev D. Supervised learning for automatic classification of documents using self-
organizing maps. In, Proceedings of the First DELOS Network of Excellence Workshop on “Information Seeking,
Searching and Querying in Digital Libraries”, Zurich, Switzerland, December 11-12, 2000.
46. Garfield E.; Malin MV.; and Small H. A system for automatic classification of scientific literature. Journal of the
Indian Institute of Science, 57, 2, (1975), 61-74.
47. Golub K. Automated subject classification of web documents. Journal of Documentation, 62, 3, (2006), 350-371.
48. Godby J., and Stuler J. The Library of Congress Classification as a knowledge base for automatic subject
categorization. In, Subject Retrieval in a Networked Environment (IFLA Preconference), Dublin, Ohio, August
2001.
49. Guo G.; Wang H.; Bell D.; Bi Y.; and Greer K. KNN model-based approach in classification. In, Proceedings of
the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE 2003), Catania,
Sicily, Italy, 3-7 November 2003. Lecture Notes in Computer Science, 2888, Springer-Verlag, 2003, pp. 986-996
50. Guo G.; Wang H.; Bell D.; Bi Y.; and Greer K. An kNN model-based approach and its application in text
categorization. In, Proceedings of the 5th International Conference on Intelligent Text Processing and
Computational Linguistics (CICLing), 2004, Lecture Notes in Computer Science, 2945, Springer-Verlag, pp. 559-
570.
51. Hearst, M. Clustering versus Faceted Categories for Information Exploration. Communications of the ACM. 49
(4). April 2006.
52. Hearst M, English J, Sinha R, Swearingen K, and Yee P. Finding the Flow in Web Site Search. Communications
of the ACM. 45 (9). September 2002. Pp.42-49.
53. Holland JM, Kreulen JT, and Spangler WS. Method and system for identifying relationships between text
documents and structured variables pertaining to the text documents. US Patent 7155668. December 26, 2006.
54. Huffman S., and Damashek M. Acquaintance: a novel vector-space n-gram technique for document categorization.
In, Proceedings of the 3rd Text Retrieval Conference (TREC 3), 1994, pp. 305-310.
55. Hatzivassiloglou V.; Gravano L.; and Maganti A. An investigation of linguistic features and clustering algorithms
for topical document clustering. In, Proceedings of the 23rd Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, Athens, Greece, 2000, pp. 224-231.
56. Hussin M.F., and Kamel M. Document clustering using hierarchical SOMART neural network. In, Proceedings of
the International Joint Conference on Neural Networks, 20-24 July 2003, pp. 2238 – 2242.
57. Hayes P.; Knecht L.E.; and Cellio M.J. A news story categorization system. In, Second Conference on Applied
Natural Language Processing (ANLP-88), 1988, pp. 9-17. Reprinted in Sparck-Jones K., and Willett P. (eds.),
Readings in Information Retrieval, San Francisco, CA: Morgan Kaufmann, 1997, pp. 518-526.
58. Han E.H.; Karypis G.; and Kumar V. Text categorization using weight adjusted k-nearest neighbor classification.
In, Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001, pp. 53-65.
59. Hofmann T. The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In,
Proceedings of the International Joint Conference in Artificial Intelligence, 1999, pp. 682 – 687.
60. Huynh D, Mazzochi S, and Karger D. Piggy Bank: Experience the Semantic Web Inside your Web Browser.
Journal of Web Semantics. 5(1). 2007. Pp. 16-27.
61. Iwayama M., and Tokunaga T. Hierarchical Bayesian clustering for automatic text classification. In, Proceedings
of the International Joint Conference in Artificial Intelligence, 1995, pp. 1322-1327.
62. Ide N. and Veronis J. Word sense disambiguation: the state of the art. Computational Linguistics, 24, 1, (1998), 1-
40.
63. Jenkins C.; Jackson M.; Burden P.; and Wallis J. Automatic classification of web resources using Java and Dewey
decimal classification. Computer Networks and ISDN Systems, 30, (1998), 646-648.
64. Joachims T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In,
Proceedings of the Fourteenth International Conference on Machine Learning, July 8-12, 1997, pp. 143-151.
65. Joachims T. Text categorization with Support Vector Machines: Learning with many relevant features. In,
Machine Learning: ECML-98, Tenth European Conference on Machine Learning, 1998, pp. 137-142.
66. Jardine N., and van Rijsbergen C.J. The use of hierarchical clustering in information retrieval. Information
Storage and Retrieval, 7, (1971), 217-240.
67. Joachims T., and Sebastiani F. (eds.). Automated text categorization (special issue), Journal of Intelligent
Information Systems, 18, (March-May 2002), 2-3.
68. Käki M. Findex: search result categories help users when document ranking fails. In, Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, Portland, OR, April 2-7, 2005, pp. 131-140.
69. Ko S.J.; Choi J.H.; and Lee J.H. Bayesian web document classification through optimizing association word. In,
Proceedings of the 15th International Conference in Applied Artificial Intelligence, Laughborough, UK, Lecture
Notes in Computer Science, 2718, (2003), 565-574.
70. Koch T.; Day M.; Brümmer A.; Hiom D.; Peereboom M.; Poulter A.; and Worsfold E. The role of classification
schemes in internet resource description and discovery. In, Work Package 3 of Telematics for Research project
Development of a European Service for Information on Research and Education (DESIRE) (RE 1004), 1999.
71. Kendall M. A New Measure of Rank Correlation. Biometrika, 30, (1938), 81-89.
72. Kendall M. Rank Correlation Methods. London: Charles Griffin & Company Limited, 1948.
73. Knepper M.M.; Fox K.L.; and Frieder O. Method for domain identification of documents in a document database.
United States Patent 20060206483.
74. Kules B.; Kustanowitz J.; and Shneiderman B. Categorizing web search results into meaningful and stable
categories using fast-feature techniques. In, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital
Libraries, Chapel Hill, NC, June 11-15, 2006, pp. 210-219.
75. Koller D., and Sahami M. Rule-based hierarchical document categorization for the World Wide Web. In,
Proceedings of the 14th International Conference on Machine Learning, 1997, pp. 170-178.
76. Kita K.; Sasaki M.; and Ying T.X. Rule-based hierarchical document categorization for the World Wide Web. In,
Asia Pacific Web Conference, 1998, pp. 269-273.
77. Kummamuru K.; Lotlikar R.; Roy S.; Singal K.; and Krishnapuram R. A hierarchical monothetic document
clustering algorithm for summarization and browsing search results. In, Proceedings of the 13th international
conference on World Wide Web, New York, NY, May 17-20, 2004, pp. 658-665.
78. Li X., and Calvo R.A. Hierarchical document classification using I bayes. In, 8th Australasian Document
Computing Symposium, CSIRO, Canberra, December 2003.
79. Leouski A.V., and Croft W.B. An Evaluation of Techniques for Clustering Search Results. Technical Report IR-
76, Amherst: Department of Computer Science, University of Massachusetts, 1996.
80. Larkey L., and Croft W.B. Combining classifiers in text categorization. In, Proceedings of the 19th International
Conference on Research and Development in Information Retrieval, 1996, pp. 289-297.
81. Li Q.; Chen X.; Bot R.S.; and Wu Y.B. Improving concept hierarchy development for web returned documents
using automatic classification. In, International Conference on Internet Computing, 2005, pp. 99-105.
82. Labrou Y., and Finin T. Yahoo! As an ontology: using Yahoo! Categories to describe documents. In, Proceedings
of the 8th International Conference on Information and Knowledge Management (CIKM-99), Kansas City, MO,
November 2-6, 1999, pp. 180-187.
83. Lewis D.D., and Gale W.A. A sequential algorithm for training text classifiers. In, Croft W.B. and van Rijsbergen
C.J. (eds.), Proceedings of the 17th ACM International Conference on Research and Development in Information
Retrieval (SIGIR ’94), Dublin, Ireland, 1994, pp. 3-12.
84. Lam W., and Ho C.Y. Using a generalized instance set for automatic text categorization. In, Proceedings of the
21st International Conference on Research and Development in Information Retrieval (SIGIR ’98), Melbourne,
Australia, 1998, pp. 81-89.
85. Liang J.Z. SVM multi-classifier and Web document classification. In, Proceedings of the 2004 International
Conference on Machine Learning and Cybernetics, Volume 3, August 26-29, 2004, pp. 1347-1351.
86. Li Y.H., and Jain A.K. Classification of text documents. The Computer Journal, 41, 8, (1998), 537-546.
87. Li Y., and Lan Z. A survey of load balancing in grid computing. Lecture Notes in Computer Science, 3314,
(2005), 280-285.
88. Li W.; Lee B.; Krausz F.; and Sahin K. Text classification by a neural network. In, Proceedings of the. 23rd
Annual Summer Computer Simulation Conference, Baltimore, MD July 22-24, 1991, pp. 313-318.
89. Lewis D.D., and Ringuette M. A comparison of two learning algorithms for text categorization. In, Third Annual
Symposium on Document Analysis and Information Retrieval, 1994, pp. 81-93.
90. Lehnert W.S.; Soderland S.; Aronow D.; Feng F.; and Shmueli A. Inductive text classification for medical
applications. Journal for Experimental and Theoretical Artificial Intelligence, 7, 1, (1995), 49–80.
91. Lewis D.D.; Schapire R.E.; Callan J.P.; and Papka R. Training algorithms for linear text classifiers. In,
Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, Zurich, Switzerland, August 18-22, 1996, pp. 298-306.
92. Lin S.H.; Shih C.S.; Chen M.C.; Ho J.M.; Ko M.T.; and Huang Y.M. Extracting classification knowledge of
internet documents with mining term associations: a semantic approach. In, Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval , Melbourne,
Australia, August 24-28, 1998, pp. 241-249.
93. Lewis D.D.; Yang Y.; Rose T.G.; and Li F. RCV1: A new benchmark collection for text categorization research.
Journal of Machine Learning Research, 5, (2004), 361-397.
94. MacLeod K. An application specific neural model for document clustering. In, Proceedings of the Fourth Annual
Parallel Processing Symposium, 1, 1990, pp. 5-16.
95. Marchionini G. Exploratory search: from finding to understanding. Communications of the ACM. 49 (4). April
2006. Pp 41-46.
96. McGarry K. A survey of interestingness measures for knowledge discovery. The Knowledge Engineering Review,
20, 1, (March 2005), 39-61.
97. Möller G.; Carstensen K.U.; Diekman B.; and Watjen H. Automatic classification of the World Wide Web using
Universal Decimal Classification. In, McKenna B. (ed.), 23rd International Online Information Meeting (London,
England), Oxford: Learned Information Europe, 1999, pp. 231-237.
98. Markov A., and Last M. A simple, structure-sensitive approach for web document classification. Advances in
Web Intelligence. Lecture Notes in Computer Science, 3528, (2005), 293-298.
99. Maarek Y.S., and Wecker A.J. The Librarian’s Assistant: automatically organizing on-line books into dynamic
bookshelves. In, Proceedings of Intelligent Multimedia Information Retrieval Systems and Management (RIAO
’94), New York, NY, October 11-13, 1994.
100. Nigam K.; McCallum A.; Thrun S.; and Mitchell T. Text classification from labeled and unlabeled documents
using EM. Machine Learning, 39, 2/3, (2000), 103-134.
101. Papka, R., and Allan J. Document classification using multiword features. In, Proceedings of the 7th International
Conference on Information and Knowledge Management, Bethesda, MD, 1998, pp. 124-131.
102. Pearson K. Mathematical contributions to the theory of evolution. III. Regression, heredity and panmixia.
Philosophical Transactions of the Royal Society of London, 187, (1896), 253-318.
103. Pierre J.M. On the automated classification of web sites. Linkoping Electronic Articles in Computer and
Information Science, 6, 1, (2001), 1-12.
104. Pollitt S. The key role of classification and indexing in view-based searching. 63rd IFLA General Conference and
Council. International Federation of Library Associations and Institutions (IFLA). Copenhagen, Denmark. 31
August – 5 September 1997.
105. van Rijsbergen C.J. Information Retrieval, 2nd Edition. London: Butterworths, 1979.
106. Riloff E. Little words can make a big difference for text classification. In, Proceedings of the 18th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval , Seattle, WA, July
9-13, 1995, pp. 130-136.
107. Riloff E. Using learned extraction patterns for text classification. In, Wermter S., Riloff E., and Scheler G. (eds),
Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Berlin:
Springer-Verlag, 1996, pp. 275-289.
108. Riloff E., and Lehnert W. Classifying texts using relevancy signatures. In, Proceedings of the Tenth National
Conference on Artificial Intelligence, 1992, pp. 329-334.
109. Riloff E., and Lehnert W. Information extraction as a basis for high-precision text classification. ACM
Transactions on Information Systems, 12, 3, (July 1994), 296-333.
110. Rocchio J.J. Relevance feedback in information retrieval. In, Salton G. (ed.), The SMART Retrieval System:
Experiments in Automatic Document Processing, Englewood Cliffs, NJ: Prentice-Hall, 1971, pp. 313-323.
111. Rodden K. About 23 million documents match your query… In, ACM Conference on Human Factors in
Computing Systems (ACM CHI’98), Los Angeles, CA, April 1998, pp. 64-65.
112. Romano NC, Donovan C, Chen H, and Nunamaker J. A Methodology for Analyzing Web-Based Qualitative Data.
Journal of Management Information Systems. 19(4). Spring 2003. Pp. 213 – 246.
113. Ruiz M., and Srinivasan P. Hierarchical text categorization using neural networks. Information Retrieval, 5, 1,
(2002), 87-118.
114. Schraefel MMC, Wilson M, Russell A, and Smith DA: mSpace: improving information access to multimedia
domains with multimodal exploratory search. Communications of the ACM. 49(4). April 2006. Pp. 47-49.
115. Siegel S., and Castellan N.J. Nonparametric Statistics for the Behavioral Sciences, 2nd edition. London: McGraw-
Hill, 1988.
116. Schutze H. Automatic word sense discrimination. Computational Linguistics, 24, 1, (1998), 97-123.
117. Sebastiani F. Machine learning in automated text categorization. ACM Computing Surveys, 34, 1, (2002), 1-47.
118. Sebastiani F. Text categorization. In, Alessandro Zanasi (ed.), Text Mining and its Applications to Intelligence,
CRM and Knowledge Management, Southampton, UK: WIT Press, 2005, pp. 109-129.
119. Shirazi B.A.; Kavi K.M.; and Hurson A. Scheduling and Load Balancing in Parallel and Distributed Systems.
Los Alamitos, CA: IEEE Computer Society Press, 1995.
120. Shafer K.E. Scorpion helps catalog the Web. Bulletin of the American Society for Information Science, 24, 1,
(October/November 1997), 28-29.
121. Spangler S and Kreulen J. Mining the Talk: Unlocking the Business Value in Unstructured Information. IBM
Press. 2008.
122. Spangler S, Kreulen JT, and Lessler J. Generating and Browsing Multiple Taxonomies Over a Document
Collection. Journal of Management Information Systems. 19(4). Spring 2003. Pp. 191 – 212
123. Spearman C. The proof and measurement of association between two things. American Journal of Psychology, 15,
(1904), 72–101. Reprinted in: The American Journal of Psychology, 100, 3/4, Special Centennial Issue, (Autumn –
Winter, 1987), 441-471.
124. Slonim N., and Tishby N. The power of word clusters for text classification. In, Proceedings of ECIR-01, 23rd
European Colloquium on Information Retrieval Research, Darmstadt, Germany, 2001, pp. 1-12.
125. Sun A.; Lim E.P.; and Ng W.K. Hierarchical text classification and evaluation. In, IEEE International
Conference on Data Mining (ICDM), San Jose, CA, Nov 29-Dec 2, 2001, pp. 521-528.
126. Svingen B. Using genetic programming for document classification. In, Proceedings of the 11th International
Florida Artificial Intelligence Research Society Conference (FLAIRS98), 1998, pp. 63-67.
127. Toth E. Innovative solutions in automatic classification: a brief summary. Libri, 52, 1, (2002), 48-53.
128. Thompson R.; Shafer K.E.; and Vizine-Goetz D. Evaluating Dewey Concepts as a Knowledge Base for Automatic
Subject Assignment. In, 2nd ACM International Conference on Digital Libraries, Philadelphia, PA, 1997, pp. 37-
46.
129. Vlajic N., and Card H.C. Categorizing Web pages using modified ART. In, Canadian Conference on Electrical
and Computer Engineering, Volume 1, 1998, pp. 313-316.
130. Wang Y.; Hodges J.; and Tang B. Classification of Web documents using a naïve Bayes method. In, Proceedings
of the 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI), 2003, p. 560.
131. Wei C-P, Chiang RHL, and Wu CC. Accommodating Individual Preferences in the Categorization of Documents:
A Personalized Clustering Approach. Journal of Management Information Systems. 23(2). Fall 2006. pp. 173 –
201.
132. Wei C-P, Hu PJ, and Le Y-H. Preserving User Preferences in Automated Document-Category Management: An
Evolution-Based Approach. Journal of Management Information Systems. 25(4). Spring 2009. pp. 109 – 143.
133. Willett P. Recent trends in hierarchic document clustering: A critical review. Information Processing and
Management, 24, 5, (1988), 577-597.
134. Worsfold E. Subject gateways – fulfilling the DESIRE for knowledge. Computer Networks and ISDN Systems, 30,
16, (30 September 1998), 1479-1489.
135. Wiener E.D.; Pedersen J.O.; and Weigend A.S. A neural network approach to topic spotting. In, Proceedings of
4th Annual Symposium on Document Analysis and Information Retrieval (SDAIR-95), Las Vegas, NV, 1995, pp.
317-332.
136. Wu Y.B.; Shankar L.; and Chen X. Finding more useful information faster from web search results. In,
Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, LA,
2003, pp. 568-571.
137. Yang Y. An evaluation of statistical approaches for text categorization. Journal of Information Retrieval, 1, 1-2,
(1999) 67-88.
138. Yang Y., and Liu X. A re-examination of text categorization methods. In, Hearst M.A., Gey F., and Tong R. (eds.),
Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information
Retrieval, Berkeley, CA: ACM Press, 1999, pp. 42-49.
139. Yang Y.; Slattery S.; and Ghani R. A study of approaches to hypertext categorization. Journal of Intelligent
Information Systems, 18, 2-3, (March, 2002), 219-241.
140. Zamir O., and Etzioni O. Web document clustering: a feasibility demonstration. In, Proceedings of the 21st Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne,
Australia, August 24-28, 1998, pp. 46-54.
141. Zamir O., and Etzioni O. Grouper: a dynamic clustering interface to Web search results. Proceeding of the Eighth
International Conference on World Wide Web, Toronto, Canada, May 1999, pp. 1361-1374.
142. Zamir O.; Korn J.; Fikes A.; and Lawrence S. Personalization of placed content ordering in search results. United
States Patent Application 0050250580. Patent ID EP 1782286A1. Issued May 9, 2007.
143. Zeng H.J.; He Q.C.; Chen Z.; Ma W.Y.; and Ma J. Learning to cluster web search results. In, Proceedings of the
27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval,
Sheffield, UK, July 25-29, 2004, pp. 210-217.
FIGURES
Figure 1: Results produced by exploratory search on ‘fishing’ using clusty.com
Figure 2: Results produced by exploratory search on ‘fishing’ using kartoo.com
Figure 3: Results produced by exploratory search on ‘US locations fishing’ using
clusty.com
Figure 4: Hits for ‘fishing’ by state, from a CDB populated with American cities
Figure 5: Population of categories with relevant documents
Figure 6: Screen from our software prototype, showing word suggestion tool to allow user
to select an expansion or disambiguation of the current search term
Figure 7: Computing hits for each category
Figure 8: Relative prevalence of the terms “smoothness”, “strength”, and [“wet” OR “damp”] in
various stone quarrying industry segments. Three bars are shown for each industry segment: from
left to right, they give the relative prevalence of “smoothness”, “strength”, and [“wet” OR
“damp”] in that segment.
Figure 9: Excel-based collapsible tree view provided for the exploration of
aggregate statistics (e.g. hits) per category in various taxonomies
Figure 10: Web-based collapsible tree view provided for the exploration of
aggregate statistics (e.g. hits) per category, in the UNSPSC taxonomy
Figure 11: Collaborative annotation interface for sharing of human observations on interesting
categories amongst a team: illustration of users sharing comments on possible applications of a
biodegradable molecule with foam reduction properties.
Categories
    Column Name            Data Type
    ParentID               Int(11) (Primary Key)
    CategoryID             Int(11)
    CategoryName           Char(255)
    Flag                   Int(11)
    DateCompleted          DateTime

CategoryAssignment
    Column Name            Data Type
    CategoryID             Int(11) (Primary Key)
    DateAndTimeAssigned    Timestamp
    IPAddressOfServant     Char(255)
    ServantComputerName    Char(255)

CategoryCompleted
    Column Name            Data Type
    CategoryID             Int(11) (Primary Key)
    DateAndTimeCompleted   Timestamp
    IPAddressOfServant     Char(255)
    ServantComputerName    Char(255)

Documents
    Column Name            Data Type
    DocumentID             Int(11) (Primary Key)
    DocumentURL            Text
    CategoryID             Int(11)
    DateCompleted          DateTime

Lexicon
    Column Name            Data Type
    WordSenseID            Int(11) (Primary Key)
    WordText               Char(255)

Words
    Column Name            Data Type
    DocumentID             Int(11) (Primary Key)
    WordSenseID            Int(11)
    WordPosition           Int(11)

Figure 12: Database tables used to represent and index documents in the CDB
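To make the role of these tables concrete, the sketch below builds the Documents, Lexicon, and Words tables from Figure 12 and computes a per-category hit count with a join. It is illustrative only: SQLite stands in for the MySQL-style Int(11)/Char(255) types shown in the figure, and the sample rows and query are assumptions rather than the prototype's actual code.

```python
import sqlite3

# In-memory sketch of three of the Figure 12 tables (names follow the figure;
# column types are simplified for SQLite).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE Documents (DocumentID INTEGER PRIMARY KEY, DocumentURL TEXT,
                        CategoryID INTEGER, DateCompleted TEXT);
CREATE TABLE Lexicon   (WordSenseID INTEGER PRIMARY KEY, WordText TEXT);
CREATE TABLE Words     (DocumentID INTEGER, WordSenseID INTEGER,
                        WordPosition INTEGER);
""")
# Two toy documents in category 7, indexed word by word through the Lexicon.
cur.executemany("INSERT INTO Lexicon VALUES (?, ?)",
                [(1, "fishing"), (2, "boat")])
cur.executemany("INSERT INTO Documents VALUES (?, ?, ?, NULL)",
                [(10, "http://example.com/a", 7),
                 (11, "http://example.com/b", 7)])
cur.executemany("INSERT INTO Words VALUES (?, ?, ?)",
                [(10, 1, 0), (10, 2, 1), (11, 1, 0), (11, 1, 1)])
# Hit count for 'fishing' in category 7: join Words -> Documents -> Lexicon.
# This is the kind of per-category aggregate the CDB reports.
cur.execute("""
    SELECT COUNT(*) FROM Words w
    JOIN Documents d ON d.DocumentID = w.DocumentID
    JOIN Lexicon   l ON l.WordSenseID = w.WordSenseID
    WHERE d.CategoryID = 7 AND l.WordText = 'fishing'
""")
hits = cur.fetchone()[0]
```

Storing one row per word occurrence (the Words table) is what makes both hit counts and per-category word totals cheap to aggregate with a single query.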
Request
    Column Name             Data Type
    RequestID               Int(11) (Primary Key)
    UserID                  Char(255)
    TopCategoryID           Int(11)
    SearchPhrase            Char(255)
    DateAndTimeRequested    TimeStamp
    DateAndTimeAssigned     DateTime
    DateAndTimePopulated    DateTime

RequestedData
    Column Name             Data Type
    RequestID               Int(11) (Primary Key)
    CategoryID              Int(11)
    SearchPhrase            Char(255)
    WordCount               Int(11)
    DocumentCount           Int(11)
    Servant                 Char(255)
    DateAndTime             DateTime

Figure 13: Database tables used to store and cache query results from the CDB
User
    Column Name         Data Type
    UserID              Int(11) (Primary Key)
    Username            Char(255)
    Company             Char(255)
    Email               Char(255)

Project
    Column Name         Data Type
    ProjectID           Int(11) (Primary Key)
    Project Name        Char(255)
    CreatedByUserID     Int(11)

Annotation
    Column Name         Data Type
    Annotation_ID       Int(11) (Primary Key)
    Category_ID         Int(11)
    Project_ID          Int(11)
    CreatedByUserID     Int(11)
    Application         Varchar(500)
    Notes               Varchar(600)
    Review              Varchar(60)
    Rank                Int(11)
    DateAndTime         DateTime

SharedAnnotation
    Column Name         Data Type
    AnnotationID        Int(11) (Primary Key)
    SharedWithUserID    Int(11)

Figure 14: Database tables used for collaborative annotation features of the CDB
Figure 15: Bubble chart showing the integration of aggregate data gleaned from text with
structured data. The X-axis represents the asset turnover for the industry (i.e. category);
the Y-axis is the relative prevalence of the search term “biodegradable” in each category.
Figure 16: Sample of publicly traded companies in the plastics packaging industry,
obtained from the SEC, after a CDB exploration revealed that documents in the plastics
packaging industry frequently mention a compound being marketed by a chemical
industry salesperson.
TABLES
Type of Taxonomy              Examples

Product hierarchies           United Nations Standard Products and Services Code (UNSPSC) *
                              United Nations Central Product Classification (CPC)
                              United States Patent Classification (USPTO USPC) *
                              International Patent Classification (IPC)
                              Proprietary corporate product catalogues (e.g. Amazon, Wal-Mart, Sears, or any
                              other catalogue defined by any large or small company)

Industry taxonomies           North American Industry Classification Scheme (NAICS) †
                              United States Standard Industrial Classifications (SIC) †
                              International Standard Industrial Classification (ISIC)
                              SITC3 (Standard International Trade Classification)

Company classifications       Fortune 500 † and Fortune 1000
                              S&P 500 †
                              Inc. 500 † and Inc. 5000
                              Entrepreneur Magazine’s Franchise 500
                              Internet Retailer 500

Activity taxonomies           WordNet (verb relationships) †
                              United States Bureau of Labor Statistics Standard Occupational Classification
                              System (SOC) †

Place (Location) taxonomies   United States General Services Administration Geographic Locator Codes
                              (US GSA GLC) / Geographic Names Service *
                              United States Direct Marketing Areas (DMA) †
                              Getty Thesaurus of Geographic Names (TGN)

Time taxonomies               ISO †

Topic taxonomies              Library of Congress Classification system (LoC) †
                              UK Joint Academic Coding System (JACS) †
                              UK Higher Education Standard Authority Coding (HESACODE) †

Medical taxonomies            International Classification of Diseases (e.g. ICD10) †
                              International Classification of Primary Care (ICPC)
                              Current Procedural Terminology (CPT)
                              US FDA Classification of Medical Devices †

Table 1: Popular Classification Schemes
† indicates that the taxonomy (category names and relations) was imported into our prototype system,
and a random selection of approximately 10% of categories was populated with documents.
* indicates that the taxonomy was imported into our prototype system and all categories were populated
with documents.
Table 2: Absolute Hits for a Number of Search Terms, by Document Category
Type of Taxonomy                 Examples of data indexed using standard taxonomies

Product hierarchies              Sales data for each product category, from an internal company database,
                                 indexed by product category (e.g. UNSPSC, or UCC Stock Keeping Unit [SKU]).

Industry taxonomies              Industry size figures from the Bureau of Economic Analysis (BEA.gov), or
                                 from the Internal Revenue Service (IRS.gov), indexed by NAICS code.

Company classifications          Company profit figures, from the Securities and Exchange Commission
                                 (SEC.gov), indexed by NAICS code.

Activity / Employee taxonomies   Salary data for each profession, from the Bureau of Labor Statistics
                                 (BLS.gov), indexed by SOC occupation classification.

Place (Location) taxonomies      Population, land area, and other geographic data from the United States
                                 Geological Survey (USGS.gov), indexed by United States General Services
                                 Administration Geographic Locator Code (US GSA GLC).

Time taxonomies                  Sales data for each date, from an internal company database, indexed by
                                 time.

Topic taxonomies                 Enrollment data for each academic subject, from the National Center for
                                 Education Statistics (NCES.ed.gov), indexed by educational field.

Medical taxonomies               Infection rate, for each illness, in each area, indexed by ICD9 or ICD10
                                 disease code.

Table 3: Structured Data Sources for Various Taxonomies
Corporate          Total US            CDB Search      Pearson’s r: per-capita franchise    Population
Franchise          Franchise Outlets   Terms Used      outlets per state vs CDB search      Correlation 45
                                                       term frequency for state 44

McDonalds          11,318              “burger”        -0.25                                0.98
                                       “hamburgers”    -0.15
Pizza Hut          5,676               “pizza”          0.04                                0.34
KFC                4,378               “chicken”       -0.17                                0.95
Intercontinental   3,023               “hotels”        -0.34                                0.93
Starbucks          9,869               “coffee”        -0.21                                0.90
RE/MAX             4,628               “property”       0.15                                0.95
Supercuts          1,644               “hair”          -0.35                                0.79
Jackson Hewitt     2,475               “tax”            0.10                                0.87
Carlson Wagonlit   340                 “travel”         0.26                                0.87
                                       “flight”         0.13
Jiffy Lube         1,923               “car”           -0.10                                0.82
Miracle Ear        1,349               “hearing”        0.02                                0.89

Table 4: Summary of Experimental Results – Selected Population-Sensitive Industries
44 For each study shown in Table 4, fifty-one (51) pairs of data points – one pair of rankings for each U.S. state, plus the District of Columbia – were compared. For 51 data points, the Pearson r required for statistical significance at the weaker 90% confidence level is > 0.231, and at the stronger 99% confidence level the threshold required is > 0.354. Therefore, in all cases where r > 0.354 we can conclude that it is highly unlikely that the correlation between the two rankings being compared occurred by chance.
45 ‘Population Correlation’ is the Pearson r-value found when correlating the ranking of the states by the US franchise outlets in that industry with the ranking of the states by their population. Population figures were obtained from the US Census Bureau.
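The significance check described in footnote 44 can be sketched in plain Python. The `pearson_r` helper and the toy rankings below are illustrative assumptions; only the thresholds (0.231 at the 90% confidence level, 0.354 at the 99% level, for 51 paired observations) come from the footnote.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Thresholds quoted in footnote 44 for n = 51 (50 states plus DC).
WEAK_90, STRONG_99 = 0.231, 0.354

# Toy rankings only: two rankings that agree perfectly give r = 1.0.
state_rank_external = list(range(1, 52))   # ranking from independent data
state_rank_cdb = list(range(1, 52))        # ranking from CDB term frequency
r = pearson_r(state_rank_external, state_rank_cdb)
significant_99 = r > STRONG_99             # correlation unlikely to be chance
```

In practice one would substitute the two real 51-element state rankings being compared; any r above 0.354 clears the stronger threshold in the footnote.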
Industry            External Data Used                           CDB Search Term Used   Pearson’s r 46   Population Correlation 47

Wind energy         DoE wind generating capacity                 “windy”                 0.07             0.10
                    NREL wind resource availability              “windy”                 0.25            -0.36
Solar energy        Thermomax solar energy (BTUs)                “warm”                 -0.11            -0.09
                                                                 “sunny”                -0.28
                                                                 “sunshine”              0.22
Rain                NOAA precipitation per square mile 2008      “rain”                  0.29            -0.02
                    NationalAtlas.gov 1961-1990                  “rain”                  0.27             0.01
Fishing             USFWS Non-resident fishing licenses sold     “fishing”               0.46             0.19
Coal                NMA Number of coal mines                     “coal”                  0.75             0.18
                    NMA Coal production                          “coal”                  0.74             0.08
Gemstone            NMA Gemstone production                      “gemstone”              0.30             0.09
Gold                NMA Gold revenues                            “gold”                  0.29            -0.06
Forests             NFS Forest area                              “forest”                0.30             0.18
Oil                 EIA Oil production                           “oil”                   0.39            -0.04
Mountain climbing   USGS Elevation Data                          “mountain climbing”     0.65            -0.10
Eco-tourism         USBLS Number of eco-tourism employees        “ecotourism”            0.39             0.35
Gaming              USBLS Number of game dealers                 “gambling”              0.29             0.32

Table 5: Summary of Experimental Results – Non-Population-Sensitive Industries
46 Pearson’s r here is the correlation obtained by comparing the state ranking using external data to the CDB ranking of states by search term frequency (search term hits per 1,000 words).
47 ‘Population Correlation’ is the Pearson r-value found when correlating the ranking of the states from the external data with the ranking of the states by their population. Population figures were obtained from the US Census Bureau. Note that the eco-tourism and gaming industries display some population sensitivity, though much milder than the population-sensitive industries studied in Table 4.
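The metric defined in footnote 46 (search term hits per 1,000 words of category text) is simple to compute once per-category hit and word totals are available, for example from the WordCount field of the RequestedData table in Figure 13. The per-state counts below are hypothetical numbers chosen purely for illustration.

```python
def hits_per_thousand_words(term_hits, total_words):
    """Relative term frequency as in footnote 46: hits per 1,000 words.
    Categories with no indexed text are given a frequency of zero."""
    if total_words == 0:
        return 0.0
    return 1000.0 * term_hits / total_words

# Hypothetical (hits, total words) per state for the term 'coal'.
state_counts = {"WY": (180, 250_000), "WV": (140, 300_000), "FL": (12, 400_000)}
freq = {s: hits_per_thousand_words(h, w) for s, (h, w) in state_counts.items()}
# States are then ranked by frequency, and that ranking is compared against
# the ranking from the external data, as in Table 5.
ranked = sorted(freq, key=freq.get, reverse=True)
```

Normalizing by total words (rather than using absolute hits) is what keeps the CDB ranking from simply tracking how much text each category happens to contain.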