
white paper

Conceptualizing LSI-Based Text Analytics

John Felahi, Senior Vice President of Products, Content Analyst, LLC.

Trevor J. Morgan, Ph.D., Relevance Analytics


A Better Way to Find Information

Corporations are undeniably overloaded with information. They thrive on the innovation and intellectual property captured in their corporate documents. However, when document sets grow at explosive rates like they have over the past five years, critical information becomes buried and lost. The unfortunate reality is that lost information is useless information. Companies lose a key competitive advantage by not being able to find and exploit valuable information embedded within large document sets.1

The traditional solution to this problem is Boolean (keyword) search or keyword-based document tagging and organization. These Boolean search and analytics tools work well when the user knows exactly which word or words the desired documents must contain, but that is rarely the case. Users can seldom anticipate the specific words or phrases a document contains, so finding the right query is laborious and often futile.

The goal, then, is to make information more findable by complementing keyword search systems with advanced text analytics technologies. Advanced text analytics encompasses the complex convergence of linguistics, mathematics, probability theory, and software development. Advanced text analytics software employs sophisticated algorithms in an attempt to “read” document content and figure out what that content actually means. These solutions provide users with a rich array of features, including concept-based search and document organization functions. Advanced text analytics software tries to determine document content and meaning in the same way humans do, except on a scale of volume far beyond human capabilities. The purpose is not to replace humans but rather to refocus humans on what they do best.2


Differentiators in Text Analytics

Text analytics solutions distinguish themselves in two ways:

1. The manner in which the engine discovers the meaning (concepts) of text in a document set. This is essentially the content discovery aspect of text analytics.

2. The variety of features that end users can leverage once the engine has discovered all document content.

The importance of the content discovery phase cannot be overstated. The following sections focus on how different technologies approach the daunting challenge of discovering the meaning embedded within text documents. The discussion traces the natural evolution of indexing technologies to the present generation of advanced text analytics engines, which deal with document conceptuality. It also introduces an extremely viable advanced text analytics technology known as Latent Semantic Indexing (LSI), discusses how LSI functions, and compares it to alternative technologies. LSI is a very powerful and scalable text analytics technology, but the key to unleashing its potential is understanding how it works and what problems it solves.

Content Discovery is the Major Differentiator

As mentioned earlier, all text analytics technologies need to discover the contents of the documents presented to them. This indexing process involves contextual term analysis and is the first step in enabling users to work with the information in a document set. Without this step, the text analytics engine cannot possibly execute a search or categorize documents: how could it when the contents of the documents are unknown? The way in which an analytics engine discovers document content is critical to overall functionality and is a major distinguishing factor.

Many competing indexing technologies have been developed and brought to market. Comprehensively exploring all the different technical solutions and how they function would be time consuming. However, a quick look at the evolution of text analytics is instructive in comparing the relative strengths of the predominant text analytics approaches. When you view text analytics technologies from the most general perspective, you will find that each platform falls into one of the following types:

»» Lexical, focusing primarily on the linguistic and semantic indicators in text.

»» Probabilistic, focusing primarily on the statistical potentialities in text.

»» LSI-based, focusing primarily on the holistic co-occurrences of unique terms in text.

»» Hybridization, combining elements of any of the previous three.

The First Generation—Term Occurrence and Keyword Indexing

Boolean-based search engines initially employed a simple linguistic method to index documents. The semantic discovery these platforms performed amounted to counting the individual words (and word frequencies) found in a document set. The resultant index, therefore, was a comprehensive term look-up list with varying ranks applied depending on term frequency within the document set. Over time, content enrichment methods have been added, but these systems remain lexically based by nature.

With these systems, when a user submits a search term or phrase (known as the query), the search engine compares the query to the contents of the look-up list. A match occurs if a document in the index meets the conditional logic of the query. For example, a query for “dog” returns all documents containing the word “dog” one or more times in a ranked order—documents with many instances of the word rank higher in the results list than ones with fewer instances. A query for “dog NOT cat” returns all documents containing the word “dog” but not the word “cat” because of the conditional logic conveyed by NOT. The advent of this keyword indexing and search methodology revolutionized the way users found the documents they were searching for. Keyword technology still solves many problems in information management today, especially when the precise query is known to the user. In fact, when the user is looking for the presence (or absence) of very specific words or phrases, keyword search is still the most effective discovery tool.
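To make the mechanics concrete, here is a minimal sketch of such a term look-up list and the two queries just described. The corpus, function names, and ranking scheme are invented for illustration; the ranking is plain term frequency rather than any production scoring method.

from collections import Counter, defaultdict

# Toy corpus: first-generation keyword indexing amounts to a term
# look-up list with per-document frequencies (an inverted index).
docs = {
    "doc1": "the dog chased the other dog across the yard",
    "doc2": "the cat watched the dog from the fence",
    "doc3": "a report on feline behavior and cat health",
}

# Build the look-up list: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().split()).items():
        index[term][doc_id] = freq

def keyword_search(term, exclude=None):
    """Return docs containing `term`, ranked by frequency, optionally
    dropping docs that also contain `exclude` (a Boolean NOT)."""
    hits = dict(index.get(term, {}))
    if exclude:
        hits = {d: f for d, f in hits.items() if d not in index.get(exclude, {})}
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)

print(keyword_search("dog"))          # doc1 ranks above doc2 (2 vs. 1 occurrences)
print(keyword_search("dog", "cat"))   # "dog NOT cat" returns doc1 only

Note that the presence of the query word is the only evidence the engine has; nothing in this structure captures what a document is about.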

What to Like About Keyword Technology

»» Simple approach to indexing the content of documents.

»» Easy to understand resultant document list—prevalence of query terms affects rank.

»» A good way to find very specific or uniquely worded information in smaller document sets.

While technically a linguistic approach (it did, after all, focus on the language within the indexed documents), keyword indexing was not designed to discover word and document meaning. Furthermore, it can be prone to false positives and negatives. One problematic issue with keyword indexing and search was that the presence of the query word (or words) in a document did not guarantee that document’s relevance.

For example, a user looking for documents about “fraud” quickly finds that many documents contain the word “fraud” yet are not actually about fraudulence. The user is then forced to construct more and more complex queries strung together with conditional operators (such as AND, OR) in order to find what is truly being sought within the constraints of the search engine. The problem is that most of the time what is being sought is not a word or series of words at all. Users really want to find documents the way their own brains work with information every hour of every day: they want to find concepts without having to know the exact terms required to convey that meaning to the analytics engine.

Shortcomings of Keyword Technology

»» Likelihood of introducing false positives and negatives with vague queries.

»» Overly complex query construction when searching for more expansive ideas or themes.

The Next Generation: Linguistic Analysis and Indexing

A concept is different from a word or series of words, no matter how much conditional logic is thrown in to make the keyword query more nuanced. According to the online American Heritage Dictionary of the English Language, a concept is “a general idea or understanding of something.” In a document, a concept might be an expressed idea or thematic content employing any number of different words to articulate it. A word is a unique entity with a finite number of restricted meanings. A concept is a larger idea not restricted to any particular terminology in order to express it.

To distinguish between the two, consider the word (or keyword) “music” versus the idea of auditory stimulation that excites the senses and the mind. The concept of music could encompass thousands of different ideas—rock and roll, blues, Woodstock, Beethoven, Justin Bieber. Undeniably, human thoughts occur in the form of ideas and concepts most of the time, not specific words. Moreover, a keyword or series of keywords will rarely express every facet of a concept. Keyword search is therefore inherently inadequate for most users’ needs when the “right” query is unknown. This realization spurred researchers and software developers past first-generation keyword text analytics. The race was on to figure out how to find the concepts embedded within documents, the “aboutness,” rather than just the individual words that composed them.3

The next generation of text analytics platforms continued to rely on the analysis of language to find conceptual content within documents, just as earlier keyword technologies had. These linguistic indexing technologies, though, went beyond the keyword lookup process. They incorporated algorithms and ancillary reference tools in order to interpret the complexities of language found within documents. Such lexical analytics software came pre-programmed with the static rules of language (grammar) and word use (semantics). Reference cartridges or modules such as dictionaries and thesauri fed into this linguistic indexing methodology.

This approach was certainly more sophisticated than keyword indexing, but the problem was and still is that the rules and conventions of language—grammar, semantics, and colloquial usage—are so fluid and changeable that early indexing software either could not derive concepts accurately and effectively, or could not keep up with the ever-changing dynamics of modern language. In the 21st century, a word can come into being overnight, or an existing word can be imbued with a vastly different meaning and propagated around the world within hours. Linguistic indexing engines, relying on laborious human pre-programming and updating, could not keep pace.

Another problem with the linguistic approach is that it was language dependent, not language agnostic. A dictionary is a reference tool for words in a given language, so if a linguistic indexing engine encountered a document in a language not supported by its reference dictionary, it could not derive meaning from that document. Some other technique or technology was required to resolve these shortcomings.

The Irony Is…

Language, when viewed from a linguistic perspective, seems hopelessly complex and impenetrable due to its rich variety and dynamic state. Language, when treated from a mathematical perspective, becomes elegantly understandable. The importance of mathematics in language analysis continues to gain traction.4

Advanced Text Analytics: Mathematical Analysis and Indexing

In a very counter-intuitive fashion, researchers and innovators turned away from the analysis of language rules in order to figure out conceptuality within text. Instead, they approached the content indexing problem from a mathematical perspective. Employing an impressive array of mathematical approaches and maneuvers during the indexing process, text analytics vendors were able to create even more advanced text analytics software. No longer were indexing engines dependent on static linguistic references necessitating frequent updating as language usage changed.

Advanced text analytics engines now can rely on statistical analyses of probable meaning within a document—known as a probabilistic indexing approach—or on linear algebraic analyses of total word co-occurrence and correlation—known as an LSI-based indexing approach—to figure out the concepts contained within a document. Other approaches include a hybridization of these two techniques. With these math-based advanced analytics techniques, conceptual search and classification across large volumes of documents is possible with a very high degree of reliability, flexibility, and adaptability. Sophisticated text analytics has finally arrived.

The technology behind probabilistic text analysis leverages research based in statistical computations. It builds upon the ideas of probable confidence in an assumption and the continuous updating of that confidence based on evidence. Applied to text analysis, a probabilistic approach—relying on algorithms rooted in statistical analysis—analyzes local term patterns and computes probable or likely meanings. These calculations in part depend on previous assumptions of meaning, which means that faulty assumptions introduce error into the textual analysis. The length of the text also influences the accuracy of probabilistic indexing, with shorter texts presenting a particular problem in conceptual derivation.5
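As a rough illustration of that confidence-updating idea, the short sketch below applies Bayes' rule to a hypothetical two-topic question (is this document fraud-related?) using invented prior and likelihood values. It is a generic statistical example, not a description of any particular probabilistic engine.

def update(prior, p_term_if_topic, p_term_if_not):
    """One Bayesian update: revise confidence in the topic hypothesis
    after observing a single term in the document."""
    on_topic = prior * p_term_if_topic
    return on_topic / (on_topic + (1.0 - prior) * p_term_if_not)

# Hypothetical evidence: observed terms with assumed probabilities of
# appearing in fraud-related vs. unrelated documents.
evidence = [
    ("invoice", 0.40, 0.10),
    ("transfer", 0.30, 0.15),
    ("meeting", 0.20, 0.25),
]

for prior in (0.50, 0.05):          # a reasonable prior vs. a faulty one
    belief = prior
    for term, p_topic, p_other in evidence:
        belief = update(belief, p_topic, p_other)
    print(f"prior={prior:.2f}  posterior P(fraud-related)={belief:.2f}")

A faulty starting assumption skews the final estimate, and the fewer terms a short document supplies, the less evidence there is to correct it.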

Latent Semantic Indexing, or LSI, is a linear algebraic approach to deriving conceptuality out of the textual content of a document. LSI uses sophisticated linear algebraic computations in order to assess term co-occurrence and contextual associations. While the scale of calculations “under the hood” is extensive, the overall approach can be explained in comprehensible and non-technical language. Once understood, the utility of the LSI technique and the myriad features it facilitates becomes quite apparent.

LSI and Concept-Based Text Analytics

In order to figure out what concepts are contained within a document, an LSI-based text analytics engine first must acquire an understanding of term interrelationships. Keep in mind that, as a math-based indexing technique, an LSI engine does not have ancillary linguistic references or any pre-programmed understanding of language at all, so this understanding is mathematical rather than linguistic. Using algorithms based in linear algebra, the LSI technique generates this understanding by assessing all the words within a document and within an entire document set. The engine then calculates word co-occurrences across the entire document set—accounting for different emphases on certain word prevalence—to figure out the interrelationships between words and the logical combinations of word co-occurrence that lead to comprehensible concepts. This ability to derive conceptuality from text is one of LSI’s most valuable commercial traits.6
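To make the pipeline more tangible, here is a minimal sketch of the general LSI approach: build a weighted term-document matrix, reduce it with a truncated singular value decomposition (the linear algebraic step at the heart of published LSI work), and compare documents and queries in the resulting concept space. It assumes the open-source scikit-learn library and an invented five-line corpus, and it illustrates the general technique rather than CAAT's proprietary engine.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: three documents about dogs, two about finance.
docs = [
    "The vet gave the dog a vaccine and checked the puppy's health.",
    "Dog owners rely on obedience training to manage puppy behavior.",
    "Obedience training classes help a young canine learn commands.",
    "Quarterly revenue grew as the company expanded into new markets.",
    "Analysts reviewed company earnings and quarterly market expansion.",
]

# Step 1: weighted term-document matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Step 2: truncated SVD projects documents into a low-dimensional concept space.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)

# Step 3: a conceptual query, compared by cosine similarity in concept space.
query_vector = svd.transform(vectorizer.transform(["canine health"]))
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Dog-related documents should rank highest, even those sharing no query words.
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:+.3f}  {doc}")

Because the query and the documents are compared in concept space rather than by literal term overlap, a document about obedience training can match a query about canine health even though the two share no vocabulary.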

To state it another way, we can compare LSI to human thought and communication. A human must use logical and accepted word combinations in order to convey a thought, regardless of the language used. Too many illogical word co-occurrences create incomprehensible gibberish. LSI uses advanced math to figure out these inherent combinations based on the documents themselves, allowing it to respond to a conceptual query with documents containing the same concept—again, not the same words, but the same or similar concept. In a way, LSI mimics the human brain in its attempt to make sense out of word co-occurrences and figure out what text really means. If it cannot figure out what the words mean, that probably indicates that the word combinations are meaningless. As Bradford states, the concepts derived from LSI “correspond remarkably well with judgments of conceptual similarity made by human beings.”7

LSI is not new technology. As a matter of fact, it uses individual mathematical maneuvers that have been known to scientists and mathematicians for decades. In the 1980s, a group of Bellcore scientists applied these mathematical principles to their research in language and subsequently patented the LSI technique. The technology changed hands in the 1990s and again in the first decade of this century, until a new organization—Content Analyst Company—was created in 2004 to advance the technology in several markets. Along the way, numerous other patents have been granted around the original LSI approach, allowing it to grow into a full-blown advanced text analytics platform.


LSI and Some Misconceptions

Throughout the different stages of LSI development and evolution, the challenges which its inventors and developers had to overcome gave rise to some misconceptions about the LSI indexing technique. Most of these misconceptions derive from its earliest days, when the technique was just beginning to evolve. Some of these misconceptions include:

»» LSI is slow and does not scale

»» LSI is expensive to implement and maintain

»» LSI is not defensible

»» LSI does not differentiate semantic nuances

»» LSI cannot replace human inspection of documents

One of the biggest misconceptions about LSI technology is that it is slow and non-scalable when presented with large volumes of documents. We can trace this misconception back to the days of less powerful, more expensive hardware. Sometimes a technology is ahead of its time and uses techniques that the surrounding hardware has to catch up with. The reality is that LSI was invented and patented at a time when the sophisticated math the technique requires consumed vast resources on the limited hardware available. Hardware constraints admittedly slowed down not only the indexing process but also the text analytics functions carried out post-indexing.

As already touched upon, though, the rapid reduction in hardware costs and huge gains in performance witnessed over the past five years (resulting in inexpensive many-core processors and cheap memory) have eliminated these problems. Not only does an LSI engine now have vast hardware resources available to it on a wide range of servers, but its ongoing evolution has resulted in distributed indexing capabilities and load-sharing deployments. Concept searches typically return results in under a second, and hundreds of thousands of documents can be classified, organized, and tagged in mere minutes, as opposed to the days or weeks humans would need to assess large volumes of information. LSI performs its functions rapidly on very affordable hardware.8

The associated perception that LSI is expensive to install and maintain is refuted by the same explanation. Cheap hardware, extensible features, and deployment best practices learned along the way all contribute to an economical answer for anybody’s advanced text analytics needs. For the value it provides, LSI is a compelling indexing technology for concept-based analytics.

Microchip flashback: In 1990, Intel® introduced the 33 MHz 486 microprocessor chip with a processing speed of 27 MIPS. Today’s Intel Xeon® chip is capable of 4.4 GHz, which is exponentially more powerful than the state-of-the-art 486 was at the time. In fact, the Xeon is 133 times faster than its 1990 ancestor, the 486 chip.


Another misconception plays more upon human fears of automation by questioning how defensible the results of LSI indexing and concept-based analytics are. Humans are always suspicious when “the machine” encroaches upon tasks and abilities thought to be best performed by intelligent, well-trained people. Because LSI effectively removes humans from the process of reading and assessing the content of documents, critics can easily play upon the fear of the unknown. After all, if no humans read all the documents, who really knows what’s in them?

This fear is understandable but has no basis in research or documented observation. The algorithms and proprietary technology that power LSI indexing are well documented and grounded in the principles of advanced mathematics. The reality is that LSI’s approach, driven by sophisticated analysis of total word co-occurrence, can be defended from both a mathematical and a linguistic perspective.

A particularly erroneous claim is that LSI cannot detect word similarities or semantic variations. This misconception insists that LSI cannot distinguish between “cool” as an indicator of temperature versus “cool” as a qualitative judgment of relevance. This claim is patently false. As a matter of fact, LSI is ideal for penetrating the mysteries of semantics—including synonymy and polysemy—and based upon its core approach (term co-occurrence) actually figures out semantic quandaries in the same way humans do.9

If one person says to another, “this is cool,” the recipient might not immediately understand what is being indicated, especially if the speaker has just touched something that might actually be cold. The hearer might ask for clarification with a simple “What is?” The first person might then elaborate with, “This bronze statue is really cool. It’s quite post-modern.” With the additional accompanying words in the follow-up, the speaker provides enough term co-occurrence for the hearer to understand the meaning, which has nothing to do with temperature. Conversely, a metalsmith casting the same bronze statue might also say “this is cool” to refer to its temperature, indicating it is ready to be handled without getting burned. LSI analysis of text works exactly like this and mirrors the human ability to interpret meaning based on term co-occurrence.
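As a toy illustration of this point (again assuming scikit-learn and an invented corpus; with so few documents the numbers are only suggestive), the sketch below places two “this is cool” snippets into an LSI concept space built from temperature-related and style-related documents, and lets the surrounding co-occurring terms pull each snippet toward the appropriate sense.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented corpus: "cool" as temperature vs. "cool" as an aesthetic judgment.
temperature_docs = [
    "the metal is cool enough to touch once the casting temperature drops",
    "let the metal cool before you handle the hot casting",
    "the forge stays hot so the smith waits for the casting to cool",
]
style_docs = [
    "the gallery crowd thought the post-modern sculpture was cool",
    "critics called the new bronze statue a cool piece of post-modern art",
    "a cool, stylish post-modern design that the art critics admired",
]
corpus = temperature_docs + style_docs

vectorizer = TfidfVectorizer(stop_words="english")
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(vectorizer.fit_transform(corpus))

temperature_centroid = doc_vectors[:3].mean(axis=0, keepdims=True)
style_centroid = doc_vectors[3:].mean(axis=0, keepdims=True)

probes = [
    "this casting is cool now, you can handle it",            # temperature sense
    "this bronze statue is really cool, quite post-modern",   # aesthetic sense
]
for probe in probes:
    vec = svd.transform(vectorizer.transform([probe]))
    t = cosine_similarity(vec, temperature_centroid)[0, 0]
    s = cosine_similarity(vec, style_centroid)[0, 0]
    print(f"temperature={t:+.2f}  style={s:+.2f}  <- {probe!r}")

The word “cool” appears in every document, so it carries no disambiguating weight on its own; the neighboring terms (“casting” and “handle” versus “statue” and “post-modern”) are what move each probe toward the correct sense.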

A final misconception involves the question of technology replacing human assessment of document content. It is human nature to resist technologies and processes that we don’t fully understand or that we feel are replacing us, but that does not mean that the technology itself is not effective or appropriate to implement. With the nearly exponential explosion in volume of enterprise data, technology must replace the inadequate solutions provided by earlier text analytics techniques and expensive human activity. The knowledge management market has indeed reached that inflection point where inexpensive hardware costs coupled with the advanced capabilities of CAAT™ create a compelling counter-argument for those who are technophobic.

LSI and CAAT™

As with all other vendors of search and text analytics technologies, Content Analyst Company has had to focus on the tenets of precision, speed, and flexibility. Overcoming the inherent obstacles found within text that distract CAAT™ and hamper its ability to determine document conceptuality was also a necessity. For example, filtering out header and footer information in emails is necessary in order to extract the useful authored content in these types of documents. Considerable research has gone into preparing document text for more precise conceptual analysis.

In the earlier days of the technology, the inventors and developers had the additional problem of less powerful but much more expensive hardware. The math required to perform LSI indexing is not insignificant, so the workstations and servers of the 1990s and early 2000s had to be robust, with as much memory (RAM) and CPU horsepower as possible. Furthermore, the 32-bit processors and operating systems prevalent at that time could not address large amounts of RAM. Until the inflection point of more powerful but less expensive hardware arrived in the mid-2000s, the ongoing development of the CAAT engine focused on refining the code base for more accurate and speedier functionality.

Distributed subsystem deployment and text filtering capabilities were also incorporated along the way, the latter of which suppresses extraneous or “garbage text” during indexing. Finally, CAAT evolved over the years to encompass not only traditional search (concept search) and concept-based document classification (multiple techniques and optimizing algorithms), but also dynamic clustering, document summarization, primary language identification, and advanced text comparisons (for thread detection in emails and for identifying duplicate text). CAAT is now an advanced text analytics platform with dozens of discrete analytical applications which developers can assemble in combination to creatively add concept-based text analytics to larger software solutions.

Key Capabilities

The key features of CAAT include concept-based search and document organization/classification. Concept search and concept-based document classification are far more powerful than keyword-based approaches, for reasons discussed previously. Now, the user does not have to know the “right” words to use when submitting queries, and documents that are related to each other conceptually can be grouped together regardless of whether they share the same terminology.

Document classification is particularly attractive for software vendors who need workflow automation and document routing within their solutions, such as enterprise content management, enterprise archiving, compliance, and e-discovery vendors. The ability to rapidly organize large volumes of documents based on their relevance to each other or to an overarching category reduces or eliminates costly human document inspection. With the reduction of human document inspection come the benefits of more precise classification results. Properly trained text analytics software such as CAAT assesses the conceptual and thematic content of documents more objectively and consistently—it does not get tired (and therefore increasingly inconsistent), and it does not get tripped up over the interpretive nuances of language that plague human inspection.

Because of the importance of classification, CAAT includes two major ways to organize documents: clustering and categorization. These two modes of document classification differ in the amount of human intervention required to establish and define the organizational structure, known as the taxonomy. In theory, a taxonomy is nothing more than the discrete organizational units—arranged in either a flat or hierarchical manner—into which documents can be placed, along with the rules dictating the type of content a document must contain in order to qualify for a category. In practice, taxonomy development and maintenance is an enormous undertaking for enterprises, requiring highly specialized knowledge workers (known as taxonomists) and their support staff. CAAT’s ability to cluster documents into an automatically generated taxonomy, or alternately to accommodate the more refined process of human-trained document categorization, means that nearly any automated workflow requiring classification can be supported.

The value propositions for these two classification methods are compelling. Automated clustering provides rapid organization of documents so that users can quickly understand the conceptual composition and distribution spread within large document sets. Clustering automatically classifies documents based on each document’s predominant concept or theme—it also creates the taxonomy structure and category naming scheme without human intervention. Because the unsupervised clustering feature is dynamic and can be just-in-time within a solution’s workflow, it is ideal when quick and general insight into a document set is required.
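For illustration, a generic version of this kind of unsupervised grouping can be sketched with open-source tools: LSI document vectors followed by k-means clustering, with each cluster labeled by its most prominent terms. The corpus, cluster count, and labeling scheme below are hypothetical, and the sketch does not reproduce CAAT's own clustering algorithms.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
import numpy as np

docs = [
    "The vet vaccinated the dog and checked the puppy's health.",
    "Obedience training helps a young dog learn basic commands.",
    "Quarterly revenue grew as the company expanded into new markets.",
    "Analysts reviewed company earnings and quarterly market expansion.",
    "The guitarist opened the concert with a blues improvisation.",
    "The festival lineup mixed blues, rock and roll, and classical music.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
doc_vectors = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(doc_vectors)
terms = np.array(vectorizer.get_feature_names_out())

for label in range(3):
    members = [i for i, l in enumerate(kmeans.labels_) if l == label]
    # Name the cluster by its most prominent terms, a stand-in for an
    # automatically generated category label.
    weights = np.asarray(X[members].sum(axis=0)).ravel()
    top_terms = terms[weights.argsort()[::-1][:3]]
    print(f"Cluster {label} ({', '.join(top_terms)}): docs {members}")

The output gives a quick, unsupervised view of what the collection is about: each cluster gathers conceptually related documents and surfaces the terms that characterize them, without anyone defining categories in advance.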


Categorization, on the other hand, allows users to participate in the taxonomy development and analytics training process. As with all other types of supervised document classification technologies, categorization demands more upfront effort during taxonomy development and the requisite learning process to define the categories for CAAT. The compelling benefit is a much more refined and accurate result set placed into pre-defined categories of interest. Categorization is a powerful solution where the best of human insight and software efficiency can be combined to yield the most accurate classification results possible.
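In the spirit of the “hit spheres” shown in Figure 1.1, the sketch below categorizes incoming documents by their similarity, in LSI concept space, to reviewer-chosen example documents, with a similarity threshold standing in for the sphere’s radius. The example documents, threshold value, and category names are hypothetical, and this is an illustrative approximation rather than CAAT's actual categorization method.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Training examples chosen by a reviewer for each category of interest.
examples = {
    "pets":    ["The vet vaccinated the dog and examined the puppy."],
    "finance": ["Analysts reviewed quarterly earnings and market expansion."],
}
# Unreviewed documents waiting to be routed.
incoming = [
    "Obedience training helps a young dog learn commands.",
    "The company reported strong quarterly revenue growth.",
    "The festival lineup mixed blues and classical music.",
]

corpus = [t for docs in examples.values() for t in docs] + incoming
vectorizer = TfidfVectorizer(stop_words="english")
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(vectorizer.fit_transform(corpus))

centers = {name: doc_vectors[i:i + 1] for i, name in enumerate(examples)}
THRESHOLD = 0.5  # the "hit sphere" radius expressed as a similarity floor

for text, vec in zip(incoming, doc_vectors[len(examples):]):
    sims = {name: cosine_similarity([vec], c)[0, 0] for name, c in centers.items()}
    best = max(sims, key=sims.get)
    category = best if sims[best] >= THRESHOLD else "uncategorized"
    print(f"{category:>14}: {text}")

Documents that fall inside a hit sphere are routed to that category, while documents conceptually distant from every example (here, the music item) remain uncategorized for human review.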

CAAT also provides a term expansion feature, which can detect, for any word, all the other terms in the indexed document set that are highly correlated or synonymous with it. Using the “dog” example from earlier, CAAT would also identify “husky,” “spaniel,” “mutt,” “pup,” and “man’s best friend” as highly correlated or synonymous terms.
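One common way to approximate this kind of term expansion with an LSI-style index is to compare term vectors (rows of the reduced term matrix) by cosine similarity. The sketch below does this with scikit-learn over an invented corpus; the corpus and the choice to use the raw reduced term matrix are simplifying assumptions, not a description of CAAT's feature.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Invented corpus in which dog-related terms co-occur across documents.
docs = [
    "The dog slept while the puppy chewed a toy.",
    "A spaniel is a friendly dog and a loyal companion.",
    "The husky and the mutt chased the pup around the dog park.",
    "Man's best friend, the dog, greeted its owner at the door.",
    "Quarterly revenue grew as the company expanded into new markets.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)                  # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(X)

# Each column of the reduced decomposition places one vocabulary term in
# concept space; nearby terms are highly correlated in the indexed set.
term_vectors = svd.components_.T
terms = list(vectorizer.get_feature_names_out())

query = "dog"
q_vec = term_vectors[terms.index(query)].reshape(1, -1)
scores = cosine_similarity(q_vec, term_vectors)[0]
ranked = sorted(zip(scores, terms), reverse=True)
# Expect dog-related vocabulary (puppy, spaniel, mutt, pup, ...) near the top.
print([term for score, term in ranked if term != query][:5])

Because the expansions come from the indexed documents themselves, the suggested terms reflect how the collection actually uses language rather than a pre-built thesaurus.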

Being a math-based indexing engine, CAAT has no native understanding of language, the benefit of which is complete language agnosticism. Indexing of German documents enables the same analytics features and yields the same accurate results as the indexing of documents in English, Chinese, or Arabic. Despite being language agnostic, CAAT does have the ability to detect the differences between languages due to its term analysis. Therefore, it can identify the primary language of a document.

All of these features can be accessed whenever needed in a larger software platform to increase the findability of documents and improve the accuracy and relevance of document classification.

Learning More About CAAT

The power of CAAT and its LSI indexing technology has been integrated into dozens of software solutions in a number of different markets. To learn more about these success stories, go to www.contentanalyst.com.

Figure 1.1: Example documents plus a threshold define “hit spheres”; documents in the search index that fall within the hit spheres are categorized.

Figure 1.2: CAAT finds tight groupings of concepts; groups and subgroups are then selected based on settings. Numeric values help interpret the results, and document scores indicate closeness to the centers of the clusters.


About Content Analyst Company

We provide powerful and proven Advanced Analytics that exponentially reduce the time needed to discern relevant information from unstructured data. CAAT, our dynamic suite of text analytics technologies, delivers significant value wherever knowledge workers need to extract insights from large amounts of unstructured data. Our capabilities are easily integrated into any software solution, and our support strategy for our partners is second to none.

© 2013 Content Analyst Company, LLC. All rights reserved. Content Analyst, CAAT and the Content Analyst and CAAT logos are registered trademarks of Content Analyst, LLC in the United States. All other marks are the property of their respective owners.

References

1. Frank Ohlhorst, “The Promise of Big Data,” InfoWorld, September 2010.

2. John Markoff, “Armies of Expensive Lawyers, Replaced by Cheaper Software,” New York Times, March 4, 2011.

3. C. Korycinski and Alan F. Newell, “Natural Language Processing and Automatic Indexing,” The Indexer, April 1990.

4. Mark Liberman, “Linguists Who Count,” Language Log, May 28, 2009.

5. Yangqiu Song et al., “Short Text Conceptualization Using a Probabilistic Knowledgebase,” Proceedings of the 22nd International Joint Conference on Artificial Intelligence, 2011.

6. Roger Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks,” Proceedings of the Applied Computing Conference, September 2009.

7. Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”

8. Roger Bradford, “Implementation Techniques for Large-Scale Latent Semantic Indexing Applications,” Proceedings of the 20th ACM International Conference on Information and Knowledge Management, October 2011.

9. Bradford, “Comparability of LSI and Human Judgment in Text Analysis Tasks.”