
1 West Street New York NY 10004 | 646-545-3900 | [email protected] | networkedinsights.com

White Paper

Search vs. Text Classification
Increasing the signal, decreasing the noise



Topic discovery—letting data speak for itself

Topic discovery is a valuable type of semantic analysis based on text classification. Whereas sentiment analysis simply reveals people's likes and dislikes, semantic analysis refers to a group of methods that allow machines to discover the fundamental patterns of words or phrases that act as building blocks in a large set of text. Topics, themes, sentiment and similar elements of meaning appear as intricate weavings of those fundamental patterns. Semantic analysis, then, is the summarization of large amounts of text by automatically discovering the topics and themes within.

By grouping social media posts based on semantic similarity, rather than preset sentiment categories such as positive, negative and neutral, topic discovery can help companies uncover important information – for example, what exactly people are saying about a product or service; where and how they use it; the features they use most; and the enhancements or new offerings they're interested in. All of this information can ultimately drive product development, new revenue streams and strategies for marketing, advertising and media planning.

Since the advent of the World Wide Web, businesses and consumers have used a variety of ways to find information. These methods of discovery have trained us to think and behave in ways that make understanding analytics challenging. In fact, the habits that make information retrieval easy for an individual are not the right way to examine social data. Confused?

In the infancy of the commercial public Web, navigation was nearly impossible without directories and, later, information portals. With the explosion of the Web in the late 1990s, keyword search became as ubiquitous as the Internet itself. While the underlying methods of search have evolved over the years, its primary use has stayed constant since the early days of companies like Yahoo!, AltaVista, Lycos, Excite and Google. Reflecting its mass popularity and familiarity, search is often the first tool applied to a wide variety of data challenges.

But is search always the right solution? There are many things you can do with a hammer, but it’s not so great if you need to turn a screw.

To learn what customers think about your products and services, you may need to apply sentiment analysis across millions of social media posts. Or, to guide your media buying, you might use topic discovery to uncover market trends in the social conversation.

In either case, using search to identify the set of posts you'll submit to scrutiny could send your social media analysis down the wrong path from the start. Your approach to conducting sentiment analysis or topic discovery could be spot on. But if it's based on posts that aren't actually about what you think they are, which typically happens with search, the resulting noise can skew the inferences and conclusions you ultimately draw.

Text classification is an alternative to search that may be more appropriate for social media data analysis. Text classification is the task of assigning predefined categories to free-text documents. It can provide conceptual views of document collections and has important applications in the real world. Using text classification as the foundation for analysis – i.e., teaching a machine to categorize posts the way humans do – can dramatically improve your ability to gather the right data and, ultimately, increase the chances that you'll uncover what you need to know.
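To make the contrast concrete, here is a minimal sketch of what a naive keyword search actually does. The posts and query are hypothetical, invented for illustration; the point is that a keyword filter returns every post containing the term, with no notion of whether the post is about the intended topic.

```python
def keyword_search(posts, query):
    """Return all posts containing the query term (case-insensitive)."""
    q = query.lower()
    return [p for p in posts if q in p.lower()]

# Hypothetical posts: two are about babies, one is about music.
posts = [
    "Our newborn finally slept through the night!",
    "Review: the track 'Newborn' sounds great live.",  # off-topic, still matches
    "Tips for swaddling a newborn safely",
]
results = keyword_search(posts, "newborn")
assert len(results) == 3  # all three match, including the off-topic post
```

Classification, by contrast, would be trained to place the music review in a different category even though it contains the query word.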


The impact of bad data

A look at several related but distinct topics illustrates how seriously the problems of search can impact analysis.

A Networked Insights analyst designed search queries for five topics that moms typically discuss – pregnancy and newborns; school-aged children; food, nutrition and health; shopping and money; and illness and injury. Searches were run on the five topics, then another analyst reviewed the results under two test scenarios to determine how well the search delivered posts fitting the intended criteria as defined by the query.

In the first test, the analyst reviewed only the top 20 results returned by each search as ordered by the search engine. In the second test, the analyst reviewed a random sample of 200 results returned by the search. In each case, the analyst was asked to judge whether each resulting post was appropriate for the intended category or if it fit better in a different one. The percent of appropriate posts is a measure of the “precision” of the search.
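The precision measure described above is simple to state in code. This sketch uses hypothetical analyst judgments, not the study's actual data, to show how a figure like the 19.5 percent in Table 1 would be computed.

```python
def precision(judgments):
    """judgments: list of booleans, True if the analyst judged a
    returned post appropriate for the intended topic."""
    return sum(judgments) / len(judgments)

# Hypothetical example: 39 of 200 randomly sampled results on-topic.
sample = [True] * 39 + [False] * 161
assert precision(sample) == 0.195  # i.e., 19.5 percent
```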

The test results (Table 1) reveal search's severe limitations. Precision was high when only the top 20 results were examined (90 percent or higher), but fell precipitously when a larger number of randomly sampled posts was examined. In only one search, pregnancy and newborns, did the results yield a somewhat reliable level of precision (86.5 percent). In three of the five searches, precision rates were under 50 percent.

In practical terms, these results mean there's a greater chance that a randomly selected search result will not meet the intended criteria than that it will. Said another way, search might be used to support other analyses by returning a large number of posts assumed to cover the same basic topic. The problem: the majority of the data isn't relevant to the topic you want to understand.

Table 1. Keyword Search Precision

Desired Topic             Top 20 Results Only    Random Sample
Pregnancy and newborns    95%                    86.5%
School-aged children      95%                    19.5%
Food, nutrition, health   90%                    39.5%
Illness and injury        100%                   41%
Shopping and money        100%                   57.5%
Overall                   96%                    48.8%



The shortcomings of search

By definition, the intent of search is to uncover the best responses to a query. A search engine goes out and grabs hundreds of thousands of posts that match the word or phrase programmed into the query and attempts to rank them in order of relevance. Its goal is to put the post most likely to be the one you're looking for at the top of the list. The search engine does this effectively, as seen in the first column of results in Table 1.

Significant problems arise with search when you’re after a broad collection of similar posts, not a handful of the best ones. This is often the case in social media analysis, when the goal is to analyze millions of posts to identify trends that can inform marketing decisions or uncover insights that can reveal business opportunities. Simply stated, more data points are sometimes much better than a few. In these cases, search will undermine your efforts. The first 20, or even 200, posts might be great matches. But the last 20 or 200 might not match at all, as seen in the second results column of Table 1.

Search methodology has other significant shortcomings, which are more apparent when it's applied to social media data than when used with other, more structured forms of text. For example, search struggles when you're looking for something more complicated than whether or not a document contains a particular word or phrase. Search cannot contemplate the context in which words and phrases are used in relationship to one another; it can only identify whether that word or phrase is present.
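The context-blindness is easy to demonstrate. In this sketch, built on two invented posts, a presence check matches both uses of the word "shot" even though only one relates to illness and injury:

```python
# Hypothetical posts: same word, entirely different contexts.
posts = [
    "Took the kids for their flu shot this morning.",   # illness and injury
    "Got a great shot of the sunset with my new camera.",  # photography
]
matches = [p for p in posts if "shot" in p.lower()]
assert len(matches) == 2  # both match; the word's context is ignored
```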

Search also suffers a bias problem. If the searcher uses words that are not a direct reflection of the words that millions of other people use for a given topic, search can’t accommodate the differences.

To sum up the problems, search does not inherently provide a mechanism for determining which results should belong to the desired group and which should not. The norm is to simply say that all posts that match a query belong to the desired topic and use all of them in further analyses.

A better way — the power of classification

In contrast to search, text classification uses machine-learning algorithms to learn from a set of examples how to separate posts into topics. If an algorithm, or program, is presented with examples of how a human would separate posts based on topic, it can learn to mimic that person's process on new, previously unseen posts. One major advantage of this approach is that the program can scale up to perform its process on millions of documents. People do not scale up so easily.

Classification offers the potential to produce a dataset in which all of the posts are relevant to the topics being analyzed. The last 20 are as valuable to the analysis as the first 20.

© 2011 Networked Insights, Inc. All rights reserved.



The classification process begins with a human analyst selecting a sample of posts that relate to a specific topic, such as pregnancy and newborns. The analyst also selects posts that are irrelevant, so the algorithm can learn to detect the difference. These posts serve as the training examples from which the machine will learn.

A variety of algorithms can be used for classification, including artificial neural networks, support vector machines and Naive Bayes algorithms. Selecting the right algorithm and tuning it are critical, as some do well at certain problems and not so well at others.

In the next step, the algorithm learns how to categorize new posts by reading the example posts and identifying general rules that differentiate the relevant and irrelevant posts. For example, when the program sees the phrases "little one" and "hospital" together in a post, it might notice that the probability the post belongs to the pregnancy and newborns category increases significantly. It then uses this knowledge in categorizing other posts. The goal is not to memorize the training examples, but to find general characteristics that help the algorithm categorize new posts.
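The learning step described above can be sketched with a minimal Naive Bayes classifier, one of the algorithm families mentioned earlier. Everything here is a simplified, from-scratch illustration with invented training posts; a production system would use a mature library, careful tuning and far more training data.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(examples):
    """examples: list of (text, label) pairs. Returns a simple model."""
    word_counts = {}          # label -> Counter of word frequencies
    label_counts = Counter()  # label -> number of training posts
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for w in tokenize(text):
            counts[w] += 1
            vocab.add(w)
    return {"words": word_counts, "labels": label_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    total = sum(model["labels"].values())
    best_label, best_score = None, float("-inf")
    for label, n in model["labels"].items():
        score = math.log(n / total)  # prior: how common the label is
        counts = model["words"][label]
        denom = sum(counts.values()) + len(model["vocab"])
        for w in tokenize(text):
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training examples, labeled by a human analyst.
training = [
    ("our little one came home from the hospital", "pregnancy"),
    ("newborn sleep schedule tips", "pregnancy"),
    ("best coupon sites for grocery shopping", "shopping"),
    ("saving money on back to school supplies", "shopping"),
]
model = train(training)
assert classify(model, "hospital bag checklist for our little one") == "pregnancy"
```

Note that the model never saw the test sentence; it generalizes from word statistics rather than memorizing the training posts, which is exactly the property the passage above describes.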

Table 2 adds a third column to Table 1 that shows the result of using classification instead of search to identify posts presumably related to the five mom topics. The analysis approach for classification was the same as that applied to the search precision test. An independent analyst reviewed 200 randomly sampled results from classification and determined whether or not they matched the intended topic. The improvement over the search precision test is dramatic. The overall precision of using classification was 86 percent vs. 49 percent using search across all posts. For one topic – food, nutrition and health – precision rose from 39.5 percent with search to 100 percent through classification.

Table 2. Precision of Using Classification to Identify Posts in Comparison to Search

Desired Topic             Top 20 Results Only    Random Sample    Classification
Pregnancy and newborns    95%                    86.5%            88.0%
School-aged children      95%                    19.5%            72%
Food, nutrition, health   90%                    39.5%            100%
Shopping and money        100%                   57.5%            87%
Illness and injury        100%                   41%              83%
Overall                   96%                    48.8%            86%

Classification clearly provides greater precision in social data analysis. It offers deeper insights – both on a broad scale and when drilling into specific topics – than can be gleaned from standard search techniques.

Questions about this report? Want a free consultation on how social data can improve your media planning and other marketing? Contact us.

[email protected]

Millions of people use search every day to find what they’re looking for online. But search can send you off into the social media wilderness if you’re using traditional monitoring tools to discover conversations and trends. So stop searching. Instead, start asking how real-time data can support your existing decision-making processes and then use classification techniques to cut through the noise and sharpen your social analysis.
