
Natural language processing and social media intelligence



2012, Issue 1

A quarterly journal

06 The third wave of customer analytics

30 The art and science of new analytics technology

58 Building the foundation for a data science culture

Reshaping the workforce with the new analytics

Mike Driscoll, CEO, Metamarkets

44 Natural language processing and social media intelligence


02 PwC Technology Forecast 2012 Issue 1

Acknowledgments

Advisory
Principal & Technology Leader: Tom DeGarmo

US Thought Leadership
Partner-in-Charge: Tom Craren

Strategic Marketing: Natalie Kontra, Jordana Marx

Center for Technology & Innovation
Managing Editor: Bo Parker

Editors: Vinod Baya, Alan Morrison

Contributors: Galen Gruman, Steve Hamby and Orbis Technologies, Bud Mathaisel, Uche Ogbuji, Bill Roberts, Brian Suda

Editorial Advisors: Larry Marion

Copy Editor: Lea Anne Bantsari

Transcriber: Dawn Regan


US studio
Design Lead: Tatiana Pechenik
Designer: Peggy Fresenburg
Illustrators: Don Bernhardt, James Millefolie
Production: Jeff Ginsburg

Online
Managing Director, Online Marketing: Jack Teuber
Designer and Producer: Scott Schmidt
Animator: Roger Sano

Reviewers: Jeff Auker, Ken Campbell, Murali Chilakapati, Oliver Halter, Matt Moore, Rick Whitney

Special thanks
Cate Corcoran, WIT Strategy
Nisha Pathak, Metamarkets
Lisa Sheeran, Sheeran/Jager Communication

Industry perspectives

During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives:

Kurt J. Bilafer, Regional Vice President, Analytics, Asia Pacific Japan, SAP
Jonathan Chihorek, Vice President, Global Supply Chain Systems, Ingram Micro
Zach Devereaux, Chief Analyst, Nexalogy Environics
Mike Driscoll, Chief Executive Officer, Metamarkets
Elissa Fink, Chief Marketing Officer, Tableau Software
Kaiser Fung, Adjunct Professor, New York University
Kent Kushar, Chief Information Officer, E. & J. Gallo Winery
Josée Latendresse, Owner, Latendresse Groupe Conseil
Mario Leone, Chief Information Officer, Ingram Micro
Jock Mackinlay, Director, Visual Analysis, Tableau Software
Jonathan Newman, Senior Director, Enterprise Web & EMEA eSolutions, Ingram Micro
Ashwin Rangan, Chief Information Officer, Edwards Lifesciences
Seth Redmore, Vice President, Marketing and Product Management, Lexalytics
Vince Schiavone, Co-founder and Executive Chairman, ListenLogic
Jon Slade, Global Online and Strategic Advertising Sales Director, Financial Times
Claude Théoret, President, Nexalogy Environics
Saul Zambrano, Senior Director, Customer Energy Solutions, Pacific Gas & Electric


Most enterprises are more than eager to further develop their capabilities in social media intelligence (SMI)—the ability to mine the public social media cloud to glean business insights and act on them. They understand the essential value of finding customers who discuss products and services candidly in public forums. The impact SMI can have goes beyond basic market research and test marketing. In the best cases, companies can uncover clues to help them revisit product and marketing strategies.

“Ideally, social media can function as a really big focus group,” says Jeff Auker, a director in PwC’s Customer Impact practice. Enterprises, which spend billions on focus groups, spent nearly $1.6 billion in 2011 on social media marketing, according to Forrester Research. That number is expected to grow to nearly $5 billion by 2016.1

1 Shar VanBoskirk, US Interactive Marketing Forecast, 2011 To 2016, Forrester Research report, August 24, 2011, http://www.forrester.com/rb/Research/us_interactive_marketing_forecast%2C_2011_to_2016/q/id/59379/t/2, accessed February 12, 2012.

Natural language processing and social media intelligence

Mining insights from social media data requires more than sorting and counting words.

By Alan Morrison and Steve Hamby

Auker cites the example of a media company’s use of SocialRep,2 a tool that uses a mix of natural language processing (NLP) techniques to scan social media. Preliminary scanning for the company, which was looking for a gentler approach to countering piracy, led to insights about how motivations for movie piracy differ by geography. “In India, it’s the grinding poverty. In Eastern Europe, it’s the underlying socialist culture there, which is, ‘my stuff is your stuff.’ There, somebody would buy a film and freely copy it for their friends. In either place, though, intellectual property rights didn’t hold the same moral sway that they did in some other parts of the world,” Auker says.

This article explores the primary characteristics of NLP, which is the key to SMI, and how NLP is applied to social media analytics. The article considers what’s in the realm of the possible when mining social media text, and how informed human analysis becomes essential when interpreting the conversations that machines are attempting to evaluate.

2 PwC has joint business relationships with SocialRep, ListenLogic, and some of the other vendors mentioned in this publication.


Natural language processing: Its components and social media applications

NLP technologies for SMI are just emerging. When used well, they serve as a more targeted, semantically based complement to pure statistical analysis, which is more scalable and able to tackle much larger data sets. While statistical analysis looks at the relative frequencies of word occurrences and the relationships between words, NLP tries to achieve deeper insights into the meanings of conversations.

The best NLP tools can provide a level of competitive advantage, but it’s a challenging area for both users and vendors. “It takes very rare skill sets in the NLP community to figure this stuff out,” Auker says. “It’s incredibly processing and storage intensive, and it takes awhile. If you used pure NLP to tell me everything that’s going on, by the time you indexed all the conversations, it might be days or weeks later. By then, the whole universe isn’t what it used to be.”

First-generation social media monitoring tools provided some direct business value, but they also left users with more questions than answers. And context was a key missing ingredient. Rick Whitney, a director in PwC’s Customer Impact practice, makes the following distinction between the first- and second-generation SMI tools: “Without good NLP, the first-generation tools don’t give you that same context,” he says.

What constitutes good NLP is open to debate, but it’s clear that some of the more useful methods blend different detailed levels of analysis and sophisticated filtering, while others stay attuned to the full context of the conversations to ensure that novel and interesting findings that inadvertently could be screened out make it through the filters.

Types of NLP

NLP consists of several subareas of computer-assisted language analysis, ways to help scale the extraction of meaning from text or speech. NLP software has been used for several years to mine data from unstructured data sources, and the software had its origins in the intelligence community. During the past few years, the locus has shifted to social media intelligence and marketing, with literally hundreds of vendors springing up.

NLP techniques span a wide range, from analysis of individual words and entities, to relationships and events, to phrases and sentences, to document-level analysis. (See Figure 1.) The primary techniques include these:

Word or entity (individual element) analysis

•Word sense disambiguation—Identifies the most likely meaning of ambiguous words based on context and related words in the text. For example, it will determine if the word “bank” refers to a financial institution, the edge of a body of water, the act of relying on something, or one of the word’s many other possible meanings.

•Named entity recognition (NER)—Identifies proper nouns. Capitalization analysis can help with NER in English, for instance, but capitalization varies by language and is entirely absent in some.

•Entity classification—Assigns categories to recognized entities. For example, “John Smith” might be classified as a person, whereas “John Smith Agency” might be classified as an organization, or more specifically “insurance company.”



•Co-reference resolution—Identifies words that refer to the same entity. For example, in these two sentences—“John bought a gun. He fired the gun when he went to the shooting range.”—the “He” in the second sentence refers to “John” in the first sentence; therefore, the events in the second sentence are about John.

•Part of speech (POS) tagging—Assigns a part of speech (such as noun, verb, or adjective) to every word to form a foundation for phrase- or sentence-level analysis.

Relationship and event analysis

•Relationship analysis—Determines relationships within and across sentences. For example, “John’s wife Sally …” implies a symmetric relationship of spouse.

•Event analysis—Determines the type of activity based on the verb and entities that have been assigned to a classification. For example, an event “BlogPost” may have two types associated with it—a blog post about a company versus a blog post about its competitors—even though a single verb “blogged” initiated the two events. Event analysis can also define relationships between entities in a sentence or phrase; the phrase “Sally shot John” might establish a relationship between John and Sally of murder, where John is also categorized as the murder victim.
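As a toy illustration of the word-level techniques above, word sense disambiguation can be approximated by scoring each candidate sense against the words surrounding the ambiguous term. This is only a sketch: the sense inventories and sentences below are invented, and production systems use far richer models.

```python
# Toy word-sense disambiguation: pick the most likely sense of "bank"
# by counting overlap between context words and indicator words for
# each candidate sense. The sense inventories are illustrative only.

SENSES = {
    "financial_institution": {"money", "loan", "deposit", "account", "teller"},
    "river_edge": {"river", "water", "shore", "fishing", "mud"},
}

def disambiguate(word_senses, context_words):
    """Return the sense whose indicator words overlap most with the context."""
    context = set(w.lower() for w in context_words)
    scores = {sense: len(indicators & context)
              for sense, indicators in word_senses.items()}
    return max(scores, key=scores.get)

sentence = "He opened an account at the bank and asked the teller for a loan"
print(disambiguate(SENSES, sentence.split()))  # financial_institution
```

The same call with river-related context ("we fished from the bank of the river") would select the other sense, which is the essence of context-driven disambiguation.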

Figure 1: The varied paths to meaning in text analytics

Machines need to review many different kinds of clues to be able to deliver meaningful results to users.

[Figure shows paths to meaning leading from words, sentences, documents, metadata, social graphs, and lexical graphs.]


Syntactic (phrase and sentence construction) analysis

•Syntactic parsing—Generates a parse tree, or the structure of sentences and phrases within a document, which can lead to helpful distinctions at the document level. Syntactic parsing often involves the concept of sentence segmentation, which builds on tokenization, or word segmentation, in which words are discovered within a string of characters. In English and other languages, words are separated by spaces, but this is not true in some languages (for instance, Chinese).

•Language services—Range from translation to parsing and extracting in native languages. For global organizations, these services are a major differentiator because of the different techniques required for different languages.
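The tokenization and sentence segmentation steps that syntactic parsing builds on can be suggested with a minimal sketch. This assumes space-delimited words and simple terminal punctuation, which, as noted above, does not hold for languages such as Chinese.

```python
import re

# Minimal sketch of sentence segmentation built on word tokenization.
# Real parsers also handle abbreviations, quotes, and languages
# without space-delimited words.

def tokenize(text):
    """Split a string into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def segment_sentences(tokens):
    """Group tokens into sentences, splitting on terminal punctuation."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no terminal punctuation
        sentences.append(current)
    return sentences

tokens = tokenize("John bought a gun. He fired it at the range.")
print(len(segment_sentences(tokens)))  # 2
```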

Document analysis

•Summarization and topic identification—Topic identification captures in a few words the topic of an entire document or subsection; summarization, by contrast, provides a longer summary of a document or subsection.

•Sentiment analysis—Recognizes subjective information in a document that can be used to identify “polarity” or distinguish between entirely opposite entities and topics. This analysis is often used to determine trends in public opinion, but it also has other uses, such as determining confidence in facts extracted using NLP.

•Metadata analysis—Identifies and analyzes the document source, users, dates, and times created or modified.
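The polarity idea behind sentiment analysis can be sketched with a simple lexicon-based scorer of the kind purely statistical approaches use. The word lists below are invented stand-ins for a real sentiment lexicon.

```python
# Minimal lexicon-based sentiment polarity scorer. The positive and
# negative word lists are illustrative, not a real lexicon.

POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
NEGATIVE = {"broken", "slow", "awful", "hate", "refund"}

def polarity(text):
    """Return 'positive', 'negative', or 'neutral' for a post."""
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this phone, support was helpful"))  # positive
```

Real systems add negation handling ("not great"), intensity, and domain-specific vocabulary, which is where NLP improves on raw word counting.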

NLP applications require the use of several of these techniques together. Some of the most compelling NLP applications for social media analytics include enhanced extraction, filtered keyword search, social graph analysis, and predictive and sentiment analysis.

Enhanced extraction

NLP tools are being used to mine both the text and the metadata in social media. For example, the inTTENSITY Social Media Command Center (SMCC) integrates Attensity Analyze with Inxight ThingFinder—both established tools—to provide a parser for social media sources that include metadata and text. The inTTENSITY solution uses Attensity Analyze for predicate analysis to provide relationship and event analysis, and it uses ThingFinder for noun identification.
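The division of labor described here—predicate analysis for events, noun identification for entities—can be suggested with a naive subject-verb-object splitter. This is not how the inTTENSITY tools work internally; the verb list and sentence are invented for illustration.

```python
# Naive subject-verb-object extraction: use the first known verb as a
# pivot to split a sentence. A hypothetical verb list stands in for
# real predicate analysis.

VERBS = {"blogged", "bought", "acquired", "launched", "reviewed"}

def extract_svo(sentence):
    """Return (subject, verb, object) using the first known verb as pivot."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in VERBS:
            return (" ".join(words[:i]), w, " ".join(words[i + 1:]))
    return None  # no known verb found

print(extract_svo("Acme launched a new tablet"))
# ('Acme', 'launched', 'a new tablet')
```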

Filtered keyword search

Many keyword search methods exist. Most require lists of keywords to be defined and generated. Documents containing those words are matched. WordStream is one of the prominent tools in keyword search for SMI. It provides several ways for enterprises to filter keyword searches.
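The match-then-filter shape of this approach can be sketched as follows. WordStream's actual filtering is more sophisticated; the documents and filters here are invented.

```python
# Sketch of filtered keyword search: match documents against a keyword
# list, then narrow the matches with an exclusion filter.

def keyword_search(documents, keywords, exclude=()):
    """Return documents containing a keyword and no excluded word."""
    keywords = {k.lower() for k in keywords}
    exclude = {e.lower() for e in exclude}
    hits = []
    for doc in documents:
        words = set(doc.lower().split())
        if words & keywords and not words & exclude:
            hits.append(doc)
    return hits

docs = [
    "new tablet battery life is great",
    "tablet giveaway click here",
    "phone battery drains fast",
]
print(keyword_search(docs, ["tablet"], exclude=["giveaway"]))
```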

Social graph analysis

Social graphs assist in the study of a subject of interest, such as a customer, employee, or brand. These graphs can be used to:

•Determine key influencers in each major node section

•Discover if one aspect of the brand needs more attention than others

• Identify threats and opportunities based on competitors and industry

•Provide a model for collaborative brainstorming
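The first use above, finding key influencers, can be approximated with the simplest graph measure: in-degree centrality, i.e., how many accounts in the graph follow a given account. The edges below are invented; real tools combine many signals beyond follower counts.

```python
# Sketch of influencer detection on a social graph via in-degree
# centrality. Each edge is a hypothetical (follower, followed) pair.

from collections import Counter

edges = [
    ("ann", "bob"), ("carol", "bob"), ("dave", "bob"),
    ("bob", "carol"), ("dave", "carol"), ("ann", "dave"),
]

def top_influencers(edges, n=2):
    """Rank users by how many others follow them within the graph."""
    in_degree = Counter(followed for _, followed in edges)
    return [user for user, _ in in_degree.most_common(n)]

print(top_influencers(edges))  # ['bob', 'carol']
```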


Many NLP-based social graph tools extract and classify entities and relationships in accordance with a defined ontology or graph. But some social media graph analytics vendors, such as Nexalogy Environics, rely on more flexible approaches outside standard NLP. “NLP rests upon what we call static ontologies—for example, the English language represented in a network of tags on about 30,000 concepts could be considered a static ontology,” Claude Théoret, president of Nexalogy Environics, explains. “The problem is that the moment you hit something that’s not in the ontology, then there’s no way of figuring out what the tags are.”

In contrast, Nexalogy Environics generates an ontology for each data set, which makes it possible to capture meaning missed by techniques that are looking just for previously defined terms. “That’s why our stuff is not quite real time,” he says, “because the amount of number crunching you have to do is huge and there’s no human intervention whatsoever.” (For an example of Nexalogy’s approach, see the article, “The third wave of customer analytics,” on page 06.)

Predictive analysis and early warning

Predictive analysis can take many forms, and NLP can be involved, or it might not be. Predictive modeling and statistical analysis can be used effectively without the help of NLP to analyze a social network and find and target influencers in specific areas. Before he came to PwC, Mark Paich, a director in the firm’s advisory service, did some agent-based modeling3 for a Los Angeles–based manufacturer that hoped to change public attitudes about its products. “We had data on which products people had from the competitors and which products people had from this particular firm. And we also had some survey data about attitudes that people had toward the product. We were able to say something about what type of people, according to demographic characteristics, had different attitudes.”

3 Agent-based modeling is a means of understanding the behavior of a system by simulating the behavior of individual actors, or agents, within that system. For more on agent-based modeling, see the article “Embracing unpredictability” and the interview with Mark Paich, “Using simulation tools for strategic decision making,” in Technology Forecast 2010, Issue 1, http://www.pwc.com/us/en/technology-forecast/winter2010/index.jhtml, accessed February 14, 2012.

Paich’s agent-based modeling effort matched attitudes with the manufacturer’s product types. “We calibrated the model on the basis of some fairly detailed geographic data to get a sense as to whose purchases influenced whose purchases,” Paich says. “We didn’t have direct data that said, ‘I influence you.’ We made some assumptions about what the network would look like, based on studies of who talks to whom. Birds of a feather flock together, so people in the same age groups who have other things in common tend to talk to each other. We got a decent approximation of what a network might look like, and then we were able to do some statistical analysis.”

That statistical analysis helped with the influencer targeting. According to Paich, “It said that if you want to sell more of this product, here are the key neighborhoods. We identified the key neighborhood census tracts you want to target to best exploit the social network effect.”

Predictive modeling is helpful when the level of specificity needed is high (as in the Los Angeles manufacturer’s example), and it’s essential when the cost of a wrong decision is high.4 But in other cases, less formal social media intelligence collection and analysis are often sufficient. When it comes to brand awareness, NLP can help provide context surrounding a spike in social media traffic about a brand or a competitor’s brand.

That spike could be a key data point to initiate further action or research to remediate a problem before it gets worse or to take advantage of a market opportunity before a competitor does. (See the article, “The third wave of customer analytics,” on page 06.) Because social media is typically faster than other data sources in delivering early indications, it’s becoming a preferred means of identifying trends.

4 For more information on best practices for the use of predictive analytics, see Putting predictive analytics to work, PwC white paper, January 2012, http:// www.pwc.com/us/en/increasing-it-effectiveness/publications/predictive-analytics-to-work.jhtml, accessed February 14, 2012.

Many companies mine social media to determine who the key influencers are for a particular product. But mining the context of the conversations via interest graph analysis is important. “As Clay Shirky pointed out in 2003, influence is only influential within a context,” Théoret says.

Nearly all SMI products provide some form of timeline analysis of social media traffic with historical analysis and trending predictions.
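The early-warning idea behind this timeline analysis can be sketched as simple spike detection: flag any day whose mention volume far exceeds a trailing average. The counts, window, and threshold below are illustrative; commercial tools use more robust trend models.

```python
# Sketch of early-warning spike detection over a daily mention timeline.

def spikes(daily_counts, window=3, factor=2.0):
    """Return indexes of days whose count exceeds factor x trailing mean."""
    flagged = []
    for i in range(window, len(daily_counts)):
        baseline = sum(daily_counts[i - window:i]) / window
        if daily_counts[i] > factor * baseline:
            flagged.append(i)
    return flagged

mentions = [40, 38, 45, 41, 160, 52, 50]
print(spikes(mentions))  # [4] -- the day traffic jumped to 160
```

A flagged day is only a starting point; as the surrounding text argues, NLP then supplies the context for why traffic spiked.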

Sentiment analysis

Even when overall social media traffic is within expected norms or predicted trends, the difference between positive, neutral, and negative sentiment can stand out. Sentiment analysis can suggest whether a brand, customer support, or a service is better or worse than normal. Correlating sentiment to recent changes in product assembly, for example, could provide essential feedback.

Most customer sentiment analysis today is conducted only with statistical analysis. Government intelligence agencies have led with more advanced methods that include semantic analysis. In the US intelligence community, media intelligence generally provides early indications of events important to US interests, such as assessing the impact of terrorist activities on voting in countries the United States is aiding, or mining social media for early indications of a disease outbreak. In these examples, social media prove to be one of the fastest, most accurate sources for this analysis.



NLP-related best practices

After considering the breadth of NLP, one key takeaway is to make effective use of a blend of methods. Too simple an approach can’t eliminate noise sufficiently or help users get to answers that are available. Too complicated an approach can filter out information that companies really need to have.

Some tools classify many different relevant contexts. ListenLogic, for example, combines lexical, semantic, and statistical analysis, as well as models the company has developed to establish specific industry context. “Our models are built on seeds from analysts with years of experience in each industry. We can put in the word ‘Escort’ or ‘Suburban,’ and then behind that put a car brand such as ‘Ford’ or ‘Chevy,’” says Vince Schiavone, co-founder and executive chairman of ListenLogic. “The models combined could be strings of 250 filters of various types.” The models fall into five categories:

•Direct concept filtering—Filtering based on the language of social media

•Ontological—Models describing specific clients and their product lines

•Action—Activity associated with buyers of those products

•Persona—Classes of social media users who are posting

•Topic—Discovery algorithms for new topics and topic focusing

Table 1: A few NLP best practices

Strategy: Mine the aggregated data.
Description: Many tools monitor individual accounts; enterprises clearly need more than individual account monitoring.
Benefits: Scalability and efficiency of the mining effort are essential.

Strategy: Segment the interest graph in a meaningful way.
Description: Regional segmentation, for instance, is important because of differences in social media adoption by country.
Benefits: Orkut is larger than Facebook in Brazil, for instance, and Qzone is larger in China. Global companies need global social graph data.

Strategy: Conduct deep parsing.
Description: Deep parsing takes advantage of a range of NLP extraction techniques rather than just one.
Benefits: Multiple extractors that use the best approaches in individual areas—such as verb analysis, sentiment analysis, named entity recognition, language services, and so forth—provide better results than the all-in-one approach.

Strategy: Align internal models to the social model.
Description: After mining the data for social graph clues, the implicit model that results should be aligned to the models used for other data sources.
Benefits: With aligned customer models, enterprises can correlate social media insights with logistics problems and shipment delays, for example. Social media serves in this way as an early warning or feedback mechanism.

Strategy: Take advantage of alternatives to mainstream NLP.
Description: Approaches outside the mainstream can augment mainstream tools.
Benefits: Tools that take a bottom-up approach and surface more flexible ontologies, for example, can reveal insights other tools miss.
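The chained filtering Schiavone describes ("strings of 250 filters of various types") can be sketched as a pipeline in which each model narrows the stream of posts in turn. The two filters below are simplified, invented stand-ins for ListenLogic's proprietary models, using the Escort/Ford example from the quote.

```python
# Sketch of a layered filter pipeline: each stage narrows the stream.
# Filters are hypothetical stand-ins for industry-specific models.

def concept_filter(posts):
    """Keep posts mentioning a seed term (e.g., a model name)."""
    return [p for p in posts if "escort" in p.lower() or "suburban" in p.lower()]

def brand_filter(posts):
    """Keep posts that also mention a car brand behind the seed term."""
    return [p for p in posts if "ford" in p.lower() or "chevy" in p.lower()]

def run_pipeline(posts, filters):
    for f in filters:
        posts = f(posts)
    return posts

posts = [
    "Loved the escort service at the hotel",
    "My Ford Escort finally died at 200k miles",
    "Chevy Suburban towing capacity question",
]
print(run_pipeline(posts, [concept_filter, brand_filter]))
```

Note how the brand filter discards the first post, an ambiguous use of "escort"; that disambiguating effect is the point of layering filters.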

Other tools, including those from Nexalogy Environics, take a bottom-up approach, using a data set as it comes and, with the help of several proprietary universally applicable algorithms, processing it with an eye toward categorization on the fly. Equally important, Nexalogy’s analysts provide interpretations of the data that might not be evident to customers using the same tool. Both kinds of tools have strengths and weaknesses. Table 1 summarizes some of the key best practices when collecting SMI.


Conclusion: A machine-assisted and iterative process, rather than just processing alone

Good analysis requires consideration of a number of different clues and quite a bit of back-and-forth. It’s not a linear process. Some of that process can be automated, and certainly it’s in a company’s interest to push the level of automation. But it’s also essential not to put too much faith in a tool or assume that some kind of automated service will lead to insights that are truly game changing. It’s much more likely that the tool provides a way into some far more extensive investigation, which could lead to some helpful insights, which then must be acted upon effectively.

One of the most promising aspects of NLP adoption is the acknowledgment that structuring the data is necessary to help machines interpret it. Developers have gone to great lengths to see how much knowledge they can extract with the help of statistical analysis methods, and it still has legs. Search engine companies, for example, have taken pure statistical analysis to new levels, making it possible to pair a commonly used phrase in one language with a phrase in another based on some observation of how frequently those phrases are used. So statistically based processing is clearly useful. But it’s equally clear from seeing so many opaque social media analyses that it’s insufficient.

Structuring textual data, as with numerical data, is important. Enterprises cannot get to the web of data if the data is not in an analysis-friendly form—a database of sorts. But even when something materializes resembling a better described and structured web, not everything in the text of a social media conversation will be clear. The hope is to glean useful clues and starting points from which individuals can begin their own explorations.

Perhaps one of the more telling trends in social media is the rise of online word-of-mouth marketing and other similar approaches that borrow from anthropology. So-called social ethnographers are monitoring how online business users behave, and these ethnographers are using NLP-based tools to land them in a neighborhood of interest and help them zoom in once there. The challenge is how to create a new social science of online media, one in which the tools are integrated with the science.


An in-memory appliance to explore graph data

YarcData’s uRiKA analytics appliance,1 announced at O’Reilly’s Strata data science conference in March 2012, is designed to analyze the relationships between nodes in large graph data sets. To accomplish this feat, the system can take advantage of as much as 512TB of DRAM and 8,192 processors with over a million active threads.

In-memory appliances like these allow very large data sets to be stored and analyzed in active or main memory, avoiding memory swapping to disk that introduces lots of latency. It’s possible to load full business intelligence (BI) suites, for example, into RAM to speed up the response time as much as 100 times. (See “What in-memory technology does” on page 33 for more information on in-memory appliances.) With compression, it’s apparent that analysts can query true big data (data sets of greater than 1PB) directly in main memory with appliances of this size.

Besides the sheer size of the system, uRiKA differs from other appliances because it’s designed to analyze graph data (edges and nodes) that take the form of subject-verb-object triples. This kind of graph data can describe relationships between people, places, and things scalably. Flexible and richly described data relationships constitute an additional data dimension users can mine, so it’s now possible, for example, to query for patterns evident in the graphs that aren’t evident otherwise, whether unknown or purposely hidden.2

But mining graph data, as YarcData (a unit of Cray) explains, demands a system that can process graphs without relying on caching, because mining graphs requires exploring many alternative paths individually with the help of millions of threads—a very memory- and processor-intensive task. Putting the full graph in a single random access memory space makes it possible to query it and retrieve results in a timely fashion.

The first customers for uRiKA are government agencies and medical research institutes like the Mayo Clinic, but it’s evident that social media analytics developers and users would also benefit from this kind of appliance. Mining the social graph and the larger interest graph (the relationships between people, places, and things) is just beginning.3 Claude Théoret of Nexalogy Environics has pointed out that crunching the relationships between nodes at web scale hasn’t previously been possible. Analyzing the nodes themselves only goes so far.
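The subject-verb-object triples described above can be queried with simple pattern matching, which is the basic operation graph appliances parallelize at enormous scale. A minimal sketch, with invented data and "?" as a wildcard (real systems use query languages such as SPARQL):

```python
# Sketch of pattern matching over subject-verb-object triples, the
# data model the sidebar describes. Data is invented for illustration.

triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "mentions", "acme_brand"),
    ("carol", "mentions", "acme_brand"),
]

def query(triples, pattern):
    """Return triples matching a (subject, predicate, object) pattern;
    '?' in any position matches anything."""
    return [t for t in triples
            if all(p == "?" or p == v for p, v in zip(pattern, t))]

# Who mentions the brand?
print(query(triples, ("?", "mentions", "acme_brand")))
```

Each wildcard multiplies the paths to explore, which is why, as YarcData notes, graph mining at scale is so memory- and thread-intensive.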

1 The YarcData uRiKA Graph Appliance: Big Data Relationship Analytics, Cray white paper, http://www.yarcdata.com/productbrief.html, March 2012, accessed April 3, 2012.

2 Michael Feldman, “Cray Parlays Supercomputing Technology Into Big Data Appliance,” Datanami, March 2, 2012, http://www.datanami.com/datanami/2012-03-02/cray_parlays_supercomputing_technology_into_big_data_appliance.html, accessed April 3, 2012.

3 See “The collaboration paradox,” Technology Forecast 2011, Issue 3, http://www.pwc.com/us/en/technology-forecast/2011/issue3/features/feature-social-information-paradox.jhtml#, for more information on the interest graph.


Comments or requests? Please visit www.pwc.com/techforecast or send e-mail to [email protected]

To have a deeper conversation about this subject, please contact:

Tom DeGarmo, US Technology Consulting Leader, +1 (267) 330 [email protected]

Bo Parker, Managing Director, Center for Technology & Innovation, +1 (408) 817 5733, [email protected]

Robert Scott, Global Consulting Technology Leader, +1 (416) 815 [email protected]

Bill Abbott, Principal, Applied Analytics, +1 (312) 298 6889, [email protected]

Oliver Halter, Principal, Applied Analytics, +1 (312) 298 6886, [email protected]


PwC (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to develop fresh perspectives and practical advice.

© 2012 PricewaterhouseCoopers LLP, a Delaware limited liability partnership. All rights reserved. PwC refers to the US member firm, and may sometimes refer to the PwC network. Each member firm is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors. NY-12-0340

Photography
Catherine Hall: Cover, pages 06, 20
Gettyimages: pages 30, 44, 58

This publication is printed on McCoy Silk. It is a Forest Stewardship Council™ (FSC®) certified stock containing 10% postconsumer waste (PCW) fiber and manufactured with 100% certified renewable energy.

By using postconsumer recycled fiber in lieu of virgin fiber:

6 trees were preserved for the future

16 lbs of waterborne waste were not created

2,426 gallons of wastewater flow were saved

268 lbs of solid waste were not generated

529 lbs net of greenhouse gases were prevented

4,046,000 BTUs of energy were not consumed


www.pwc.com/techforecast

Subtext

Culture of inquiry A business environment focused on asking better questions, getting better answers to those questions, and using the results to inform continual improvement. A culture of inquiry infuses the skills and capabilities of data scientists into business units and compels a collaborative effort to find answers to critical business questions. It also engages the workforce at large—whether or not the workforce is formally versed in data analysis methods—in enterprise discovery efforts.

In-memory A method of running entire databases in random access memory (RAM) without direct reliance on disk storage. In this scheme, large amounts of dynamic random access memory (DRAM) constitute the operational memory, and an indirect backup method called write-behind caching is the only disk function. Running databases or entire suites in memory speeds up queries by eliminating the need to perform disk writes and reads for immediate database operations.

Interactive visualization

The blending of a graphical user interface for data analysis with the presentation of the results, which makes possible more iterative analysis and broader use of the analytics tool.

Natural language processing (NLP)

Methods of modeling and enabling machines to extract meaning and context from human speech or writing, with the goal of improving overall text analytics results. The linguistics focus of NLP complements purely statistical methods of text analytics that can range from the very simple (such as pattern matching in word counting functions) to the more sophisticated (pattern recognition or “fuzzy” matching of various kinds).