9
Volume 20 • Number 1 • Spring 2017 Competitive Intelligence 24 Finding the Needle In the Haystack From Networking Event to Action - 8 Lessons from Applied WIN LOSS Analysis Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Abstract The aim of this paper is to offer a survey of text mining applications in the field of competitive intelligence, as a way to advocate for its use alongside other CI methods within the corporate environment. Unstructured data are believed to represent the largest amount of all the data relevant for businesses, and their importance is expected to keep increasing in the next years; however, these data are not yet widely used by businesses as a source to gather competitive strategic intelligence, despite offering fresh new opportunities ready to be exploited. The two main reasons behind this trend are a general lack of knowledge about the potential strategic value of unstructured data, which are not yet fully recognized as a crucial component of a comprehensive competitive intelligence strategy, and the challenges that characterize the extraction and the analysis of unstructured information. Text mining methods, which can be leveraged to mine high- quality information from textual data stored in a variety of unstructured sources, such as social media networks, patents, surveys and competitors’ websites, can help enhance corporate strategic knowledge and represent a real game changer within a business’ competitive intelligence strategy. Introduction In the last decade, the data-driven business landscape has changed from one where structured data, a set of highly organized data, such as those contained in relational databases and spreadsheets like SQL and Excel, were dominant; to one where semi-structured and unstructured data, those lacking a recognizable structure, has emerged as not only the biggest in terms of size, but also as the fastest growing component of all the data being generated today. Unstructured data are in fact generally believed to represent around 80% of all business information, although by Marco Scacchi

EVENT - WordPress.com · competitive intelligence that enhances business decisions, reduces costs, increases performance and augments a company’s competitive advantage. From a CI

  • Upload
    ngodang

  • View
    214

  • Download
    0

Embed Size (px)

Citation preview

25www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence24

Finding the Needle In the Haystack Finding the Needle In the Haystack

EVENT

From Networking Event to Action - 8 Lessons from Applied WIN LOSS AnalysisFrom Networking Event to Action - 8 Lessons from Applied WIN LOSS Analysis

FROMTO

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

Abstract The aim of this paper is to offer a survey of text mining applications in the field of competitive intelligence, as a way to advocate for its use alongside other CI methods within the corporate environment. Unstructured data are believed to represent the largest amount of all the data relevant for businesses, and their importance is expected to keep increasing in the next years; however, these data are not yet widely used by businesses as a source to gather competitive strategic intelligence, despite offering fresh new opportunities ready to be exploited. The two main reasons behind this trend are a general lack of knowledge about the potential strategic value of unstructured data, which are not yet fully recognized as a crucial component of a comprehensive competitive intelligence strategy, and the challenges that characterize the extraction and the analysis of unstructured information. Text mining methods, which can be leveraged to mine high-quality information from textual data stored in a variety of unstructured sources, such as social media networks, patents, surveys and competitors’ websites, can help enhance corporate strategic knowledge and represent a real game changer within a business’ competitive intelligence strategy.

IntroductionIn the last decade, the data-driven business landscape has changed from one where structured data, a set of highly organized data, such as those contained in relational databases and spreadsheets like SQL and Excel, were dominant; to one where semi-structured and unstructured data, those lacking a recognizable structure, has emerged as not only the biggest in terms of size, but also as the fastest growing component of all the data being generated today. Unstructured data are in fact generally believed to represent around 80% of all business information, although

by Marco Scacchi

27www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence26

different studies, including one carried out in 2011 by the IDC, estimate that non-structured data might represent as much as 90% of all business-relevant information. Similarly to what the forecasts are for the general data domain, the data originating from unstructured sources is also expected to keep augmenting in the future. In particular, the amount of unstructured data is believed to increase at an average rate of approximately 62% per year, and by 2022, unstructured data might comprise 93% of all the existing digital data. While this increase should not come as a surprise, as the sources of unstructured data increase themselves, the peculiarity that still characterizes unstructured data is that the majority of it still remains hidden and unused by many businesses.

Should these data remain hidden, or should they be exploited? And what does this unstructured data deluge entail for businesses and in particular for the Competitive Intelligence field? First of all, it must be said that the ability to extract valuable intelligence from business-related unstructured data remains a challenging task and many businesses are not yet fully exploiting

An indifference towards unstructured data is hard to understand, as unstructured data are found in documents, webpage contents, surveys, clickstreams, emails, patents, videos, and in social media contents such as tweets, blogs and Facebook posts, to name but a few. Some studies estimate that every two seconds five million emails are sent and almost fifteen thousand tweets are published. It is also estimated that two million blog posts are published every day, alongside almost five billion pages that currently exist in the Indexed Web. These statistics are not comprehensive, as they do not count the number of private corporate documents and webpages protected by firewalls, and the Invisible Web, which is estimated to contain approximately other 550 billion individual documents. The growth of unstructured data is not linear, and this rate of increase might change in the future, however, these numbers give us a clear picture of the so-called unstructured data deluge and of the opportunities that might come with it. The amount of sources available is indeed staggering, and many believe that unstructured data might really be a game changer in competitive intelligence if compared to the spreadsheets and the SQL-based relational databases that still represent

the bread and butter of most analysts today. This becomes even more apparent from a corporate perspective if we accept the 80:20 ratio of unstructured to structured data, as it means that probably a sizable number of businesses are gaining value from analysing only 20% of their information, a small fraction of their entire available data. It seems clear then that a huge potential exists from the analysis of other non-traditional sources,

this possibility, yet they are surrounded by these kind of data. A 2015 IDG Enterprise study found that more than 80% of IT professionals believed that structured data projects represented a top priority for their organization, whereas only 43% of them viewed unstructured-based data projects as a company priority, suggesting that many businesses might still be unaware of the potential hidden behind unstructured data, or might be unable to fully harness such a wealth of information. These opinions towards unstructured data analysis can be corroborated by the results of a poll conducted by KDnuggets a few years earlier, in 2011, where it was found that 35% of the subjects interviewed never used text analytics to mine unstructured data. What is particularly interesting in this poll, is that the respondents were data miners, a group which can be inferred is likely to be among the most willing, compared to other professional segments, to adopt text mining processes on work-related projects. If such a low rate of adoption exists among data miners, we can confidently assume that the rate of adoption of text mining processes must be even lower in all those businesses that do not employ data miners and other similar professionals.

such as the typically text-heavy unstructured data. These data represent a source from where to extract useful patterns, uncover hidden relationships, carry out descriptive and predictive analysis, and subsequently gain actionable competitive intelligence that enhances business decisions, reduces costs, increases performance and augments a company’s competitive advantage. From a CI point of view, this might represent an extremely valuable addition to the already established methods used to gain critical information about a company’s competitive landscape.

Whether we analyse this data deluge from a general business perspective or from a more focused angle, like a competitive intelligence point of view, it is clear that while the volume of textual data is rapidly increasing, businesses’ capability to summarize, understand, and make sense of such data for making better business decisions remains a general challenge. While the amount of information stored in unstructured sources increases day after day, the amount of information a person can process remains stable over time. With the temporal character of information becoming more and more critical as the amount of unstructured data increases, in order to gain a competitive advantage, the ability to efficiently extract actionable intelligence from textual data without the need for anyone to read the text becomes paramount from a business’s strategy perspective.

Several studies have been published on the opportunity this data deluge may offer to businesses, while few have considered it from a strictly CI perspective. The aim of this short paper is to try to shed some light on this topic, offering a (non-exhaustive) survey of text mining applications in the field of competitive intelligence.

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

29www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence28

Defining Text MiningTo give an exact definition of what text mining means is not an easy task, not only because it is a young and rapidly evolving field, but also because it is a domain which has developed from a group of related but independent disciplines as shown in figure 1.1. The Text Mining Handbook, by Feldman and Sanger defines text mining as a process “to extract useful information from data sources through the identification and exploration of interesting patterns.” In other words, text mining, often called also text analytics, includes an array of technologies to analyse semi-structured and unstructured text data and extract relevant information that can then be analysed to provide the user with valuable intelligence without having to read any text document. Being a young (especially from a practical business point of view), and rapidly evolving discipline, text mining can mean different things for different competitive intelligence professionals (CIPs). To further

• Document Classification is a process based on statistics, data mining, and machine learning techniques. It is one of the most important methods within the text mining field, and it works by building a classification model which assigns a predefined category or class to a document using information ‘learnt’ from a previously known model.

• Web Mining, which is based on natural language processing and document classification techniques is a new and exciting but rapidly evolving field in the text mining environment. Web mining is usually identified as an independent practice area within text mining, because of the huge amount of data available on the World Wide Web. The importance of web mining is forecasted to increase, as it is linearly dependent on the increase of content generated from the web.

• Information Extraction (IE), also called (text summarization) can be described as a process to extrapolate structured information from semi-structured and unstructured text. These unstructured information can be facts, events, terms and attributes of the terms. Unlike other practice areas, information extraction requires a significant degree of domain expertise in linguistics to be applied successfully.

complicate matters, the different techniques within text mining represent interrelated but distinct components with no clear dominance among them, and with a heterogeneous degree of maturity. For all these reasons, it might be useful to provide a framework of the competing techniques or processes used in text mining.

Seven Topic Areas in Text Mining In Practical Text Mining and Statistical Analysis for Non-Structured Text Data Application, written by Miner et al., 2012 a useful framework is provided to understand what is ‘behind’ text mining from a business-centric point of view. The authors of the book identify seven different complementary practice areas, or techniques, that encompass the field of text mining. These topic areas exist at the intersection of text mining and other disciplines that contributed to its creation and development. The unifying theme behind all these techniques is their contribution in the process of extracting knowledge from unstructured and semi-structured data.

• Search and Information Retrieval (IR): Search and information retrieval comprises indexing, searching, and retrieving information stored in large databases in the form of textual data using a set of keywords, sometimes enhanced using a thesaurus. In other words, information retrieval allows to decrease the number of documents relevant to a specific problem. With the steeply increase in the use of Google and other search engines, IR has nowadays become a common domain for most competitive intelligence professionals.

• Document Clustering is an automated process that applies an unsupervised learning technique, cluster analysis, to group documents that have shown some similarities into a number of different clusters.

• Natural Language Processing (NLP) refers to a field that draws from computer science, computational linguistics, and machine learning. It uses complex algorithms to process and understand human speech as it is spoken or written.

• Concept Extraction is a process that groups phrases and words into distinct groups with semantically similar characteristics.

These seven practice areas should not be viewed as independent and separated concepts in the application of text mining processes. They overlap frequently, as several applications of text mining are based upon a range of different techniques, as it is shown in figure 1.2. For instance, sentiment analysis, one of the most known processes in text mining, draws from the areas of document classification, concept extraction and natural language processing; while document ranking is based on techniques originating from web mining, document classification, and information

Figure 1.1

Figure 1.2

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

31www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence30

retrieval. While professionals carrying out text mining analyses need to have a deeper technical understanding of these techniques, decision-makers might find more useful to broadly understand which are the applications to use depending on a certain desired outcome or data to analyse. Table 1.3 gives a general overview of the text mining techniques needed for different final products.

Where can Text Mining be used for Competitive Intelligence?Most businesses recognize information as a crucial asset in strategic planning and decision making. While until not long ago the size and pace of information creation and sharing were more easily manageable, nowadays these dynamics are changing, and the temporal

Text mining is becoming an important feature in supporting competitive intelligence processes. According to a study carried out in 2011, 33% of text analysis activites are in fact driven by competitive intelligence. While text mining can be used as a standalone process to leverage unstructured data for CI use, in order to fully exploit the value of unstructured information, text mining processes should be integrated with various traditional frameworks used in CI, such as the Five Forces Analysis (FFA), the competitive positioning analysis, benchmarking analysis, Porter’s Four Corners Model, PESTLE, and SWOT analysis. The integration of these processes can in fact enable a better understanding of the competitive landscape of a business.

Relevant data sources used in competitive intelligence analysis that can be mined using text mining techniques are different and significant, and include companies’ websites, social media platforms (Facebook, Twitter, Instagram, blogs etc.), annual reports, press releases, financial analysis, online newspaper articles, surveys, speeches, and patent filings, to name but a few. This paper will focus on the use of text mining on social media networks, surveys, competitors’ websites and medical patents.

APPLICATIONS OF TEXT MINING

Mining Social Networks in the Context of Competitive Intelligence. The last decade has seen an incredible increase in social media usage across the globe, as a consequence, more and more companies nowadays are using social media networks, such as Twitter, LinkedIn and Facebook to interact with their customers. Social media, however, is also used by many businesses as a preferred channel to disseminate news about new products, new hires, and services which sometimes are not shared, or shared later, on other ‘traditional’ sources, such as companies’ press releases and online newspapers. The wide adoption of social media tools engenders a considerable amount of strategic user and company-generated content on a daily basis. This information is often stored in textual forms and represents a true source of hidden knowledge not readily available without further analysis. This aspect, coupled with the fact that social media can be effectively used to predict real world outcomes has greatly increased both the interest around this particular topic, and the attitudes toward the usefulness of social media-related insights, which are increasingly taken into account by decision-makers.

Table 1.3

character of information is becoming more and more imperative, so much that a piece of intelligence relevant today might become irrelevant tomorrow. The ability to collect real-time competitive intelligence information has become so important that systematically failing to do so might result in losing market share and revenues to competitors. This is where text mining comes into the picture. Text mining can in fact not only process unstructured and semi-structured data, but it can also help speed up and refine several competitive intelligence functionalities, such as search, filtering, event alert, and analysis. The use of text mining techniques can therefore help a business gain a competitive edge over its competitors. The general text mining model that supports the process of transferring data into actionable intelligence and strategic decision is shown in figure 1.4.

Figure 1.4 From Data to Decision Support bt TMCISs [3,26]

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

33www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence32

An example of an event detection topic for competitive intelligence might be a company launching a new product, recruiting a new type of staff or establishing a new facility. Event detection consists of three separate processes: (i) event topic reasoning, (ii) event property extracting, and (iii) similarity comparison. Event topic reasoning classifies the unstructured text into an event category; event property extraction extracts the event textual information, such as competitor’s name, name of the event, date and time, and other general and technical information. Event similarity comparison determines if the event or similar events have already occurred and eliminates duplicates. Event detection is not only a descriptive process, but can also be used as a predictive analytics tool, as it can help anticipate future actions through the detection of early warning signs, the so-called foresight studies, an important component in CI.

• Abundance of Data: The amount of information extracted and processed from unstructured data might be massive, and a sufficient storage space might be required.

• Relevance and Noise in the Data: Non-relevant textual data, spam, fake opinions, spelling errors and jargons all pose major challenges when mining social media data. All this noise needs to be taken into account when apply text mining processes to gather competitive strategic intelligence.

• Population Bias: A fundamental limitation of using social media mining for some competitive intelligence applications is sampling bias. Unfortunately, this is an aspect to which is not usually given enough importance when using the information collected from social media networks. Although the huge amount of data is extremely useful to detect sentiments, events, and for other social media monitoring processes, make inferences that apply to the general population using a sample of social media users is a gross statistical error, as social media users are usually not a representative sample of the general population. For instance, Twitter users tend to skew towards urban, young, minority individuals and to use the data collected from tweets to make inferences on a general population is a statistical error.

• Extreme Opinions: The use of some social media data for businesses’ decisions sometimes needs to be integrated with other sources to have a solid representation of the reality, as in some cases there might be possible overrepresentation of extreme opinions. For instance, only highly-satisfied and highly-dissatisfied customers might express their feedback on social media platforms, while ‘average’ customers might not be interested

Social media is one of the main driving forces behind the expansion of text mining, as it is used to gain insights into current and future market trends, industry dynamics, brand reputation, voice of the customer, market sizing and customer behaviours. Social media text mining has several different applications. Some of the most interesting from a competitive intelligence point of view are outlined below:

• Sentiment Analysis – For the sake of simplicity, we can define sentiment analysis, also known as opinion mining or polarity analysis, as a process used to discover opinions, emotions, attitudes and behaviours behind a certain topic, company, service or product. More broadly, it can be defined as a process to extract opinion-centric information from social media platforms. Sentiment analysis is based on complex algorithms and subtleties based on statistics, machine learning and natural language processing (NLP) techniques which estimate social media interaction’s sentiment (it can be a Facebook post, a tweet, a LinkedIn share) by counting the frequency associated with negative, neutral, or positive words. The final results of a sentiment analysis will provide the general sentiments, opinions, and affective states of people / customers on a specific object. Opinion mining is a powerful process, and it is an important method to capture strategic competitive intelligence either on an ad hoc basis or over a longer period of time to understand how different products, services, brands or companies are perceived by customers.

• Event Detection and Social Tagging – The goal of event detection and social tagging is to discover and monitor the frequency of current events or possible future events on a specific source or several different sources.

• Other Social Media Monitoring Processes – Alongside opinion mining and event detection, text mining processes can also be applied to the social media realm to gather intelligence on other specific aspects. For instance, among other capabilities, it is possible to capture real-time customer choices, identify influencers and key opinion leaders, monitor and predict the success of marketing campaigns, and track other general trends over time.

• Discover Patterns Related to Structured Data – Text mining techniques can be taken a step further with the integration of textual data mining outcomes with structured data to uncover hidden relationships. For instance, integration between text mining social media findings (consumer sentiments and opinions), events (e.g. competitors’ ad campaigns) and structured data such as sales, number of followers, posts, comments, likes etc. can be used to understand current and future trends, possible correlations or causations and provide useful information for strategic decision

making. For instance, one of the questions that might be answered using this hybrid approach would be how feedback is correlated with customer purchase behaviour.

Social media text mining is a powerful process; however, it might also present a range of challenges that competitive intelligence professionals need to be aware of when analysing unstructured data from social media:

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

35www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence34

instance, opinions may change radically over a short span of time, going from positive to negative and vice-versa. Analysts need to keep this in mind when analysing data from social media.

• Short Length: Some social media platforms have restrictions in place on the length of user-created content. One of the most known social platforms to apply such a restriction is Twitter, (despite recent changes on what kinds of content count toward the character limit). These sometimes do not provide sufficient information, and might represent a challenge for the extraction of actionable intelligence.

• Content-Quality Variance: A fundamental difference between textual data found in social media when compared to other traditional media is the variance of the quality of the content. The quality of the content on social media varies greatly between high-quality and low-quality, and this aspect makes the filtering of data and the extraction of business-useful intelligence more complex

it might be possible to discover new customers’ behaviours and refine a company’s market segmentation.

Web Mining: Mining Competitors’ Websites A company’s website alone can be a strategic source of competitive intelligence as it might include information on current capabilities, supply chain, customers, new product launches, links to conferences and company’s reports, and strategic information on future job openings. With competitors’ information being increasingly available on the internet, extracting information from companies’ webpages has become a critical process to gather competitive intelligence. However, manually extracting actionable intelligence from competitors’ websites presents a number of difficulties: (i) the number of webpages to analyse might be so large that it is unfeasible to search for them manually, and (ii) actionable intelligence, hidden relationships and general patterns between entities need a collective analysis to be discovered. Information and search retrieval tools, be they web search engines, such as Google or companies’ internal searches engines, can help restrict the amount of information to process. However, due to the number of sources on the web and the temporal characteristic of information, text mining applications that can extract and mine valuable information from the web have become more and more popular in recent years, and are referred to under the broad term of web mining.

But what is exactly web mining? Similarly, to what happens with text mining, there is not a single accepted definition of web mining. Nevertheless, most scholars identify it as a process used to define the analysis of text contained in webpages, which unlike text-based documents are both semi-structured and unstructured sources. While web mining is essentially the same as text mining

in leaving a comment on a social media platform.

• Shifting Social Media Platform: Social media users can swiftly change their preferred platform. In order to be sure to capture the widest array of consumers’ opinions, competitive intelligence professionals need to be aware of new social media channels.

• Languages: Many text mining tools support different languages, so that information from unstructured data can be extracted from a variety of sources. However, as the processed data will need to be evaluated by an analyst, language skills remain critical for competitive intelligence professionals.

• Time Sensitivity: A peculiar characteristic of social media networks is their real-time nature. For instance, bloggers may update their content once a week, while social media users can share hundreds of posts within a day. This massive content generation goes alongside a rapid evolution of the type of content shared and of the communication styles used. For

if compared to other domains. Extracting Actionable Insights from Surveys Text mining can also be used to extract and categorize text-based information found in surveys. Surveys represent a great way to decrease the population bias when trying to capture customers’ sentiment or behaviours and carry out inferential statistical analysis (which means to generalize your sample results to a wider population), because, unlike social media networks, a more robust survey sampling process can be built and utilized. However, manually coding text responses from surveys to place them into a set of categories can take many days or even weeks, depending on the amount of data and the complexity of text responses. Text mining can speed up this process and help extract useful information from surveys characterized by a large amount of data. Text mining techniques are particularly useful with surveys that include open-ended questions (e.g. consumer research), as they can help extract meaning from what people say. The reasons why open-ended questions should be used when designing a survey are outside the scope of this paper, however, it should be noted that to gather valuable competitive strategic intelligence, a mix of open-ended and closed questions represents the best strategy to follow.

Alongside the use of text mining techniques on surveys to capture sentiments, behaviours, and trends, text mining of survey data can also help with customer segmentation when used in conjunction with other survey data, as it can help discover hidden patterns that can then be used to refine a business’ knowledge about its customers. For instance, by extracting specific terms or concepts, grouping them together by type, counting the relative frequencies of words, applying text clustering to discover new likely relevant topics, and correlation between terms,

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

37www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence36

from server logs; in other words, it can be thought of as a process used to discover and predict the behavioural pattern of the users.

It is apparent that web mining can be extremely useful for both external and internal-based competitive intelligence. For instance, web content mining can help ‘scrape’ competitors’ websites in order to process a lot of information in a shorter amount of time compared to traditional methods, while web usage mining can effectively retrieve information about the behaviours of a company’s own customers and help predict their future trends.

Mine Medical Patents for Competitive Intelligence Every competitive intelligence professional that has worked on scientific projects involving the analysis of medical patents knows that patents are complex, difficult, and time-consuming documents to analyse. However, patents can reveal a great deal of information about the competitive landscape, especially in light of the fact that the amount and type of intelligence found in patents might not be available in other sources, such as journal articles, scientific abstracts, and medical websites and blogs. A number of important differences, in both the content and in the writing style and structure, exist between patents and other scientific articles and medical publications, and text mining strategies applied on patents represent a specialized challenge worth analyzing in more detail. ‘Patent Mining’, as this particular form of text mining has become known, can be a winning approach to build a comprehensive competitive analysis, as it can be used to extract actionable intelligence on a competitor’s technological development, on its new products, drugs, or industrial processes, on its strengths and weaknesses, on its R&D efforts and directions, and more generally to understand

patent.

• Chemical Name Identification: Identification of chemicals in patents represents a further important step, as patented chemical compounds play a vital role for the economics of many medical industries, and represent crucial competitive intelligence information. To enhance this identification process, semantic similarity and image-to-structure conversion methods can also be applied to find similar terms and extract information from images and graphs.

• Relation Extraction: The third step is to search for relations between terms. Usually this is achieved by extracting co-occurrence of names in the patent.

• Classification and Clustering: Supervised (classification), and unsupervised (clustering) techniques can also be applied to a patent document.

Further advancements in patent mining are needed to accomplish the same degree of development found in other text mining areas (such as social media networks). Among future possible research developments for patent mining, the most pressing seem to be (i) further research on the best methods to analyse co-occurrence of terms in distinct sections of a patent (ii) text mining techniques to analyse the noise linked to the increasing multilingualism nature of patents, and (iii) integration of data from patents with other unstructured medical documents (e.g. scientific articles) and structured sources to enhance the competitive intelligence efficacy of information mined from patents.

(it is indeed often used interchangeably – the reason behind the confusion over its meaning), a relatively unstructured mining and analytics field has developed around the web documents in particular, so that web mining has developed in a rather confined niche. Web mining is comprised of three distinct processes: (i) web content mining, (ii) web structure mining, and (iii) web usage mining.

• Web content mining is used to capture useful insights from web page contents. It is essentially a text mining process applied to web pages, and it involves the extraction of information from both unstructured web data in the form of text, videos, pictures, and graphs, and from structured content such as tables. The main challenges with web content mining are the following:

- Each website presents similar information in a different way; identify and grouping semantically homogenous data within a website can prove difficult.

- Web content mining can produce a lot of noise in the form of website advertisements, copyright notes, navigation links etc. Filtering out all of this noisy information might require a lot of time.

• Web structure mining is a process used to show the relationship between the user and the web, by finding useful information from the structure of hyperlinks. The main purpose of web structure mining is therefore to retrieve hidden information between webpages.

• Web usage mining, also known as web log mining, involves the discovery of user access patterns through the gathering of information

market trends and develop strategies for R&D, product marketing, and intellectual property management. However, mining data from patents presents a set of unique challenges. As mentioned above, the type of language used in patents is complex, and can entail significant challenges to the application of patent mining. The semantic meaning of technical terminologies used in patents it is often found to be inconsistent for three main reasons:

• Lack of homogeneity in terminology used. Some technical terms have several semantically equivalent reference names.

• Absence of standard nomenclature for new discoveries. At the early stages of new technological developments, inventors often use different names to identify the same technology in the patents.

• Existence of an abundant number of patents written in different languages than English. While the majority of scientific articles are published in English, the number of patents written in languages different from English is increasing, and automatically translated patents might present several technical terms containing translation errors that can harm the extraction of high-quality information.

These semantic aspects pose a real challenge to the full extraction of intelligence, however, a series of common text mining strategies to follow when extracting textual data for medical patents exist and might prove useful:

• Named-Entity Recognition: Names pertaining to specific domains, such as terms identifying names of compounds, cell types, and tissues are automatically recognized within the

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

39www.scip.orgVolume 20 • Number 1 • Spring 2017Competitive Intelligence38

some privacy concerns when these techniques are used to analyse personal data. Issues might also arise from the use of internal or cross-company data sources, which might not be accessible or fully searchable depending on specific corporate policies.

• False Expectations: Management might have the idea that text mining software can magically extract ready-to-use intelligence to solve strategic challenges and easily answer questions about the competitive landscape of a business. While text analytics can provide a lot of valuable information which might be impossible to have access to with other approaches, the reality is that text mining not only needs human interaction, but also integration with traditional competitive intelligence techniques to provide the best results.

• Text Mining and Decision-Making: The use of text mining-produced intelligence from unstructured data for strategic decision-making might need a real cultural change within some businesses. For instance, businesses which are used to gather intelligence manually and through more traditional sources and procedures might be sceptical at first of the process or of the results provided through the use of text mining techniques.

CHALLENGES

Challenges from a Management Perspective Along the course of this paper we have come across some of the technical challenges of applying text mining processes to gather competitive strategic intelligence. Nevertheless, alongside the more ‘technical’ difficulties analysts encounter when collecting, analysing and evaluating the findings obtained from a text mining process, there are a series of issues that competitive intelligence professionals might want to consider from a management perspective:

• Training and Skills: The skills needed to set up a text analytics department will vary depending on the needs of the business. While opinion mining analysis might need less technical skills, if a business is interested in building predictive models using the data analysed through text mining, a sound base of statistics will be needed. Alongside these skills, it might also come in handy to know at least a programming language, so that the results of an analysis can be reproduced and re-analysed using one of the many open source programming languages and software environments that have text mining packages available. Language skills might also represent a competitive advantage, especially if a business is interested in gathering competitive intelligence in some particular linguistic areas where no sizable operations and staff are present.

• Data Access and Legal Challenges: Alongside the herculean task of analysing huge amount of unstructured data, there is also the issue of gaining access to data sources. Not all data sources can in fact legally be used. For instance, web mining might be against the terms of use of some websites and might raise

ABOUT THE AUTHOR:Marco Scacchi currently works as a research analyst at M-Brain UK. He previously worked as a graduate market intelligence analyst at Thales Alenia Space Italy, and

as a project manager and market intelligence analyst at the Italian Society for International Organization.

He holds a Master’s Degree in Political Economy of the Middle East from King’s College London, a M.Sc. in Economic Security, Geopolitics and Intelligence from the Italian Association for the United Nations and a B.Sc. in Political Science with a major in demography.

Marco has a passion for business analytics, business statistics and programming languages, and he is the founder and editor of the ‘Data Crunching Intelligence 2.0’ blog.

REFERENCES:(2003). “Meeting the Challenge of Text. Making Text Ready for Predictive Analytics”. SPSS. Accessed February 17, 2017 at http://www.spss.ch/upload/1107356538_Making%20text%20ready%20for%20predictive%20analysis.pdf

(2015). “Big Data and the Challenge of Unstructured Data”. Ciklum. Accessed February 19, 2017 at https://www.ciklum.com/blog/big-data-and-the-challenge-of-unstructured-data/“Analyzing survey text: a brief overview”. IBM. Accessed February 17, 2017 at http://www.besmart.company/wp-content/uploads/2014/11/briefoverview01.pdf)

Berry Michael W., Kogan Jacob (2010). “Text Mining. Applications and Theory”. John Wiley & Sons Ltd., UK

Boswell, Wendy (2016). “How Big Is The Web? How Many Websites Are There?” Lifewire. Accessed February 17, 2017 at https://www.lifewire.com/how-big-is-the-web-4065573

Culotta, Aron. “Reducing Sampling Bias in Social Media Data for County Health Inference “. Accessed February 17, 2017 at (http://cs.iit.edu/~culotta/pubs/culotta14reducing.pdf)

Do Prado Hercules Antonio, Ferneda Edilson (2008). “Emerging Technologies of Text Mining: Techniques and Applications”. IGI Global, USA

Feldman Ronen, Sanger James (2007). “The Text Mining Handbook”. Cambridge University Press, Cambridge, UK

Gantz John, Reinsel David, (2011). “Extracting Value from Chaos.” IDC IVIEW. Accessed February 19, 2017 at http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf

Gantz John, Reinsel David, (2012). “The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East”, IDC IVIEW. Accessed February 19, 2017 at https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf

Gemar, German, Jimenez-Quintero, Jose Antonio (2015). “Text mining social media for competitive analysis”. Tourism & Management Studies. Volume 11, issue 1.

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining Competitive Intelligence and Unstructured Data: The Business Case for Text Mining

Competitive Intelligence40

SCIP CLIPS:

5 MINUTE BUSINESS TIPSDELIVERED BY SCIP’S CEO

NAN BULGER

GOT 5 MINUTES?

Po Hu, Minlie Huang, Peng Xu, Weichang Li, Adam K. Usadi, and Xiaoyan Zhu (2012). “Finding Nuggets in IP Portfolios: Core Patent Mining through Textual Temporal Analysis”. Accessed February 17, 2017 at https://pdfs.semanticscholar.org/8c2a/331e20f7e10ee4b8a45b14e81e3f0f03e4d7.pdf

Rodriguez-Esteban Raul, and Bundschus Markus (2016). “Text mining patents for biomedical Knowledge”. Drug Discovery Today. Volume 21, Number 6.

Saini Shipra, and Pandey Hari Mohan (2015). “Review on Web Content Mining Techniques”. International Journal of Computer Applications. Volume 118, No. 18. Accessed February 17, 2017 at http://research.ijcaonline.org/volume118/number18/pxc3903536.pdf

Siddiqui, Bilal (2014). “Survey text mining with IBM SPSS Text Analytics for Surveys, Part 1: Exploring sample survey data“. IBM. Accessed February 17, 2017 at http://www.ibm.com/developerworks/library/ba-spss-survey-text-mining1/)

Song Min, Brook Wu Yi-Fang (2009). “Handbook of Research on Text and Web Mining Technologies”. IGI Global, USA

Verma Jai Prakash, Agrawal Smita, Patel Bankim, and Patel Atul (2016). “Big Data Analytics: Challenges and Applications for Text, Audio, Video, and Social Media Data” International Journal on Soft Computing, Artificial Intelligence and Applications (IJSCAI), Vol.5, No.1. Accessed February 17, 2017 at http://aircconline.com/ijscai/V5N1/5116ijscai05.pdf

Xue, Yun, Zhou, Yilu, and Dasgupta, Subhasish (2015). “Examining Competitive Intelligence Using External and Internal Data Sources: A Text Mining Approach”. Twenty-first Americas Conference on Information Systems, Puerto Rico. Accessed February 17, 2017 at http://aisel.aisnet.org/cgi/viewcontent.cgi?article=1051&context=amcis2015

Ziegler Cai-Nicolas (2012). “Mining for Strategic Competitive Intelligence”. Springer, Berlin

Accessed February 17, 2017 at http://www.scielo.mec.pt/pdf/tms/v11n1/v11n1a10.pdf

Grimes, Seth (2011). “Text/Content Analytics 2011: User Perspectives on Solutions and Providers” Alta Plana. Accessed February 19, 2017 at http://altaplana.net/TextAnalyticsPerspectives2011.pdf

Halper Fern, Kaufman Marcia, Kirsh Daniel (2013). “Text Analytics: The Hurwitz Victory Index Report”. Hurwitz & Associates. Accessed February 17, 2017 at http://www.sas.com/news/analysts/Hurwitz_Victory_Index-TextAnalytics_SAS.PDF

Halper, Fern (2012). “Top 5 Challenges of Text Analytics”. All Analytics. Accessed February 19, 2017 at http://www.allanalytics.com/author.asp?section_id=2013

He, Wu, Zha Shenghua, Li Ling (2013). “Social media competitive analysis and text mining: A case study in the pizza industry”. International Journal of Information Management. Volume 33, Issue 3. Accessed February 17, 2017 at http://ac.els-cdn.com/S0268401213000030/1-s2.0-S0268401213000030-main.pdf ?_tid=aebaca60-eaca-11e6-ae48-00000aab0f6b&acdnat=1486206871_a59709a7ac55b9777ac8876f595b8926

http://www.internetlivestats.com/

http://www.worldwidewebsize.com/

Jaikumar, Vijayan (2015). “Solving the Unstructured Data Challenge”. CIO. Accessed February 19, 2017 at http://www.cio.com/article/2941015/big-data/solving-the-unstructured-data-challenge.html

Maynard Diana, Bontcheva Kalina, and Rout Dominic. “Challenges in developing opinion mining tools for social media “. Accessed February 17, 2017 at https://gate.ac.uk/sale/lrec2012/ugc-workshop/opinion-mining-extended.pdf)

Miner, Gary et al (2012). “Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications” Academic Press, USA

Perrin, Andrew (2015). “Social Media Usage: 2005-2015”. PewResearchCenter. Accessed February 17, 2017 at http://www.pewinternet.org/2015/10/08/social-networking-usage-2005-2015/

Competitive Intelligence and Unstructured Data: The Business Case for Text Mining