Text Analytics and R
Open Question: A Good Match?

Marina Santini (LinkedIn)
Research Scientist at SICS East Swedish ICT AB (Santa Anna)

R useR MeetUp: Text analytics using R
R useR group (StockholmR)
R useR MeetUp, 14 March 2013, 18:00 Stockholm

Text analytics and R - Open Question: is it a good match?



My Quest or… Why do I attend this meetup?

– The Quest: finding the optimal way to handle Big Textual Data for Information Discovery

– The Question: is R convenient for text analytics of Big TEXTUAL Data?

– Mission: identification of pros, cons, limits, benefits …

• Current Status: investigation in progress…


Outline

• Big Data vs. Big TEXTUAL Data
• Text Analytics & NLP (Natural Language Processing)
• Statistics for linguistics with R by Stefan Th. Gries
• From Information Discovery to Actionable TEXTUAL Intelligence
• The Enron Challenge: Predictions and Crisis Intelligence


Big Data

• BIG DATA [Wikipedia]:

– Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. With this difficulty, new platforms of "big data" tools are being developed to handle various aspects of large quantities of data.

– Examples include Big Science, web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.


R, Strata, Hadoop…?

Apparently many solutions are available on the market…

Uhm… Big Data is a vague label…


Big Unstructured TEXTUAL Data

“Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data – commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and web pages.” [DM Review Magazine, February 2003 Issue]

ECONOMIC LOSS!

A plethora of diverse document genres!

Merrill Lynch is one of the world's leading financial management and advisory companies, providing financial advice and investment banking services.


Simple search is not enough…

• Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms.

– “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice Hall 2008, pp. 1–13]
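The contrast in the felony example can be made concrete with a small sketch in Python. It compares a literal single-term search with a search that first expands the query through a term list. The `TAXONOMY` mapping, the function names and the sample documents are all invented for illustration; only the felony-related terms come from the quotation above, and a real system would need a far larger curated resource.

```python
import re

# Hypothetical mini-taxonomy: the narrower terms are the ones named in the
# quoted felony example; everything else here is illustrative.
TAXONOMY = {
    "felony": ["felony", "arson", "murder", "embezzlement", "vehicular homicide"],
}

# Invented sample documents.
DOCS = [
    "The suspect was charged with a felony.",
    "Police are investigating an arson at the warehouse.",
    "The accountant was convicted of embezzlement.",
    "The meeting is moved to Friday.",
]


def simple_search(term, documents):
    """Return the indices of documents that contain the literal term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]


def expanded_search(term, documents, taxonomy=TAXONOMY):
    """Search for the term and for every narrower term listed in the taxonomy."""
    terms = taxonomy.get(term.lower(), [term])
    hits = set()
    for t in terms:
        hits.update(simple_search(t, documents))
    return sorted(hits)


if __name__ == "__main__":
    print(simple_search("felony", DOCS))    # literal matches only
    print(expanded_search("felony", DOCS))  # also matches narrower terms
```

On these toy documents the literal search finds only the first one, while the expanded search also reaches the arson and embezzlement documents. The expansion is only as good as the term list behind it.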


Textual Documents and Document Genres


Definition: Text Analytics

• A set of NLP techniques that provide some structure to textual documents and help identify and extract important information.


Set of NLP techniques

• Common components of a text analytic package are:
– Tokenization
– Morphological Analysis
– Syntactic Analysis
– Named Entity Recognition
– Sentiment Analysis
– Automatic Summarization
– Etc.
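As an illustration of the first component in the list, here is a minimal tokenization sketch in Python. It uses a single regular expression as a stand-in for a real tokenizer; the pattern and the function name `tokenize` are assumptions made for this example, not the method of any particular package.

```python
import re

# Assumed pattern: a run of word characters, optionally continued by an
# apostrophe plus more word characters (so "isn't" stays whole), or any
# single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Simple search isn't enough."))
```

A rule this simple breaks down quickly on informal text, for example abbreviations, URLs or emoticons, which is one reason tokenization is treated as a component in its own right.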


NLP at Coursera


NLP is pervasive
Ex: spell-checkers

• Google Search
• Google Mail
• Facebook
• Office Word
• […]


Sentiment Analysis


Text Analytics Products and Frameworks

• Commercial Products:
– Attensity
– Clarabridge
– Temis
– Lexalytics
– Texify
– SAS
– SPSS
– IBM Cognos
– etc.

• Open Source Frameworks:
– GATE
– NLTK
– UIMA
– etc.


However… (I)

• NLP tools and applications (both commercial and open source) are not perfect. Research is still very active in all NLP subfields.


Ex: Syntactic Parser

• Connexor

• What about parsing a tweet?
• “My son, 6y/o, asked me for the first time today how my DAY was . . . I about melted. Told him that I had pizza for lunch. Response? No fair” (Twitter Tutorial 1: How to Tweet Well)


Why are NLP and Text Analytics important for Information Discovery?

• Why is it important to know that a word is a noun, a verb, or the name of a brand?

• Broadly speaking:
– Nouns and verbs: nouns are important for topic detection; verbs are important if you want to identify actions or intentions.
– Adjectives = sentiment identification.
– Function words (a.k.a. stop words) are important for authorship attribution, plagiarism detection, etc.
– etc.
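To show why function words matter for authorship attribution, the sketch below (in Python) builds a relative-frequency profile of a few function words and compares two profiles. The word list, the function names and the distance measure are assumptions chosen for brevity; attribution studies typically use much longer word lists and more careful statistics.

```python
import re
from collections import Counter

# Assumed short list of English function words, for illustration only.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "that", "it")


def function_word_profile(text, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in function_words}


def profile_distance(p, q):
    """Sum of absolute frequency differences between two profiles."""
    return sum(abs(p[w] - q[w]) for w in p)


if __name__ == "__main__":
    known = function_word_profile("The cat and the dog sat in the sun.")
    disputed = function_word_profile("It is said that a cat sat in a hat.")
    print(profile_distance(known, disputed))
```

A smaller distance between a disputed text and an author's known writing is weak evidence for that author. Note that this signal disappears entirely if stop words are removed during preprocessing.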


However… (II)

• At present, the main pitfall of many NLP applications is that they are not flexible enough to:
– Completely disambiguate language
– Identify how language is used in different types of documents (a.k.a. genres).

(For instance, language is used differently in tweets than in emails, the language used in emails is different from the language used in academic papers, etc.)

• Tweaking NLP tools for different types of text, or resolving language ambiguity in an ad-hoc manner, is often time-consuming, difficult and unrewarding…


How can R help?

• Can R help overcome NLP shortcomings and open a new direction in Text Analytics and Information Discovery in order to extract useful information from Big TEXTUAL Data?


Existing literature for linguists

• Stefan Th. Gries (2013) Statistics for Linguistics with R: A Practical Introduction. De Gruyter Mouton. New Edition.

• Stefan Th. Gries (2009) Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge, Taylor & Francis Group (companion website).

• R. Harald Baayen (2008) Analyzing Linguistic Data: A Practical Introduction to Statistics using R. Cambridge University Press.

• ….


Companion website by Stefan Th. Gries

• BNC = British National Corpus (PoS tagged)


BNC

• The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century, both spoken and written. The latest edition is the BNC XML Edition, released in 2007.

• The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.


R & the BNC: Excerpt from Google Books

R = Corpus-based Linguistic Analysis = OK

1. Descriptive statistics
2. Analytical statistics
3. Multifactorial methods


What about Information Discovery?

• Non standardized language
• Non standard texts
• Electronic documents of all kinds, e.g. formal, informal, short, long, private, public, etc.


From Information Discovery to Actionable Textual Intelligence

• Business Intelligence (BI) + Customer Analytics + Social Network Analytics + Crisis Intelligence […] = Actionable Textual Intelligence

• Actionable Textual Intelligence is information that:
1. must be accurate and verifiable
2. must be timely
3. must be comprehensive
4. must be comprehensible
5. !!! must give the power to make decisions and to act straightaway !!!
6. !!! must handle BIG BIG BIG UNSTRUCTURED TEXTUAL DATA !!!


From The Economist: The Big Data scenario


Enron & Crisis Intelligence: The Enron Scandal

• The Enron scandal, revealed in October 2001, eventually led to the bankruptcy of the Enron Corporation, an American energy company based in Houston, Texas.

• “Enron's complex financial statements were confusing to shareholders and analysts. In addition, its complex business model and unethical practices required that the company use accounting limitations to misrepresent earnings and modify the balance sheet to indicate favorable performance. According to McLean and Elkind in their book The Smartest Guys in the Room, ‘The Enron scandal grew out of a steady accumulation of habits and values and actions that began years before and finally spiraled out of control.’” [Wikipedia]


The Enron Dataset
http://www.cs.cmu.edu/~enron/

• “This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.”

• Resource for researchers


The Challenge: Crisis Intelligence

• Task: Can you suggest and implement, with R, a predictive model that would have told us that the Enron CRISIS (= scandal & collapse) was going to happen, by analysing and processing the raw textual data of the emails in the Enron dataset?
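One possible first step towards the task, sketched here in Python, is to turn raw messages into a time series that a predictive model could later use. The sketch parses each raw email, groups messages by month, and computes the share of messages that mention any term from a watch list. It is a descriptive signal only, not a predictive model. The `WATCH_TERMS` list, the function name and the two toy messages are invented for illustration and are not taken from the Enron corpus.

```python
from collections import defaultdict
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical watch list; choosing these terms well is part of the challenge.
WATCH_TERMS = ("bankruptcy", "investigation", "write-off")

# Invented toy messages, NOT taken from the Enron corpus.
TOY_EMAILS = [
    "Date: Mon, 14 May 2001 16:39:00 -0700\nSubject: lunch\n\nPizza today?\n",
    "Date: Tue, 16 Oct 2001 09:00:00 -0700\nSubject: news\n\n"
    "Rumours of an investigation.\n",
]


def monthly_alert_share(raw_emails, watch_terms=WATCH_TERMS):
    """Map each month (YYYY-MM) to the share of messages mentioning a watch term."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for raw in raw_emails:
        msg = message_from_string(raw)
        month = parsedate_to_datetime(msg["Date"]).strftime("%Y-%m")
        body = msg.get_payload().lower()  # assumes plain, non-multipart messages
        totals[month] += 1
        if any(term in body for term in watch_terms):
            flagged[month] += 1
    return {month: flagged[month] / totals[month] for month in sorted(totals)}


if __name__ == "__main__":
    print(monthly_alert_share(TOY_EMAILS))
```

Plain keyword matching inherits the weakness of simple search, so a serious attempt would replace the watch list with features drawn from genre, topic or sentiment analysis.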


Some basic references:
• Enron scandal at-a-glance, BBC
• The Enron Dataset (corpus = dataset = document collection)
• A subset of about 1700 labeled email messages (4.5M) [genre, topic, emotion]
• Actionable Corpus & Actionable Intelligence (this post contains additional references in the comments)


Thank you for your attention

Presentation available here: http://www.slideshare.net/marinasantini1/text-analytics-and-r

http://www.forum.santini.se/