Text analytics in social media

Content

• Introduction to text mining in relation to social media

• Unique features of texts in social media

• Applying Text analytics in social media

• Example of text analytics in social media

Text Mining and Social Media

The picture here shows the top 10 sites that generate a lot of traffic. The majority of them fall under the social media umbrella.

Social media can therefore be described as a medium through which information and communication can be accessed, shared and discussed.

Text Mining and Social Media

Category | Representative sites
Wiki | Wikipedia, Scholarpedia
Blogging | Blogger, LiveJournal, WordPress
Social news | Digg, Briefing, Mixx, Slashdot
Microblogging | Twitter, Google Buzz
Opinion & Reviews | Epinions, Yelp
Question Answering | Stack Overflow, Yahoo! Answers, Quora
Media Sharing | Flickr, YouTube
Social Bookmarking | Delicious, CiteULike
Social Networking | Facebook, LinkedIn, MySpace

The table shows the categories into which social media can be classified.

Social media covers many types of services, which results in many kinds of data formats.

The information on most social media sites is in text format.

Text Mining and Social Media

• Given the current trend of applying data mining techniques and business intelligence to data, a question arises for social media:

“How can I get valuable information from the texts on social media platforms?”

Unique features of texts in social media

• Each kind of social media has its own distinct text characteristics, and the way that text occurs differs as well.

• Text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.

• This section gives us a hint on how to answer the previous question.


• Text preprocessing makes the input more consistent to facilitate text representation. Preprocessing methods include stop word removal and stemming.

• Feature generation / text representation: the most common approach is to transform the texts into numeric vectors. This representation is called the bag-of-words (BOW) model or the vector space model (VSM).

• Knowledge discovery: we apply machine learning or data mining methods to discover patterns or insights.
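The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular library: the stop word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
from collections import Counter
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def naive_stem(word):
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def bag_of_words(docs):
    """Return (vocabulary, vectors): one term-count vector per document."""
    token_lists = [preprocess(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

docs = ["The users are posting links", "A user posted a link on the site"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # shared vocabulary after preprocessing
print(vectors)  # one count vector per document
```

The resulting numeric vectors are what a knowledge discovery step (clustering, classification) would take as input.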


• Time Sensitivity

An important and common feature of many social media services is their real-time nature. Bloggers may update their posts every few days, but most networking sites receive updates every few minutes.

Because of this sensitivity and timeliness, the text in social media can no longer be treated as independent and identically distributed (i.i.d.) data.


• Short Length

Short messages encourage users to participate on social media sites, but they pose a great challenge for mining with clustering or classification: short texts do not provide sufficient context information for an effective similarity measure, which is the basis of many text processing methods.

Example: Twitter is limited to 140 characters and Windows Live Messenger to 512 characters, while Facebook allows 63,026 characters.
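To see why short length hurts similarity measures, consider a plain cosine similarity over raw word counts. This is a minimal sketch with made-up example messages; the whitespace tokenization is an assumption chosen for brevity.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two messages about the same topic that share no words at all.
related = cosine_similarity("new phone battery dies fast",
                            "my mobile runs out of charge quickly")
# Two messages that happen to share surface words.
overlap = cosine_similarity("new phone battery dies fast",
                            "phone battery dies")
print(related, overlap)
```

The two related messages score zero because they share no terms, which is the semantic gap that short texts expose.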


• Unstructured Phrases

The main challenge posed by content on social media sites is that its quality has high variance, ranging from very high-quality items to low-quality ones. This can be attributed to people's attitudes when posting a microblogging message or answering a question in a forum.

The difficulty is to accurately identify the semantic meaning of a text in which more than one word has been abbreviated.

Applying Text analytics in social media

• Event detection

• Event detection aims to monitor a data source and detect the occurrence of an event that is captured within that source.

• Collaborative question answering

• Analyzing the differences between conversational questions and informational questions.
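For the event detection application, one simple way to picture "detecting an occurrence" is to flag sudden bursts in how often a term is mentioned. The sketch below is an illustrative assumption, not a method described in this presentation: the window size, the threshold factor `k`, and the hourly counts are all hypothetical.

```python
from statistics import mean, pstdev

def detect_bursts(counts, window=3, k=2.0):
    """Return indices whose count exceeds mean + k * std of the previous `window` counts."""
    bursts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        threshold = mean(history) + k * pstdev(history)
        if counts[i] > threshold:
            bursts.append(i)
    return bursts

# Hypothetical number of posts per hour mentioning some keyword.
hourly_mentions = [4, 5, 6, 5, 40, 6, 5]
bursts = detect_bursts(hourly_mentions)
print(bursts)
```

Only the hour with the spike is flagged; the hours after it are compared against a history that already contains the spike, so they are not.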

Illustrative Example.

• This example illustrates how to use text analytics to solve the problems identified when applying it to social media.

• We want to improve the quality of short text representation by integrating semantic knowledge resources, which have been found useful in dealing with the semantic gap.

• This has 3 steps:

• Seed Phrase Extraction

• Semantic Features Generation

• Feature Space Construction

Seed Phrase Extraction

• Problem Statement

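• Given a sentence-level feature $T = \{t_1, t_2, \dots, t_n\}$, where each $t_i$ is a phrase-level feature contained in $T$, the similarity between $t_i$ and the other phrases in $\{t_1, t_2, \dots, t_n\}$ is given by:

$$\mathrm{InfoScore}(t_i) = \sum_{j=1,\, j \neq i}^{n} \mathrm{sem}(t_i, t_j)$$

$$t^* = \arg\max_{t_i \in \{t_1, t_2, \dots, t_n\}} \mathrm{InfoScore}(t_i)$$

where $t^*$ denotes the selected phrase-level feature.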
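The two formulas above translate directly into code. In this minimal sketch, `sem` is a hypothetical stand-in (Jaccard overlap of word sets); the presentation does not specify which semantic similarity measure is used, and the example phrases are made up.

```python
def sem(a, b):
    """Hypothetical stand-in for semantic similarity: Jaccard overlap of word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def info_score(i, phrases, sim=sem):
    """InfoScore(t_i): sum of sim(t_i, t_j) over all j != i."""
    return sum(sim(phrases[i], phrases[j])
               for j in range(len(phrases)) if j != i)

def select_phrase(phrases, sim=sem):
    """t*: the phrase with the maximum InfoScore."""
    best = max(range(len(phrases)), key=lambda i: info_score(i, phrases, sim))
    return phrases[best]

phrases = ["text mining", "social media text", "media sharing"]
t_star = select_phrase(phrases)
print(t_star)
```

Any other similarity function with the same two-argument shape can be passed in through the `sim` parameter.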

Semantic features Generation

• The seed phrases have now been extracted in the first step.

• This step aims to generate semantic features from the seed phrases. The seed phrases give us an informative and effective basic representation of the input text.

• We use Wikipedia as our target social media resource.

Algorithm

Problem Statement: Given a set of seed phrases from a preprocessed text corpus, generate the semantic features of the text.

Feature Space Construction

• For the sake of data quality, effectiveness and preserving valuable original information, we conduct 2 more basic steps in this process:

• Feature filtering, to refine out meaningless features

• Feature selection, to avoid aggravating the “curse of dimensionality”


• Feature Filtering

For the Wikipedia example, we formulate rules to refine the unstructured features. Some possible rules:

Remove features generated from overly general seed phrases.

Transform features, e.g. “List of hotels” → “hotels”.

Remove features related to chronology.
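The three example rules could be expressed as a small filter. This is only a sketch: the set of "too general" seed phrases and the regular expression used to spot chronology features (four-digit years, decades, "century") are hypothetical choices made for illustration.

```python
import re

# Hypothetical set of seed phrases judged too general to yield useful features.
TOO_GENERAL_SEEDS = {"thing", "people"}

# Hypothetical pattern for chronology-related features (years, decades, centuries).
CHRONOLOGY = re.compile(r"\b\d{4}s?\b|\bcentury\b", flags=re.IGNORECASE)

def filter_features(features_by_seed):
    """Apply the three example rules to a {seed phrase: [features]} mapping."""
    kept = {}
    for seed, features in features_by_seed.items():
        if seed.lower() in TOO_GENERAL_SEEDS:
            continue  # rule 1: drop features from overly general seed phrases
        cleaned = []
        for feature in features:
            # rule 2: transform "List of X" into "X"
            feature = re.sub(r"^list of\s+", "", feature, flags=re.IGNORECASE)
            # rule 3: drop features related to chronology
            if CHRONOLOGY.search(feature):
                continue
            cleaned.append(feature)
        kept[seed] = cleaned
    return kept

raw = {
    "hotel": ["List of hotels", "Hospitality industry", "2008 in tourism"],
    "people": ["Human"],
}
filtered = filter_features(raw)
print(filtered)
```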


• Feature Selection

• We need to select semantic features to construct feature space for various

tasks.

• The number of needed features is determined by specific tasks.


• First, we calculate the tf-idf weights of all generated features. Tf-idf (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

• One seed phrase may generate k semantic features, denoted by $\{f_{i1}, f_{i2}, \dots, f_{ik}\}$.

• The selection rule here is one seed phrase, one feature:

$$f_i^* = \arg\max_{f_{ij} \in \{f_{i1}, f_{i2}, \dots, f_{ik}\}} \mathrm{tf\_idf}(f_{ij})$$
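A minimal sketch of this "one seed phrase, one feature" rule follows. It assumes each document is represented as a list of its generated features and uses a common tf-idf variant (relative term frequency times the log of inverse document frequency); the presentation does not say which variant is intended, and the corpus below is made up.

```python
import math

def tf_idf(feature, doc, corpus):
    """tf-idf of `feature` in `doc`; every document is a non-empty list of features."""
    tf = doc.count(feature) / len(doc)
    df = sum(1 for d in corpus if feature in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

def select_one_per_seed(candidates_by_seed, doc, corpus):
    """For each seed phrase, keep the candidate feature with the highest tf-idf."""
    return {
        seed: max(candidates, key=lambda f: tf_idf(f, doc, corpus))
        for seed, candidates in candidates_by_seed.items()
    }

# Hypothetical corpus: each document is the list of features generated for it.
corpus = [
    ["hotels", "tourism", "hotels", "travel"],
    ["tourism", "travel", "airlines"],
    ["tourism", "software"],
]
doc = corpus[0]
candidates = {"hotel": ["hotels", "tourism"], "trip": ["travel", "tourism"]}
selected = select_one_per_seed(candidates, doc, corpus)
print(selected)
```

Here "tourism" appears in every document, so its idf is zero and it loses to the more specific candidate for both seed phrases.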


• Second, the top n features are extracted from the remaining semantic features based on their frequency.

• These frequently appearing features, together with the m features from the first step, are used to construct the m+n semantic features.
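Combining the two selection steps might look like the sketch below. The function name, its arguments and the sample data are hypothetical; note that if two seed phrases select the same feature, the de-duplication here yields fewer than m distinct first-step features.

```python
from collections import Counter

def build_feature_space(selected_per_seed, all_features, n):
    """Combine the first-step features with the n most frequent remaining ones."""
    # Features chosen in the first step (one per seed phrase), duplicates removed
    # while preserving order.
    first = list(dict.fromkeys(selected_per_seed))
    chosen = set(first)
    # Everything generated that was not already selected in the first step.
    remaining = [f for f in all_features if f not in chosen]
    top_n = [feature for feature, _ in Counter(remaining).most_common(n)]
    return first + top_n

selected_per_seed = ["hotels", "travel"]  # m = 2 features from the first step
all_features = ["hotels", "tourism", "travel", "tourism",
                "airlines", "tourism", "airlines", "software"]
feature_space = build_feature_space(selected_per_seed, all_features, n=2)
print(feature_space)
```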

Finally

• With all these steps completed and the feature space generated, we can apply text clustering or any other text analytics method.

• In conclusion, although this subject is still under active research, this short presentation has shown how text analytics can be applied to social media resources.

References: Aggarwal, C., Zhai, C. (eds.), Mining Text Data, Ch. 12
