Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models

Mining Social Text Data

ZHAO Xin [email protected]

Renmin University of China

Overview

• Introduction to social text data

• Basic techniques to manipulate text data

• Typical text mining applications for social science

• Hot trend and future direction

• Conclusion

Overview





• Conclusion

Social Text Data

• Various kinds of social media

– Text data is one of the most important data types

• Tweets

• Blogs

• Comments

• Profiles

• Messages

• ……

Characteristics of social text data

• User-oriented

– Social text data is created by users themselves

• A social networking account corresponds to an offline user in real life

• Users can generate text contents as they like

• It provides a valuable data resource to understand social users


• Opinionated text

– Not only including facts


• Informal and noisy text

– New terms were created by users

– A traditional vocabulary cannot work


• Rich-link text

– On Twitter, we can simply connect ourselves with any celebrity account by using two typical ways:

• mention (@) and retweet (RT)


• Rich-type text

– Multiple types of information are often embedded in the text

• E.g., – Hashtag

– Check-ins (Location and time)

– Phone type

Overview





• Conclusion

Fundamental natural language processing techniques

• Turn text into semantic units

• Provide semantic units with meaningful annotations




Tokenization

• For English

– Delimiter-based tokenization method

– Is it a simple task?

Chinese Segmentation

• For Chinese

– No space characters help

• 他明天去香港 ->他--明天--去--香港 – (He is going to HK tomorrow.)

– Nearly a well-done task

• On standard text collections, the accuracy is very high For social text data, it needs more improving

• Challenge: informal writing styles and out-of-vocabulary terms – E.g., new terms created by social users




Part-of-speech tagging

• Word-category tagging/labeling/classification

Part-of-speech tagging (POS tagging)

• Word-category tagging

Named Entity Recognition (NER)

• A very important task: find and classify entity mentions in text, for example:

– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.







Person Date Location Organization

Entity Linking

• President Obama won the 2009 Nobel Peace Prize

Semantic Analysis Model

• Statistical topic models

• Word embedding models



• Word embedding models

Statistical topic models

• Question:

– Given a large text dataset, how do we know the major topics in it?

We need a way to summarize and compress information


• Output I:

– Topics

– Can be understood as “clusters of words”


• Output II:

– Per-document topic distribution

– Can be understood as “document representation”


• Put together

Key ideas in topic models

• Capturing word co-occurrence

– If two words frequently co-occur in documents, these two words tend to be captured in top positions of the same topics

– The number of training documents are supposed to be large

• A few hundred documents will not work well



• Word embedding model

Word embedding

• Distributional semantics

Word embedding

• Distributional semantics

Word2Vec

• Input: a sequence of word tokens from a vocabulary

– Requiring a large amount of documents

• Output: a fixed-length embedding vector for each term in the vocabulary

– The embedding vector usually has several hundred dimensions with dense values

Word embedding

• Usage: finding similar words

– E.g., query word=“Sweden”

Interesting Observations

• China-Beijing = Japan - Tokyo

Overview





• Conclusion

Typical text mining applications for social science

• Data cleaning

• Sentiment analysis

• User profiling

• User interest modeling


• Data cleaning


• User profiling


Data cleaning

• Cleaning social text data

– Very challenging to be perfect in practice

• E.g., word normalization – SG => Singapore

– HK => Hong Kong

– But fortunately standard techniques usually work moderately well (with slight modifications) in many cases

• E.g., word2vec and topic models can capture the above abbreviation patterns to some extent

Data cleaning

• Spam detection

– Machine learning or rule-based approach

– The key is to select effective features based on user behaviors

• Multiple postings in a short period

• Redundant or nearly redundant postings

• Irrelevant postings


• Data cleaning


• User profiling


Key technical components

• Opinion extraction

– “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too.”

• What do we see?

– Opinion holders: persons who make the comments

– Opinion targets: entities and their features/aspects – Sentiments: opinions towards opinion targets

Key technical components

• Polarity identification

– Simplest form:

• Is the attitude of this text positive or negative?

• A classification-based approach – Feature selection

– Opinion lexicon

– More complex forms:

• Five-star rating prediction

• A regression-based approach

For ecommerce: Feature-based opinion summary

43

For each extracted aspect, we can derive the corresponding opinion score.

44

Using Twitter sentiment for stock price prediction

Do

w J

on

es • The CALM

mood predicts DJIA 3 days later

CA

LM

Bollen et al. (2011)

Twitter sentiment versus Gallup Poll of Consumer Confidence

Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In ICWSM-2010


• Data cleaning and preprocessing


• User profiling


User profiling

• Direct extraction of demographics from social profiles

– Users fill in the profiles

• Sex

• Age

• Job

• Interests

• …….

User profiling

• Implicit inference from social text data

– E.g., Amazon online reviews

“I bought my wife a new phone” => SEX = MALE

“I worked in an IT company......” => JOB = IT


• Data cleaning and preprocessing


• User profiling


User interest modeling

• Topical expert ranking

– Given a topic, we need to identify influential experts for that topic

– Two major components

• Interest identification and Influence ranking

• (Individual) Interest identification – Find out what topics is a celebrity interested in

• (Overall) Influence ranking – Decide how influential a celebrity is in a given topic


• Topical expert ranking

Topical word cloud Topical celebrity ranking


• Collective interest or attention modeling

– Event detection

Overview





• Conclusion

Deep learning for text mining

• Deep neural models are promising

– But relying on tedious parameter tuning and large training datasets

• Shallow neural models are more practical and easier to use by non-experts

– E.g., word2vec

Future directions for text mining

• Designing more effective information extraction models

• Joint utilizing multi-source information

• Developing large-scale and efficient algorithms

Toolkit list

• Toolkits

– NLP

• http://nlp.stanford.edu/software/lex-parser.shtml

• http://cogcomp.cs.illinois.edu/page/software/

– Topic models

• https:// gibbslda.sourceforge.net

– Word2vec

• https://code.google.com/p/word2vec

http://nlp.stanford.edu/software/lex-parser.shtml




http://cogcomp.cs.illinois.edu/page/software/

http://cogcomp.cs.illinois.edu/page/software/

https://code.google.com/p/word2vec

https://code.google.com/p/word2vec

Thanks!

• Acknowledgement

– Thanks for Prof. Zhu’s invitation and organization

– Thanks for your attention

Some slides are modified based on the Stanford NLP course slides.

References • https://www.coursera.org/course/nlp • https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html • Wei Shen, Jianyong Wang, and Jiawei Han. Entity Linking with a Knowledge Base: Issues,

Techniques, and Solutions. Transactions on Knowledge & Data Engineering 27(2):443--460 (2015) • Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed

representations of words and phrases and their compositionality. NIPS 2013. • D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012. • Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu, Hady Wirawan Lauw. Detecting product review

spammers using rating behaviors. CIKM 2010: 939-948 • Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation Learning: A Review and New

Perspectives. http://arxiv.org/abs/1206.5538 • Jinpeng Wang, Xin Zhao and Xiaoming Li. Mining New Business Opportunities: Identifying Trend

related Products by Leveraging Commercial Intents from Microblogs. In EMNLP’13. • Xin Zhao, YanweiGuo, Yulan He, Han Jiang, Yuexin Wu and Xiaoming Li. A Demographic-based

System for Product Recommendation On Microblogs. In KDD 2014. • Jinpeng Wang, Gao Cong, Xin Zhao, Xiaoming Li. Mining User Intents in Twitter: A Semi-Supervised

Approach to Inferring Intent Categories for Tweets. In AAAI 2015. • Xin Zhao, Jinpeng Wang, Yulan He, Ji-Rong Wen, Edward Y. Chang, Xiaoming Li. Mining Product

Adopter Information from Online Reviews for Improving Product Recommendation. TKDD 10(3): 29 (2016)

• Xin Zhao, Yanwei Guo, Rui Yan, Yulan He, Xiaoming Li. Timeline generation with social attention. SIGIR 2013: 1061-1064

https://www.coursera.org/course/nlp

https://www.coursera.org/course/nlp

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html




Documents

Mining Social Text Dataweblab.com.cityu.edu.hk/workshops/cityu-css-2016/... · Overview •Introduction to social text data ... Deep learning for text mining •Deep neural models