Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
Overview
• Introduction to social text data
• Basic techniques to manipulate text data
• Typical text mining applications for social science
• Hot trend and future direction
• Conclusion
Overview
• Introduction to social text data
• Basic techniques to manipulate text data
• Typical text mining applications for social science
• Hot trend and future direction
• Conclusion
Social Text Data
• Various kinds of social media
– Text data is one of the most important data types
• Tweets
• Blogs
• Comments
• Profiles
• Messages
• ……
Characteristics of social text data
• User-oriented
– Social text data is created by users themselves
• A social networking account corresponds to an offline user in real life
• Users can generate text contents as they like
• It provides a valuable data resource to understand social users
Characteristics of social text data
• Opinionated text
– Not only including facts
Characteristics of social text data
• Informal and noisy text
– New terms were created by users
– A traditional vocabulary cannot work
Characteristics of social text data
• Rich-link text
– On Twitter, we can simply connect ourselves with any celebrity account by using two typical ways:
• mention (@) and retweet (RT)
Characteristics of social text data
• Rich-type text
– Multiple types of information are often embedded in the text
• E.g., – Hashtag
– Check-ins (Location and time)
– Phone type
Overview
• Introduction to social text data
• Basic techniques to manipulate text data
• Typical text mining applications for social science
• Hot trend and future direction
• Conclusion
Fundamental natural language processing techniques
• Turn text into semantic units
• Provide semantic units with meaningful annotations
Fundamental natural language processing techniques
• Turn text into semantic units
• Provide semantic units with meaningful annotations
Tokenization
• For English
– Delimiter-based tokenization method
– Is it a simple task?
Chinese Segmentation
• For Chinese
– No space characters help
• 他明天去香港 ->他--明天--去--香港 – (He is going to HK tomorrow.)
– Nearly a well-done task
• On standard text collections, the accuracy is very high For social text data, it needs more improving
• Challenge: informal writing styles and out-of-vocabulary terms – E.g., new terms created by social users
Fundamental natural language processing techniques
• Turn text into semantic units
• Provide semantic units with meaningful annotations
Part-of-speech tagging
• Word-category tagging/labeling/classification
Part-of-speech tagging (POS tagging)
• Word-category tagging
Named Entity Recognition (NER)
• A very important task: find and classify entity mentions in text, for example:
– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
• A very important task: find and classify entity mentions in text, for example:
– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Named Entity Recognition (NER)
• A very important task: find and classify entity mentions in text, for example:
– The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply.
Named Entity Recognition (NER)
Person Date Location Organization
Entity Linking
• President Obama won the 2009 Nobel Peace Prize
Semantic Analysis Model
• Statistical topic models
• Word embedding models
Semantic Analysis Model
• Statistical topic models
• Word embedding models
Statistical topic models
• Question:
– Given a large text dataset, how do we know the major topics in it?
We need a way to summarize and compress information
Statistical topic models
• Output I:
– Topics
– Can be understood as “clusters of words”
Statistical topic models
• Output II:
– Per-document topic distribution
– Can be understood as “document representation”
Statistical topic models
• Put together
Key ideas in topic models
• Capturing word co-occurrence
– If two words frequently co-occur in documents, these two words tend to be captured in top positions of the same topics
– The number of training documents are supposed to be large
• A few hundred documents will not work well
Semantic Analysis Model
• Statistical topic models
• Word embedding model
Word embedding
• Distributional semantics
Word embedding
• Distributional semantics
Word2Vec
• Input: a sequence of word tokens from a vocabulary
– Requiring a large amount of documents
• Output: a fixed-length embedding vector for each term in the vocabulary
– The embedding vector usually has several hundred dimensions with dense values
Word embedding
• Usage: finding similar words
– E.g., query word=“Sweden”
Interesting Observations
• China-Beijing = Japan - Tokyo
Overview
• Introduction to social text data
• Basic techniques to manipulate text data
• Typical text mining applications for social science
• Hot trend and future direction
• Conclusion
Typical text mining applications for social science
• Data cleaning
• Sentiment analysis
• User profiling
• User interest modeling
Typical text mining applications for social science
• Data cleaning
• Sentiment analysis
• User profiling
• User interest modeling
Data cleaning
• Cleaning social text data
– Very challenging to be perfect in practice
• E.g., word normalization – SG => Singapore
– HK => Hong Kong
– But fortunately standard techniques usually work moderately well (with slight modifications) in many cases
• E.g., word2vec and topic models can capture the above abbreviation patterns to some extent
Data cleaning
• Spam detection
– Machine learning or rule-based approach
– The key is to select effective features based on user behaviors
• Multiple postings in a short period
• Redundant or nearly redundant postings
• Irrelevant postings
Typical text mining applications for social science
• Data cleaning
• Sentiment analysis
• User profiling
• User interest modeling
Key technical components
• Opinion extraction
– “I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too.”
• What do we see?
– Opinion holders: persons who make the comments
– Opinion targets: entities and their features/aspects – Sentiments: opinions towards opinion targets
Key technical components
• Polarity identification
– Simplest form:
• Is the attitude of this text positive or negative?
• A classification-based approach – Feature selection
– Opinion lexicon
– More complex forms:
• Five-star rating prediction
• A regression-based approach
For ecommerce: Feature-based opinion summary
43
For each extracted aspect, we can derive the corresponding opinion score.
44
Using Twitter sentiment for stock price prediction
Do
w J
on
es • The CALM
mood predicts DJIA 3 days later
CA
LM
Bollen et al. (2011)
Twitter sentiment versus Gallup Poll of Consumer Confidence
Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. 2010. From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. In ICWSM-2010
Typical text mining applications for social science
• Data cleaning and preprocessing
• Sentiment analysis
• User profiling
• User interest modeling
User profiling
• Direct extraction of demographics from social profiles
– Users fill in the profiles
• Sex
• Age
• Job
• Interests
• …….
User profiling
• Implicit inference from social text data
– E.g., Amazon online reviews
“I bought my wife a new phone” => SEX = MALE
“I worked in an IT company......” => JOB = IT
Typical text mining applications for social science
• Data cleaning and preprocessing
• Sentiment analysis
• User profiling
• User interest modeling
User interest modeling
• Topical expert ranking
– Given a topic, we need to identify influential experts for that topic
– Two major components
• Interest identification and Influence ranking
• (Individual) Interest identification – Find out what topics is a celebrity interested in
• (Overall) Influence ranking – Decide how influential a celebrity is in a given topic
User interest modeling
• Topical expert ranking
Topical word cloud Topical celebrity ranking
User interest modeling
• Collective interest or attention modeling
– Event detection
Overview
• Introduction to social text data
• Basic techniques to manipulate text data
• Typical text mining applications for social science
• Hot trend and future direction
• Conclusion
Deep learning for text mining
• Deep neural models are promising
– But relying on tedious parameter tuning and large training datasets
• Shallow neural models are more practical and easier to use by non-experts
– E.g., word2vec
Future directions for text mining
• Designing more effective information extraction models
• Joint utilizing multi-source information
• Developing large-scale and efficient algorithms
Toolkit list
• Toolkits
– NLP
• http://nlp.stanford.edu/software/lex-parser.shtml
• http://cogcomp.cs.illinois.edu/page/software/
– Topic models
• https:// gibbslda.sourceforge.net
– Word2vec
• https://code.google.com/p/word2vec
Thanks!
• Acknowledgement
– Thanks for Prof. Zhu’s invitation and organization
– Thanks for your attention
Some slides are modified based on the Stanford NLP course slides.
References • https://www.coursera.org/course/nlp • https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html • Wei Shen, Jianyong Wang, and Jiawei Han. Entity Linking with a Knowledge Base: Issues,
Techniques, and Solutions. Transactions on Knowledge & Data Engineering 27(2):443--460 (2015) • Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed
representations of words and phrases and their compositionality. NIPS 2013. • D. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012. • Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu, Hady Wirawan Lauw. Detecting product review
spammers using rating behaviors. CIKM 2010: 939-948 • Yoshua Bengio, Aaron Courville, Pascal Vincent. Representation Learning: A Review and New
Perspectives. http://arxiv.org/abs/1206.5538 • Jinpeng Wang, Xin Zhao and Xiaoming Li. Mining New Business Opportunities: Identifying Trend
related Products by Leveraging Commercial Intents from Microblogs. In EMNLP’13. • Xin Zhao, YanweiGuo, Yulan He, Han Jiang, Yuexin Wu and Xiaoming Li. A Demographic-based
System for Product Recommendation On Microblogs. In KDD 2014. • Jinpeng Wang, Gao Cong, Xin Zhao, Xiaoming Li. Mining User Intents in Twitter: A Semi-Supervised
Approach to Inferring Intent Categories for Tweets. In AAAI 2015. • Xin Zhao, Jinpeng Wang, Yulan He, Ji-Rong Wen, Edward Y. Chang, Xiaoming Li. Mining Product
Adopter Information from Online Reviews for Improving Product Recommendation. TKDD 10(3): 29 (2016)
• Xin Zhao, Yanwei Guo, Rui Yan, Yulan He, Xiaoming Li. Timeline generation with social attention. SIGIR 2013: 1061-1064