Tutorial on Computational Sociolinguistics
Interaction between society, language, data and algorithms
Anshul Bawa, Monojit Choudhury
Microsoft Research India
Outline of the Tutorial
• A brief introduction to Computational Sociolinguistics [60 min]
• What, why and how?
• Some of our research in the area
• A hands-on demo on use of Bollywood scripts for CSL studies [60 + 20 min]
• Critique of methods, Q&A and open discussion [20 min]
We will learn
• About sociolinguistics
• How to design large-scale data-driven social science studies
  • Problem formulation
  • Identifying the data
  • Preparing the data
  • Use of machine learning and statistics for analyzing the data
  • Interpreting the results
• About stereotyping in Bollywood ;-)
Sociolinguistics is the study of the effect of any and all aspects of society, including cultural norms, expectations, and context, on the way language is used, and society's effect on language.
Sociolinguists study the correlation between SOCIAL variables and LINGUISTIC variables
SOCIAL variables: Age, Gender, Geographical region, Education, Occupation, Economic Class, Rural/Urban, Political Orientation, Sexual Orientation
LINGUISTIC variables: Pronunciation, Intonation/Prosody, Vocabulary, Language Choice, Syntactic Constructs, Pragmatic Constructs
The Formality Continuum
[Figure: a scale running from High to Low formality. Items shown: Text: Legal documents; Printed Text: Literature, News; Blog; Tweets; FB Comments; Chat, SMS; Casual speech]
Thanks to Social Media…
• Speech data is expensive; social media is a good proxy
• Personal conversations
• Socially grounded data
What’s common to all these Tweets?
• ni som har möjlighet att delta i missing people's sökande efter snälla snälla gör det!!!!
• Adik… sem brape boleh bwak kenderaan? normal parent question – UiTMLendufornia
• jit fi la fin du mois de decembre kan ljaw bared ktir wttalj
They are all Code-Mixed
Code-Mixing or Code-Switching is mixing of more than one language in a single conversation or utterance.
MULTILINGUAL SOCIETIES provide unique and interesting challenges
• Code-switching
• Language preference
• Linguistic Accommodation of Language choice
50% of the world's population is multilingual
Zo vs. Ruuh
Good. Aur tum kaise ho?
I m feeling sleepy
Hey, Whaddup?!
um ok…ignoring and moving along...
Me too yaar.
Insomnia is the worst.
Wait, what?
wanna count sheep?
wassup!
The Mélange Team: Kalika Bali, Monojit Choudhury, Sunayana Sitaram, Indrani Medhi Thies, Anshul Bawa, Adithya Pratapa, Brij Srivastava
Past members: Ashutosh Baheti, Shruti Rijhwani, Royal Sequiera, Chandra Maddila and a bunch of interns
Project Mélange
https://www.microsoft.com/en-us/research/project/melange/
How much do people around the world CODE-SWITCH?
Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique.
Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, Chandra Maddila
ACL 2017
Hindi-English Code-Switching on Social Media
In public pages from Facebook (of Indian celebrities, movies and BBC Hindi News)
• ALL sufficiently long threads were multilingual
• 17.2% of the comments/posts have code-mixing
Bali et al. I am borrowing ya mixing: An analysis of English-Hindi Code-mixing in Facebook. 1st Workshop on Computational Approaches to Code-switching, EMNLP 2014
Worldwide language distribution of monolingual and code-switched tweets computed over 50M Tweets (restricted to the 7 languages)
3.5% of tweets are code-switched
Fraction of monolingual English tweets is strongly negatively correlated (-0.85) with the fraction of code-switched tweets
This is surprising … especially for extremely multilingual US cities (e.g., Houston)
(?) ACCULTURATION takes place much faster in the US
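A correlation like the -0.85 reported above can be computed once every region has two numbers: its fraction of monolingual English tweets and its fraction of code-switched tweets. The sketch below assumes a Pearson coefficient over per-region fractions (the slides do not name the statistic), and the two lists are made-up placeholder values, not the study's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical per-region fractions, for illustration only.
english_only = [0.90, 0.75, 0.60, 0.40, 0.20]
code_switched = [0.01, 0.02, 0.04, 0.05, 0.08]

r = pearson(english_only, code_switched)
print(f"r = {r:.2f}")  # negative: more English-only tweets, fewer code-switched ones
```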
Dilemma of Multilinguals: What language should I Tweet in?
Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?
Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, Niloy Ganguly
EMNLP 2016
When and why do multilinguals prefer a certain language?
It’s unpredictable!
Topic change
Puns
Emphasis
Emotion
Reported Speech
Language Preference
Hypothesis: Hindi-English bilinguals use Hindi for expressing emotional content, whereas they use English for expressing facts
We might praise you in English, but gaali to Hindi me hi denge!
Study of 830K Tweets from Hi-En bilinguals
1. The native language, Hindi, is strongly preferred (10 times more) for negativity and swearing
2. English is used far more for positive sentiment than negative
3. Language change often corresponds with changing sentiment
[Figure: fraction of tweets with swear words, Hindi vs. English]
Some Remarks
Inferences drawn using majority-language data are likely to be misleading for multilingual societies
Intriguing sociolinguistic questions from our observations
- English is highly preferred for positive sentiment expression. Is that because it's the language of aspiration in India?
- Variance across different multilingual communities and languages
- Is social media actually representative of society?
For more information …
• Nguyen, Dong, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. "Computational Sociolinguistics: A Survey." Computational Linguistics (2016)
• The Code-mixing Blog: Poco Mix Maadi, https://pocomixmaadi.wordpress.com
• Talk by Dr. Animesh Mukherjee [11:30 – 12:00]: Language of social media: hashtags, topics and mixing
Language, individual and the society
[Diagram: Functions of Language, Dynamics of Language, Structure of Language]
Interaction between Language & Society
[Diagram: Functions, Dynamics and Structure of Language set against Functions, Dynamics and Structure of Society; the two directions of influence are labeled Sociology of Language and Sociolinguistics]
Online Social Networks
From an Individual’s perspective (node)
• Can we use NLP to predict an individual's
  • Moods and Mental state
  • Habits and Behavior
  • Demographic attributes – gender, ethnicity, region and language, education
  • Health: Mental, physical
  • Language acquisition
Example 1: Individual
Predicting Depression via Social Media. M De Choudhury, M Gamon, S Counts, E Horvitz. ICWSM 2013
• Crowdsourcing to compile a set of Twitter users who report being diagnosed with clinical depression, based on a standard psychometric instrument.
• Through their social media postings over a year preceding the onset of depression, measure behavioral attributes relating to social engagement, emotion, language and linguistic styles, ego network, and mentions of antidepressant medications.
• Leverage these behavioral cues to build a statistical classifier
“decrease in social activity, raised negative affect, highly clustered ego networks, heightened relational and medicinal concerns, and greater expression of religious involvement.”
From Society’s perspective (whole network)
• Language evolution
  • Diffusion of linguistic innovation
  • Effect of Social influence on language change
• Prevalence of certain traits: smoking, depression or swearing
• Correlation between traits and demographic factors
Examples of code-mixing and code-choice that we covered earlier.
From a group’s perspective (community)
• Dominance hierarchy
• Dialectal features (slangs, lingos)
• Homogeneity vs. language use
• Inclusivity
• New member dynamics
• Social ostracizing and outcasting
From a Relationship’s perspective (edge)
• Can we use NLP to predict
  • Dominance
  • Formality
  • Politeness
  • Threats, humiliation, stalking
  • Accommodation
Language and Social Interaction
Social contexts of conversation
• Addressee/Audience
• Topic
• Social goals of speakers
Speaker agency
Individual variation
Social norms
Performativity
Style
Style-shifting theories
• Accommodation/Style Matching
• Audience Design
Linguistic Style
Defined vis-à-vis 'content' or communicative intent
• What you say vs how you say it
• Same intent can be conveyed in multiple ways
Variants associated with social meaning – syntactic, lexical or phonological
• choice of words, syntax, utterance length, pitch and gestures
“Ending a sentence with a preposition is something up with which I will not put.”
―Winston S. Churchill
Linguistic Accommodation
Speakers can shift their behavior to become more similar to, or more different from, their conversation partners
Why do speakers accommodate?
Convergence reduces the social distance between speakers and makes one look more favorable and cooperative
Coordination is often unconscious
"like a dance"
Style Accommodation
• Posture
• Pause length
• Utterance length
• Self-disclosure
• Head nodding
• Backchannels
• ...
• Linguistic Style
Measuring Linguistic Style
• LIWC psycholinguistic categories
Feature family     Examples
Prepositions       at, to, with
Articles           the, an, a
Tentative words    maybe, perhaps
Conjunctions       and, whereas
Mark my words! Linguistic style accommodation in social media. C Danescu-Niculescu-Mizil, M Gamon, S Dumais. Proceedings of WWW 2011, 745-754
Main bhi sahi hoon yar!
I am doing great man!
Dude, How are you?
Good good. Aur yar tum batao?
I am doing great dude!
Symmetric Accommodation
Default Asymmetry
Divergent Asymmetry
Measuring Accommodation
• How do you know when speaker A is accommodating to speaker B?
• What if they just have similar styles (homophily)?
What we want – Does B's use of a style x trigger the use of x by A?
Does A use x more when B uses x more?
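One way to turn these two questions into a number, in the spirit of the Danescu-Niculescu-Mizil et al. paper cited earlier, is to compare how often A uses x right after B used it with how often A uses x in replies overall: Acc = P(A uses x | B used x) - P(A uses x). The sketch below is a minimal illustration under that assumption; the function name and the toy exchanges are hypothetical. Subtracting the baseline is what separates accommodation from homophily: two speakers who simply share a style get a score near zero.

```python
from typing import List, Tuple

def accommodation(pairs: List[Tuple[bool, bool]]) -> float:
    """Return P(A uses x | B used x) - P(A uses x).

    Each pair is (b_used_x, a_used_x) for one utterance by B
    and A's immediate reply to it.
    """
    if not pairs:
        raise ValueError("need at least one (utterance, reply) pair")
    triggered = [a for b, a in pairs if b]
    if not triggered:
        raise ValueError("B never used x, so the conditional is undefined")
    baseline = sum(a for _, a in pairs) / len(pairs)
    conditional = sum(triggered) / len(triggered)
    return conditional - baseline

# Toy exchanges (illustrative only): did B use x, did A use x in the reply?
exchanges = [
    (True, True), (True, True), (True, False),
    (False, False), (False, True), (False, False),
]
score = accommodation(exchanges)
print(round(score, 3))  # positive means A uses x more often right after B does
```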
Accommodation at various levels
• How much does an average speaker accommodate?
• How much does speaker A accommodate?
• How much does speaker A accommodate to speaker B in particular?
• How different is that from how much B accommodates A?
• Do speakers accommodate on style x more than style y?
Language Choice as Style
“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.”
― Nelson Mandela
Language Choice as Style
• Do speakers coordinate on language choice?
• Do speakers accommodate for one language more than another?
• In what ways is accommodation of language different from other linguistic styles, and why?
How to Design a Computational Sociolinguistics study?
Hypothesis
1. Literature Survey
2. Problem formulation
Data Collection
3. Spotting appropriate data
4. Automatic large scale collection
Automatic Processing of Data
Statistical Analysis
Nature of Data
• Curated content - Wikipedia, Blogs, News articles
• Social Media - Facebook, Twitter, Reddit
• Personal Chats - WhatsApp, Telegram, Emails
• Spoken conversations - Voicemail, Call recordings, Transcripts
• Modality
• Formality
• Audience size
• Context/setting/domain
• Privacy
Collecting SM Data
• Manual Scraping
  • Applicable to: Everything except private conversations
  • Cons: Labor and time intensive; less data
  • Pros: Selective data; useful for platforms that do not allow automatic scraping
• Automatic Scraping
  • Crawling: Blogs, Online communities, Quora, etc.
  • Through platform-supported APIs: Twitter, Reddit
  • Cons: Only available for some platforms; less control over variables
  • Pros: Enormous amount of data
• Hybrid Techniques
Privacy, Usage and Sharing
• Be aware of the Privacy Issues and Usage Policy of the data
  • Most publicly viewable data is accessible for download and usable for research. BUT …
  • On the other hand, most private data (like my Facebook posts shared with only Friends or a subgroup) cannot be used for any research.
  • Who owns the data: Admin? Users? Platform?
• Be extra cautious while carrying out studies on sensitive demographic attributes and reporting results
• Anonymity should be strictly maintained during data sharing/reporting
• Take permission from the ethics committee of your organization
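As one small piece of keeping identities out of shared data, user handles can be replaced by stable pseudonyms before anything leaves the research machine. The sketch below is only an illustration: the function names, the `user_` prefix, the handle and the key are all hypothetical, and keyed hashing is pseudonymization rather than full anonymization, since the text of a post can still identify its author.

```python
import hashlib
import hmac
import re

def pseudonymize(handle: str, secret_key: bytes) -> str:
    """Map a handle to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, handle.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]

def scrub_mentions(text: str, secret_key: bytes) -> str:
    """Replace every @mention in a post with its pseudonym."""
    return re.sub(
        r"@(\w+)",
        lambda match: "@" + pseudonymize(match.group(1), secret_key),
        text,
    )

# Hypothetical key and post; keep the real key private and never share it.
key = b"replace-with-a-long-random-secret"
post = "thanks @some_user for the pointer!"
clean = scrub_mentions(post, key)
print(clean)
```

The same handle always maps to the same pseudonym, so per-user analyses still work on the scrubbed data.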
Hindi Movie Scripts
tia tum apne bhai se itna jhaagadte kyun ho ?
arjun we have a pretty messed up relationship .
tia how come ?
arjun tum personal sawaal bahut poochti ho ?
tia dost aise hi bante hain .
arjun you don't give up .. do you ?
tia nope , i don't ! tell me ---
arjun paanch saal pehle , maine apni pehli novel likhni shuru ki thi ... aur mera bhai ki pehle novel flop ho chuki thi aur woh apni doosri novel pe atka hua tha.. lekin ek saal baad , jab uski second novel publish hui toh it was a best-seller .. problem sirf ye thi ki woh novel almost exactly mera story idea tha ..
tia noooo ..
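A script in this shape is easy to turn into structured turns. The sketch below assumes one turn per line with the speaker's name as the first whitespace-separated token, which is how the excerpt above is laid out; the `Turn` class and both function names are hypothetical helpers, not part of any released tool.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Turn:
    speaker: str
    tokens: List[str]

def parse_script(lines: List[str]) -> List[Turn]:
    """Split each line into a speaker (first token) and the spoken tokens."""
    turns = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and lines with a speaker but no words
        turns.append(Turn(speaker=parts[0], tokens=parts[1:]))
    return turns

def adjacent_pairs(turns: List[Turn]) -> List[Tuple[Turn, Turn]]:
    """Pair each turn with the next one when the speaker changes."""
    return [(prev, nxt) for prev, nxt in zip(turns, turns[1:])
            if prev.speaker != nxt.speaker]

raw = [
    "tia tum apne bhai se itna jhaagadte kyun ho ?",
    "arjun we have a pretty messed up relationship .",
    "tia how come ?",
]
turns = parse_script(raw)
pairs = adjacent_pairs(turns)
for prev, nxt in pairs:
    print(prev.speaker, "->", nxt.speaker)
```

The (utterance, reply) pairs are the unit on which accommodation can later be measured.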
Measuring Accommodation of Language Choice
1. Think about which features you can extract from the data that will be useful.
2. How will they vary across turns/speakers?
3. When can you say it is being accommodated? State a hypothesis.
4. Try to state it as a formula.
What we want – Does B's use of a style x trigger the use of x by A?
Does A use x more when B uses x more?
• Need to know the language of every dialog word
• Number or ratio of words from a language, per dialog
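Once every word carries a language tag, the per-dialog ratio is a one-line computation. The sketch below assumes the tags are already available (here they were assigned by hand for the first four turns of the script excerpt, using the hypothetical labels `hi`, `en` and `other` for punctuation); the function name is likewise an assumption.

```python
from typing import List, Optional

def language_fraction(tags: List[str], language: str) -> Optional[float]:
    """Fraction of language-bearing words in a turn that are in `language`.

    Tokens tagged "other" (punctuation, symbols) are ignored.
    Returns None when the turn has no language-bearing words.
    """
    counted = [tag for tag in tags if tag != "other"]
    if not counted:
        return None
    return counted.count(language) / len(counted)

# Hand-assigned word-level tags for the first four turns of the excerpt.
tagged_turns = [
    ("tia", ["hi"] * 8 + ["other"]),    # tum apne bhai se itna jhaagadte kyun ho ?
    ("arjun", ["en"] * 7 + ["other"]),  # we have a pretty messed up relationship .
    ("tia", ["en", "en", "other"]),     # how come ?
    ("arjun", ["hi", "en", "hi", "hi", "hi", "hi", "other"]),  # tum personal sawaal ...
]

fractions = [(speaker, language_fraction(tags, "hi")) for speaker, tags in tagged_turns]
for speaker, fraction in fractions:
    print(speaker, fraction)
```

Comparing a speaker's Hindi fraction with the fraction in the turn they are replying to is then one candidate signal of accommodation.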
How to Design a Computational Sociolinguistics study?
Hypothesis
1. Literature Survey
2. Problem formulation
Data Collection
3. Spotting appropriate data
4. Automatic large scale collection
Automatic Processing of Data
5. Preprocessing of Data
6. NLP techniques for modeling and prediction
Statistical Analysis
7. Hypothesis testing
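For the hypothesis-testing step, one simple option is a permutation test: shuffle the replies so that any link between what B did and what A did next is broken, and check how often the shuffled data yields a score as large as the observed one. The sketch below is one possible test under stated assumptions, not the method used in the studies above; the score (conditional probability minus baseline), the function names and the toy counts are all illustrative.

```python
import random
from typing import List, Tuple

def accommodation(pairs: List[Tuple[bool, bool]]) -> float:
    """P(A uses x | B used x) - P(A uses x) over (b_used_x, a_used_x) pairs."""
    triggered = [a for b, a in pairs if b]
    if not pairs or not triggered:
        raise ValueError("need pairs in which B uses x at least once")
    return sum(triggered) / len(triggered) - sum(a for _, a in pairs) / len(pairs)

def permutation_p_value(pairs: List[Tuple[bool, bool]],
                        n_permutations: int = 2000,
                        seed: int = 0) -> float:
    """One-sided p-value: how often shuffled replies score >= the observed score."""
    rng = random.Random(seed)
    observed = accommodation(pairs)
    triggers = [b for b, _ in pairs]
    replies = [a for _, a in pairs]
    at_least_as_large = 0
    for _ in range(n_permutations):
        shuffled = replies[:]
        rng.shuffle(shuffled)  # break the pairing between utterance and reply
        if accommodation(list(zip(triggers, shuffled))) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (n_permutations + 1)

# Toy data (illustrative only): A mostly mirrors B.
toy_pairs = ([(True, True)] * 18 + [(True, False)] * 2
             + [(False, True)] * 2 + [(False, False)] * 18)
p_value = permutation_p_value(toy_pairs)
print(p_value)
```

A small p-value says the observed accommodation is unlikely if replies were unrelated to the preceding utterance; it does not by itself rule out confounds such as topic.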
Steps of Data Processing
Raw data → Structured data + Meta data → Text analytics
NLP tools: Rule-based and Heuristics; Machine Learning based
Unstructured to structured
Skills, commitment, focus and momentum in a cohesive team is the winning formula for startups. #startup #Entreprenuership (0 replies, 0 retweets, 2 likes)
Marvin Danig @marvindanig · Jan 21: Tooth ache 😖. (0 replies, 0 retweets, 0 likes)
Graham Neubig @gneubig · Jan 23 (retweeted by Aasish Pappu): One paper doing tons of NLP tasks with neural nets: analogies, relations, co-reference, translation http://boballen.info/RBA/PAPERS/NL-BP/nl-bp.pdf … . From 1987... (0 replies, 19 retweets, 63 likes)
mrbrown @mrbrown · 11 hours ago: If I hear another reference to simi Koo Ji Koo Ji on Channel 8, I will punch the tv.
NLP tools for Social Media
• Normalization
• Systems/techniques specifically built for SMD.
Normalization: dis za twt → This is a tweet. → Std. En-Hi MT → यह एक ट्वीट है।
En-Hi Tweet MT System: dis za twt → यह एक ट्वीट है।
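The normalization route can be illustrated with a toy lookup-based normalizer. The sketch below is a deliberately naive baseline, assuming a hand-made lexicon (three entries taken from the example above) and a rule that collapses letters repeated three or more times; real systems are typically learned from annotated data.

```python
import re

# Hypothetical mini-lexicon mapping non-standard forms to standard English.
NORMALIZATION_LEXICON = {
    "dis": "this",
    "za": "is a",
    "twt": "tweet",
}

def normalize(text: str, lexicon: dict) -> str:
    """Normalize whitespace-separated tokens with a lookup table.

    Runs of three or more identical characters are collapsed to one
    before the lookup; tokens missing from the lexicon are kept unchanged.
    """
    normalized = []
    for token in text.split():
        squeezed = re.sub(r"(.)\1{2,}", r"\1", token.lower())
        normalized.append(lexicon.get(squeezed, token))
    return " ".join(normalized)

print(normalize("dis za twt", NORMALIZATION_LEXICON))  # this is a tweet
```

The output of such a step could then be passed to a standard-language tool, which is the first of the two routes above.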
Fortunately, there are ready-made basic tools
• http://www.cs.cmu.edu/~ark/TweetNLP/
  • tokenizer,
  • part-of-speech tagger,
  • hierarchical word clusters,
  • dependency parser
  • annotated corpora
  • web-based annotation tools.
Hardly anything for Indian languages
If specific tools have to be built, typically you would:
1. Use Machine Learning Systems
2. Annotate SM data
3. Leverage existing tools and resources for standard language
4. Use other tools and resources for SM data
Word-level Language Detection
Dilwale vs. Bajirao Mastani: Even Super-Films Get the Monday Blues
Wat n awesum movie it wazzzz! sabko dekhna chahiye
What was your favourite moment at the concert? Was war für euch der schönste Moment
Transliteration
Spelling Variations
Ambiguity
Named Entities
NO Training Data
Kibrisa geldigim … god warum? ich mochte nicht hier
Tr Tr X Tr Tr Ge Ge Ge Ge
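To make the task concrete, here is a naive word-level tagger: look each word up in per-language word lists, then give unknown words the tag of their nearest tagged neighbour. This is only a baseline sketch and not the technique of the paper cited earlier; the tiny lexicons, the tag names and the function names are assumptions, and the challenges listed above (spelling variation, ambiguity, named entities) are exactly where it breaks down.

```python
from typing import Dict, List, Set

def tag_words(tokens: List[str], lexicons: Dict[str, Set[str]],
              unknown: str = "X") -> List[str]:
    """Tag a token with a language only if exactly one lexicon contains it."""
    tags = []
    for token in tokens:
        word = token.lower()
        matches = [lang for lang, vocab in lexicons.items() if word in vocab]
        tags.append(matches[0] if len(matches) == 1 else unknown)
    return tags

def smooth(tags: List[str], unknown: str = "X") -> List[str]:
    """Fill unknown tags from the previous known tag, then from the next one."""
    out = list(tags)
    last = None
    for i, tag in enumerate(out):          # forward fill
        if tag == unknown:
            if last is not None:
                out[i] = last
        else:
            last = tag
    following = None
    for i in range(len(out) - 1, -1, -1):  # backward fill for leading unknowns
        if out[i] == unknown:
            if following is not None:
                out[i] = following
        else:
            following = out[i]
    return out

# Tiny illustrative lexicons; real ones would hold many thousands of words.
lexicons = {"en": {"movie", "it"}, "hi": {"sabko", "dekhna", "chahiye"}}
tokens = "Wat n awesum movie it wazzzz sabko dekhna chahiye".split()

raw_tags = tag_words(tokens, lexicons)
final_tags = smooth(raw_tags)
print(list(zip(tokens, final_tags)))
```

Words like "Wat" and "awesum" are missed by the lookup and only recovered through context, which shows why spelling variation is hard for dictionary-based methods.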
Design Choices
Unit of conversation?
Delayed accommodation?
Multi-party conversations?
Do we accommodate for all types of code-choice?
Demographic variables and styles
Concluding Remarks
• In computational social science, two types of analysis:
  • Structure [network]
  • Content [linguistic/visual]
• Content can further be analyzed as
  • Information content
  • Information delivery style
Not much work in terms of deeper content analysis
Write to us if you have any queries or want access to a reading list of computational sociolinguistics papers