Tutorial on Computational Sociolinguistics: Interaction between society, language, data and algorithms

Anshul Bawa, Monojit Choudhury, Microsoft Research India

Outline of the Tutorial

• A brief introduction to Computational Sociolinguistics [60 min]

• What, why and how?

• Some of our research in the area

• A hands-on demo on use of Bollywood scripts for CSL studies [60 + 20 min]

• Critique of methods, Q&A and open discussion [20 min]

We will learn

• About sociolinguistics

• How to design large-scale data-driven social science studies

• Problem formulation

• Identifying the data

• Preparing the data

• Use of machine learning and statistics for analyzing the data

• Interpreting the results

• About stereotyping in Bollywood ;-)

Language Stereotypes

Myth and Reality

Of Accents and Dialects

Sociolinguistics is the study of the effect of any and all aspects of society, including cultural norms, expectations, and context, on the way language is used, and society's effect on language.

Sociolinguistics studies the correlation between SOCIAL variables and LINGUISTIC variables

Social variables: Age, Gender, Geographical region, Education, Occupation, Economic Class, Rural/Urban, Political Orientation, Sexual Orientation

Linguistic variables: Pronunciation, Intonation/Prosody, Vocabulary, Language Choice, Syntactic Constructs, Pragmatic Constructs

Very little data for Sociolinguistic Studies

The Formality Continuum

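[Figure: the formality continuum, an axis running from High to Low formality. Text types shown along it include: legal documents; printed text (literature, news); blog; email; tweets; FB comments; chat, SMS; casual speech.]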

Thanks to Social Media…

• Speech data is expensive; social media is a good proxy

• Personal conversations

• Socially grounded data

What’s common to all these Tweets?

• ni som har möjlighet att delta i missing people's sökande eftersnälla snälla gör det!!!!

• Adik… sem brape boleh bwak kenderaan? normal parent question – UiTMLendufornia

• jit fi la fin du mois de decembre kan ljaw bared ktir wttalj

They are all Code-Mixed

Code-Mixing or Code-Switching is the mixing of more than one language in a single conversation or utterance.

MULTILINGUAL SOCIETIES provide unique and interesting challenges

• Code-switching

• Language preference

• Linguistic Accommodation of Language choice

50% of the world’s population is multilingual

Zo vs. Ruuh

Good. Aur tum kaise ho?

I m feeling sleepy

Hey, Whaddup?!

um ok…ignoring and moving along...

Me too yaar.

Insomnia is the worst.

Wait, what?

wanna count sheep?

wassup!

The Mélange Team: Kalika Bali, Monojit Choudhury, Sunayana Sitaram, Indrani Medhi Thies, Anshul Bawa, Adithya Pratapa, Brij Srivastava

Past members: Ashutosh Baheti, Shruti Rijhwani, Royal Sequeira, Chandra Maddila and a bunch of interns

Project Mélange: https://www.microsoft.com/en-us/research/project/melange/

How much do people around the world CODE-SWITCH?

Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique.

Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, Chandra Maddila

ACL 2017

Hindi-English Code-Switching on Social Media

In public pages from Facebook (of Indian celebrities, movies and BBC Hindi News)

• ALL sufficiently long threads were multilingual

• 17.2% of the comments/posts have code-mixing

Bali et al. I am borrowing ya mixing: An analysis of English-Hindi Code-mixing in Facebook. 1st Workshop on Computational Approaches to Code-switching, EMNLP 2014

Worldwide language distribution of monolingual and code-switched tweets computed over 50M Tweets (restricted to the 7 languages)

3.5% of tweets are code-switched

Geographical Distribution of Code-switching on 8M Tweets from 24 cities

Fraction of monolingual English tweets is strongly negatively correlated (-0.85) with the fraction of code-switched tweets

This is surprising … especially for extremely multilingual US cities (e.g., Houston)

(?) ACCULTURATION takes place much faster in the US
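The correlation reported above is a Pearson coefficient over per-city fractions. A minimal sketch of how such a number can be computed is given here; the function name `pearson` and the five data points are invented for illustration and are not the study's data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up per-city fractions, for illustration only (NOT the study's data).
english_only = [0.90, 0.75, 0.60, 0.40, 0.20]
code_switched = [0.01, 0.02, 0.04, 0.05, 0.08]

r = pearson(english_only, code_switched)
print(round(r, 2))  # negative: cities with more English-only tweets switch less
```

A coefficient close to -1 means that cities with a larger share of monolingual English tweets tend to have a smaller share of code-switched tweets; it says nothing by itself about the cause.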

Dilemma of Multilinguals: What language should I Tweet in?

Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?

Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, Niloy Ganguly

EMNLP 2016

When and why do multilinguals prefer a certain language?

It’s unpredictable!

Topic change

Puns

Emphasis

Emotion

Reported Speech

Language Preference

Hypothesis: Hindi-English bilinguals use Hindi for expressing emotional content and English for expressing facts

We might praise you in English, but gaali to Hindi me hi denge!

Study of 830K Tweets from Hi-En bilinguals

1. The native language, Hindi, is strongly preferred (10 times more) for negativity and swearing

2. English is used far more for positive sentiment than negative

3. Language change often corresponds with changing sentiment

[Figure: fraction of tweets with swear words, Hindi vs. English]

Some Remarks

Inferences drawn using majority-language data are likely to be misleading for multilingual societies

Intriguing sociolinguistic questions from our observations

- English is highly preferred for positive sentiment expression. Is it because English is the language of aspiration in India?

- Variance across different multilingual communities and languages

- Is social media actually representative of society?

For more information …

• Nguyen, Dong, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. "Computational Sociolinguistics: A Survey." Computational Linguistics (2016)

• The Code-mixing Blog: Poco Mix Maadi, https://pocomixmaadi.wordpress.com

• Talk by Dr. Animesh Mukherjee [11:30 – 12:00]: Language of social media: hashtags, topics and mixing

Language, individual and the society

Functions of Language

Dynamics of Language

Structure of Language

Interaction between Language & Society

Functions of Language

Dynamics of Language

Structure of Language

Functions of Society

Dynamics of Society

Structure of Society

Sociology of Language

Sociolinguistics

Online Social Networks

From an Individual’s perspective (node)

• Can we use NLP to predict an individual’s

• Moods and Mental state

• Habits and Behavior

• Demographic attributes – gender, ethnicity, region and language, education

• Health: Mental, physical

• Language acquisition

Example 1: Individual

Predicting Depression via Social Media. M De Choudhury, M Gamon, S Counts, E Horvitz. ICWSM 2013

• Crowdsourcing to compile a set of Twitter users who report being diagnosed with clinical depression, based on a standard psychometric instrument.

• Through their social media postings over a year preceding the onset of depression, measure behavioral attributes relating to social engagement, emotion, language and linguistic styles, ego network, and mentions of antidepressant medications.

• Leverage these behavioral cues to build a statistical classifier

“decrease in social activity, raised negative affect, highly clustered ego networks, heightened relational and medicinal concerns, and greater expression of religious involvement.”
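To make the "statistical classifier" step concrete, here is a minimal sketch that trains a logistic regression on two behavioral features. The slide does not say which classifier was used, so logistic regression is a stand-in; the feature names (posting activity, share of negative-affect words) and all numbers are invented for illustration.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train_logistic(rows, labels, epochs=2000, lr=0.5):
    """Fit weights (the last one is the bias) by batch gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for row, label in zip(rows, labels):
            x = list(row) + [1.0]  # append constant 1.0 for the bias term
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - label
            for i, v in enumerate(x):
                grad[i] += error * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def predict(weights, row):
    """Estimated probability of the positive class for one feature row."""
    x = list(row) + [1.0]
    return sigmoid(sum(w * v for w, v in zip(weights, x)))

# Hypothetical features per user: (posting activity, negative-affect share),
# both scaled to [0, 1]. Labels: 1 = reported depression, 0 = control.
rows = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.15), (0.2, 0.7), (0.3, 0.8), (0.1, 0.6)]
labels = [0, 0, 0, 1, 1, 1]

weights = train_logistic(rows, labels)
print(round(predict(weights, (0.15, 0.75)), 2))  # low activity, high negative affect
print(round(predict(weights, (0.85, 0.10)), 2))  # high activity, low negative affect
```

A real study would use many more features and users, hold out data for evaluation, and treat predictions about mental health with the caution discussed in the privacy section of this tutorial.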

From Society’s perspective (whole network)

• Language evolution

• Diffusion of linguistic innovation

• Effect of Social influence on language change

• Prevalence of certain traits: smoking, depression or swearing

• Correlation between traits and demographic factors

Examples of code-mixing and code-choice that we covered earlier.

From a group’s perspective (community)

• Dominance hierarchy

• Dialectal features (slangs, lingos)

• Homogeneity vs. language use

• Inclusivity

• New member dynamics

• Social ostracizing and outcasting

From a Relationship’s perspective (edge)

• Can we use NLP to predict

• Dominance

• Formality

• Politeness

• Threats, humiliation, stalking

• Accommodation

Language and Social Interaction

Social contexts of conversation

• Addressee/Audience

• Topic

• Social goals of speakers

Speaker agency

Individual variation

Social norms

Performativity

Style

Style-shifting theories

• Accommodation/Style Matching

• Audience Design

Linguistic Style

Defined vis-à-vis 'content' or communicative intent

• What you say vs how you say it

• Same intent can be conveyed in multiple ways

Variants associated with social meaning – syntactic, lexical or phonological

• choice of words, syntax, utterance length, pitch and gestures

“Ending a sentence with a preposition is something up with which I will not put.”

―Winston S. Churchill

Linguistic Accommodation

Speakers can shift their behavior to become more similar to, or more different from, their conversation partners

Why do speakers accommodate?

Convergence reduces the social distance between speakers and makes one look more favorable and cooperative

Coordination is often unconscious

"like a dance"

Style Accommodation

• Posture

• Pause length

• Utterance length

• Self-disclosure

• Head nodding

• Backchannels

• ...

• Linguistic Style

Measuring Linguistic Style

• LIWC psycholinguistic categories

Feature family      Examples
Prepositions        at, to, with
Articles            the, an, a
Tentative words     maybe, perhaps
Conjunctions        and, whereas

Mark my words! Linguistic style accommodation in social media C Danescu-Niculescu-Mizil, M Gamon, S Dumais Proceedings of WWW 2011, 745-754
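As a rough illustration of how such feature families turn an utterance into style markers, the sketch below checks which families occur in a sentence. `STYLE_LEXICON` is a tiny stand-in built from a few of the example words above; the actual LIWC dictionaries are much larger and are not reproduced here. The function name `style_markers` is also an assumption.

```python
import re

# Tiny stand-in lexicon built from the example words on the slide;
# NOT the real LIWC dictionaries.
STYLE_LEXICON = {
    "prepositions": {"at", "to", "with"},
    "articles": {"the", "an", "a"},
    "conjunctions": {"and", "whereas"},
}

def style_markers(utterance):
    """Return the set of feature families that occur in the utterance."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    return {family for family, words in STYLE_LEXICON.items() if tokens & words}

print(sorted(style_markers("I went to the market and bought an apple")))
# → ['articles', 'conjunctions', 'prepositions']
```

Treating each family as a binary marker per utterance (present or absent) keeps the later accommodation measurement simple: for each style x, an utterance either uses x or it does not.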

Main bhi sahi hoon yar!

I am doing great man!

Dude, How are you?

Good good. Aur yar tum batao?

I am doing great dude!

Symmetric Accommodation

Default Asymmetry

Divergent Asymmetry

Measuring Accommodation

• How do you know when speaker A is accommodating to speaker B?

• What if they just have similar styles (homophily)?

What we want – Does B's use of a style x trigger the use of x by A?

Does A use x more when B uses x more?
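One common way to turn these two questions into a number is to compare how often A uses x right after B used it with how often A uses x overall: P(A uses x | B used x) − P(A uses x). Subtracting the baseline is what separates accommodation from mere similarity of styles. The sketch below is one possible operationalization under that assumption; the function name and the toy exchanges are invented.

```python
def accommodation(exchanges):
    """
    exchanges: list of (b_uses_x, a_uses_x) booleans, one per pair of
    B's utterance and A's reply to it.
    Returns P(A uses x | B used x) - P(A uses x).
    """
    if not exchanges:
        raise ValueError("no exchanges")
    replies_after_x = [a for b, a in exchanges if b]
    if not replies_after_x:
        raise ValueError("B never uses x, so the conditional is undefined")
    p_conditional = sum(replies_after_x) / len(replies_after_x)
    p_baseline = sum(a for _, a in exchanges) / len(exchanges)
    return p_conditional - p_baseline

# Toy data: A replies with x in 2 of 3 cases after B used x, in 3 of 6 overall.
toy = [(True, True), (True, True), (True, False),
       (False, False), (False, True), (False, False)]
print(round(accommodation(toy), 3))  # → 0.167

# A speaker who always uses x shows no accommodation, only a shared habit.
print(accommodation([(True, True), (False, True)]))  # → 0.0
```

A positive value suggests convergence on x, a negative value divergence, and a value near zero suggests that A's use of x does not depend on B's.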

Accommodation at various levels

• How much does an average speaker accommodate?

• How much does speaker A accommodate?

• How much does speaker A accommodate to speaker B in particular?

• How different is that from how much B accommodates A?

• Do speakers accommodate on style x more than style y?

Language Choice as Style

“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his language, that goes to his heart.”

― Nelson Mandela

Language Choice as Style

• Do speakers coordinate on language choice?

• Do speakers accommodate for one language more than another?

• In what ways is accommodation of language different from other linguistic styles, and why?

Hands-on Exercise

How to Design a Computational Sociolinguistics study?

Hypothesis

1. Literature Survey

2. Problem formulation

Data Collection

3. Spotting appropriate data

4. Automatic large-scale collection

Automatic Processing of Data

Statistical Analysis

Nature of Data

• Curated content - Wikipedia, Blogs, News articles

• Social Media - Facebook, Twitter, Reddit

• Personal Chats - WhatsApp, Telegram, Emails

• Spoken conversations - Voicemail, Call recordings, Transcripts

• Modality

• Formality

• Audience size

• Context/setting/domain

• Privacy

Collecting SM Data

• Manual Scraping

• Applicable to: Everything except private conversations

• Cons: Labor- and time-intensive, so it yields less data

• Pros: Selective data; useful for platforms that do not allow automatic scraping

• Automatic Scraping

• Crawling: Blogs, Online communities, Quora, etc.

• Through platform-supported APIs: Twitter, Reddit

• Cons: Only available for some platforms; less control over variables

• Pros: Enormous amount of data

• Hybrid Techniques

Privacy, Usage and Sharing

• Be aware of the Privacy Issues and Usage Policy of the data

• Most publicly viewable data is accessible for download and usable for research. BUT …

• On the other hand, most private data (like my Facebook posts shared with only Friends or a subgroup) cannot be used for any research.

• Who owns the data: Admin? Users? Platform?

• Be extra cautious while carrying out studies on sensitive demographic attributes and reporting results

• Anonymity should be strictly maintained during data sharing/reporting

• Take permission from the ethics committee of your organization

Hindi Movie Scripts

• 32 movies

• 20K dialogs, 240K words

• 24% words in English (!)

Hindi Movie Scripts

tia tum apne bhai se itna jhaagadte kyun ho ?

arjun we have a pretty messed up relationship .

tia how come ?

arjun tum personal sawaal bahut poochti ho ?

tia dost aise hi bante hain .

arjun you don't give up .. do you ?

tia nope , i don't ! tell me ---

arjun paanch saal pehle , maine apni pehli novel likhni shuru ki thi ... aur mera bhai ki pehle novel flop ho chuki thi aur woh apni doosri novel pe atka hua tha.. lekin ek saal baad , jab uski second novel publish hui toh it was a best-seller .. problem sirf ye thi ki woh novel almost exactly mera story idea tha ..

tia noooo ..

http://goo.gl/ehZjqi

Measuring Accommodation of Language Choice

1. Think of what features you can extract from the data that will be useful.

2. How will they vary across turns/speakers?

3. When can you say it is being accommodated? State a hypothesis.

4. Try to state it as a formula.

What we want – Does B's use of a style x trigger the use of x by A?

Does A use x more when B uses x more?

• Need to know the language of every dialog word

• Number or ratio of words from a language, per dialog
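The per-dialog feature in the last bullet can be sketched as follows, assuming each script line has the form "speaker word word ..." as in the excerpt shown earlier. The word-level tagger is passed in as a function; `toy_tagger` and its word list are hypothetical placeholders for a real language detector.

```python
def parse_dialog(line):
    """Split 'speaker word word ...' into (speaker, list of word tokens)."""
    speaker, *tokens = line.split()
    # Keep only tokens containing a letter, dropping bare punctuation.
    words = [t for t in tokens if any(ch.isalpha() for ch in t)]
    return speaker, words

def english_ratio(words, tag_word):
    """Fraction of words tagged 'en'; None if the dialog has no words."""
    if not words:
        return None
    return sum(1 for w in words if tag_word(w) == "en") / len(words)

# Hypothetical tagger: a tiny hand-made English word list, everything else 'hi'.
TOY_ENGLISH = {"we", "have", "a", "pretty", "messed", "up",
               "relationship", "how", "come", "personal"}

def toy_tagger(word):
    return "en" if word.lower() in TOY_ENGLISH else "hi"

script = [
    "tia tum apne bhai se itna jhaagadte kyun ho ?",
    "arjun we have a pretty messed up relationship .",
    "tia how come ?",
    "arjun tum personal sawaal bahut poochti ho ?",
]
for line in script:
    speaker, words = parse_dialog(line)
    print(speaker, english_ratio(words, toy_tagger))
```

With a ratio per dialog, "uses English" can be defined by a threshold (for example, ratio above 0.5), which turns each turn into a binary marker that an accommodation measure can consume.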

Data processing

How to Design a Computational Sociolinguistics study?

Hypothesis

1. Literature Survey

2. Problem formulation

Data Collection

3. Spotting appropriate data

4. Automatic large-scale collection

Automatic Processing of Data

5. Preprocessing of Data

6. NLP techniques for modeling and prediction

Statistical Analysis

7. Hypothesis testing

Steps of Data Processing

Raw data → Structured data + Metadata → Text analytics

Rule-based and Heuristics

Machine Learning based

NLP tools

Unstructured to structured

Example tweets as they appear on the timeline:

• Skills, commitment, focus and momentum in a cohesive team is the winning formula for startups. #startup #Entreprenuership (0 replies, 0 retweets, 2 likes)

• Marvin Danig @marvindanig · Jan 21: Tooth ache 😖. (0 replies, 0 retweets, 0 likes)

• Graham Neubig @gneubig · Jan 23 (retweeted by Aasish Pappu): One paper doing tons of NLP tasks with neural nets: analogies, relations, co-reference, translation http://boballen.info/RBA/PAPERS/NL-BP/nl-bp.pdf … . From 1987... (0 replies, 19 retweets, 63 likes)

• mrbrown @mrbrown · 11 hours ago: If I hear another reference to simi Koo Ji Koo Ji on Channel 8, I will punch the tv.

Twitter JSON Format
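A minimal sketch of turning one tweet in JSON form into a flat record with text and metadata follows. The field names (`id_str`, `created_at`, `text`, `lang`, `retweet_count`, `favorite_count`, `user.screen_name`) follow the classic Twitter REST API tweet object as best recalled and should be checked against the API version in use; the sample values and the `to_record` helper are invented for illustration.

```python
import json

# Hand-written sample; the values are made up for illustration.
raw = '''{
  "id_str": "1",
  "created_at": "Mon Jan 23 10:00:00 +0000 2017",
  "text": "Wat n awesum movie it wazzzz! sabko dekhna chahiye",
  "lang": "en",
  "retweet_count": 0,
  "favorite_count": 2,
  "user": {"screen_name": "example_user"}
}'''

def to_record(tweet):
    """Keep only the fields a sociolinguistic study is likely to need."""
    return {
        "id": tweet["id_str"],
        "user": tweet["user"]["screen_name"],
        "time": tweet["created_at"],
        "text": tweet["text"],
        "platform_lang": tweet.get("lang"),  # platform's guess, one label per tweet
        "retweets": tweet.get("retweet_count", 0),
        "likes": tweet.get("favorite_count", 0),
    }

record = to_record(json.loads(raw))
print(record["user"], record["likes"], record["text"])
```

Note that a single tweet-level language label cannot represent a code-mixed tweet like this one, which is one reason word-level language detection is needed.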

Text Analytics

• Language detection

• Entity Recognition

• Sentiment Analysis

• Topic

NLP tools for Social Media

• Normalization

• Systems/techniques specifically built for social media data (SMD).

Normalization: dis za twt → [Normalization] → This is a tweet. → [Std. En-Hi MT] → यह एक ट्वीट है।

Specifically built system: dis za twt → [En-Hi Tweet MT System] → यह एक ट्वीट है।
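The simplest form of the normalization step is a lookup table from non-standard spellings to standard ones, sketched below. The table entries are hand-made for this example; practical normalizers are usually learned from data and also handle spellings they have never seen.

```python
# Hypothetical, hand-made lookup table; real systems learn such mappings.
NORMALIZATION_TABLE = {
    "dis": "this",
    "za": "is a",
    "twt": "tweet",
    "wat": "what",
    "awesum": "awesome",
}

def normalize(text):
    """Replace each known non-standard token; leave unknown tokens unchanged."""
    return " ".join(NORMALIZATION_TABLE.get(tok.lower(), tok) for tok in text.split())

print(normalize("dis za twt"))  # → this is a tweet
```

The normalized output can then be passed to tools built for standard language, such as a standard En-Hi machine translation system.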

Fortunately, there are ready-made basic tools

• http://www.cs.cmu.edu/~ark/TweetNLP/

• tokenizer,

• part-of-speech tagger,

• hierarchical word clusters,

• dependency parser

• annotated corpora

• web-based annotation tools.

Hardly anything for Indian languages

If specific tools have to be built, typically you would:

1. Use Machine Learning Systems

2. Annotate SM data

3. Leverage existing tools and resources for standard language

4. Use other tools and resources for SM data

Word-level Language Detection

Dilwale vs. Bajirao Mastani: Even Super-Films Get the Monday Blues

Wat n awesum movie it wazzzz! sabko dekhna chahiye

What was your favourite moment at the concert? Was war für euch der schönste Moment

Transliteration

Spelling Variations

Ambiguity

Named Entities

NO Training Data

Kibrisa geldigim … god warum? ich mochte nicht hier

Tr Tr X Tr Tr Ge Ge Ge Ge
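To see why these challenges matter, consider a naive baseline that tags each word by dictionary lookup and gives unknown words the tag of the nearest earlier known word. This is only a sketch for discussion, not the technique of the ACL 2017 paper; the two word lists are tiny hand-made assumptions.

```python
import re

# Tiny hand-made word lists, for illustration only.
EN = {"movie", "it", "what", "was", "awesome"}
HI = {"sabko", "dekhna", "chahiye"}

def tag_words(text):
    """Tag each token 'en', 'hi', or 'X' (unknown or ambiguous) by lookup."""
    tagged = []
    for token in re.findall(r"\w+", text.lower()):
        if token in EN and token not in HI:
            tag = "en"
        elif token in HI and token not in EN:
            tag = "hi"
        else:
            tag = "X"
        tagged.append((token, tag))
    return tagged

def fill_unknown(tagged):
    """Replace 'X' with the tag of the nearest earlier known token, if any."""
    filled, last = [], None
    for token, tag in tagged:
        if tag == "X" and last is not None:
            tag = last
        elif tag != "X":
            last = tag
        filled.append((token, tag))
    return filled

tags = fill_unknown(tag_words("Wat n awesum movie it wazzzz! sabko dekhna chahiye"))
print(tags)
```

On this example the spelling variants "wat", "n" and "awesum" stay untagged because no known word precedes them, while "wazzzz" inherits "en" from its left context. Dictionary lookup alone breaks down on exactly the transliteration, spelling-variation and ambiguity problems listed above.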

Measuring Code-Choice Accommodation

Critique of methods, Q&A and open discussion

Design Choices

Unit of conversation?

Delayed accommodation?

Multi-party conversations?

Do we accommodate for all types of code-choice?

Demographic variables and styles

Concluding Remarks

• In computational social science, two types of analysis:

• Structure [network]

• Content [linguistic/visual]

• Content can further be analyzed as

• Information content

• Information delivery style

Not much work in terms of deeper content analysis

Write to us if you have any queries or want access to a reading list of computational sociolinguistics papers

[email protected]

[email protected]