The Linguistics of Twitter - PyCon 2011 Presentation

'The Linguistics of Twitter' is a presentation from PyCon 2011 that I hope starts a dialogue about what we need in order to accurately measure the effects of social media.

American English Regional Dialects

Changing Speech Patterns
Changing Online Measurement

Michael D. Healy
mdh@michaeldhealy.com
http://michaeldhealy.com
@MichaelDHealy

Michael D. Healy

• Econometrics
• Linguistics
• Not an Engineer

Measuring and Influencing Online and Offline Behavior

Why am I here?

This Seemed Like an Interesting Problem

Plan of Action

• Background
• Where We Stand
  o Data Collection Interlude
• Historical Context
• Where We May Be Going
• Potential Solutions
  o Sort Of

Introduction: Hawaiian Pidgin Video

Background: Regional Differences In Word Choice

MrEverything6's Tweet

Dallas, Texas Region

coke - Coca-Cola or soft drink in general?

Coca-Cola Probably Wants To Know

Background: Regional Differences In Pronunciation
More Than Just Drawl

pin

Is that:
Pin a tail on the donkey.
-OR-
Give me a 'pin' to write with.

Where We Stand

Detailed Dialectal Map

http://aschmann.net/AmEng/

Wait!
Isn't This All Just Poor English?
They Don't Speak The King's English!

1) America Doesn't Have A King

2) English Doesn't Have An Authority Like:

French: L'Académie française

Spanish: Asociación de Academias de la Lengua Española

Numerous Others: http://en.wikipedia.org/wiki/List_of_language_regulators

Who Is Right?
Everyone

Prescriptive Linguistics: Tell You What Is Right

Descriptive Linguistics: Describe How You Communicate

Trying To Sell More Widgets?

Probably Descriptive Is Best

Selected American English Dialects:
• New England
• Northern
• North Midland
• South Midland
• NYC
• Western
• AAVE
• Hawaiian Pidgin

Historical Context

Linguists Thought TV Would Make Us All Sound The Same

Think Tom Brokaw

Area of 'Standard American English'

Not Overly Large
Not Largely Populated

Been To Wisconsin?

Seen Fargo?

Biggest Change In Spoken English Since 1750

Going On Right Now - After TV

'Oh yeah? Yeah'

Portions Of America Experience Some or All of the Northern Cities Vowel Shift

Sum This Up: People In The Northern Cities Region Are Producing A Very Different Sounding English From Other Dialects

America Has Been Multi-Lingual Since July 9, 1776

Where We May Be Going

~74% of Americans Live In A Megaregion

Megaregions Tied To Existing Dialect Regions

William Labov, PhD.
Professor of Linguistics
University of Pennsylvania
http://www.ling.upenn.edu/~wlabov/

Pretty Much The Authority on American English Dialects

'And instead of getting a pepper-and-salt effect, we find very clear and sharp divisions between the dialects of the United States, which are getting more different from each other as time goes on.'

Potential Solutions

One American Dialect Is Unique In Geography:

African-American Vernacular English (AAVE)

Not In A Geographically Contiguous Region

Center For Applied Linguistics.

"Thats the way baseball go."

Correct the Spelling & Grammar

    import enchant
    from nltk.metrics import edit_distance

    class SpellingReplacer(object):
        def __init__(self, dict_name='en', max_dist=2):
            self.spell_dict = enchant.Dict(dict_name)
            self.max_dist = max_dist  # the slide hard-coded 2 here

        def replace(self, word):
            # Correctly spelled words pass through unchanged.
            if self.spell_dict.check(word):
                return word
            suggestions = self.spell_dict.suggest(word)
            # Accept the top suggestion only if it is within max_dist edits.
            if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
                return suggestions[0]
            else:
                return word

Example 1

well im gonna go so i’ll talk to u lata 1

Corrected Example 1

Well mi Donna go so I'll talk to U late
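Why does the corrector accept 'Donna' and 'late'? A minimal sketch, assuming plain Levenshtein distance (inserts, deletes, substitutions) and the max_dist of 2 used above. The function below is a stand-alone illustration, not NLTK's implementation, and the word pairs are taken from Example 1.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

# Pairs from Example 1: original word, dictionary suggestion.
for original, suggestion in [("im", "mi"), ("gonna", "Donna"), ("lata", "late")]:
    print(original, suggestion, edit_distance(original, suggestion))
```

Every pair falls within two edits, so each wrong suggestion passes the threshold. 'later' is also two edits from 'lata', so under this sketch the result depends on which suggestion the dictionary ranks first.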

Build Out a Dictionary of Words

Regex Match and Replace

    proper_words = {
        'hater': ['enemy', 'jealous individual', 'not friend'],
        'coke': ['coke', 'soda', 'pop'],
    }

Which Region?
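One way to answer 'Which Region?' is to key each word's senses by region. This is only a sketch: REGIONAL_SENSES, resolve, and the region labels are hypothetical names, and the entries are illustrative rather than survey data.

```python
# Hypothetical region-keyed senses; entries are illustrative, not measured.
REGIONAL_SENSES = {
    "coke": {"South": "soft drink (generic)", "default": "Coca-Cola"},
}

def resolve(word, region):
    """Return the sense of `word` for `region`, falling back to a default sense."""
    senses = REGIONAL_SENSES.get(word.lower())
    if senses is None:
        return word  # no regional entry: leave the word as written
    return senses.get(region, senses["default"])

print(resolve("coke", "South"))
print(resolve("coke", "Northern"))
```

The region itself still has to come from somewhere, which is where geographic indexing comes in later.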

Example 2

well i gotta go, i’ll talk to you later aight bye 1

    import re

    replacement_patterns = [
        (r'gotta', 'got to'),
        (r"i'll", 'I will'),
        (r'aight', 'all right'),
    ]

    class RegexReplacer(object):
        def __init__(self, patterns=replacement_patterns):
            # Compile each pattern once, up front.
            self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

        def replace(self, text):
            s = text
            # Apply every replacement in order.
            for (pattern, repl) in self.patterns:
                (s, count) = re.subn(pattern, repl, s)
            return s

Example 2

well i gotta go, i’ll talk to you later aight bye 1

well i got to go, I will talk to you later All right Bye 1 (!?)

Here '1' has the concept of: I understand
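A slightly safer variant of the regex approach, as a sketch: word boundaries keep 'aight' from matching inside 'straight', and the apostrophe class also accepts the curly apostrophe that appears in the tweet. PATTERNS and normalize are names made up for this example. Note that the trailing '1' is still left alone, because no pattern knows what it means.

```python
import re

# (compiled pattern, replacement) pairs; \b restricts matches to whole words.
PATTERNS = [
    (re.compile(r"\bgotta\b", re.IGNORECASE), "got to"),
    (re.compile(r"\bi['’]ll\b", re.IGNORECASE), "I will"),
    (re.compile(r"\baight\b", re.IGNORECASE), "all right"),
]

def normalize(text):
    """Apply each whole-word replacement in order."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(normalize("well i gotta go, i’ll talk to you later aight bye 1"))
```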

Solution?
Bayesian Prediction Using a Custom Corpus

First Step: Tag Existing Data

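    import nltk.data

    # Load NLTK's pre-trained Punkt sentence tokenizer for English.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    def tokenize(para):
        # Split the text into sentences and print the resulting list.
        print(tokenizer.tokenize(para))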

Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol

Tokenized as:
'Oo shit she called I hit ignored..neva pick up on da first call..playa rule number 23 lol'

So lots of custom work to be done...
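Punkt splits text into sentences, and it finds no sentence break in '..', so the whole tweet comes back as one piece. A minimal sketch of the kind of custom work this implies: a regex word tokenizer that treats runs of dots as their own token. TOKEN_RE and tokenize_tweet are hypothetical names, and the pattern is an assumption about what counts as a token, not a tested standard.

```python
import re

# Words (with internal apostrophes), numbers, runs of two or more dots,
# or any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*|\d+|\.{2,}|[^\sA-Za-z\d]")

def tokenize_tweet(text):
    """Return the list of tokens found in a tweet."""
    return TOKEN_RE.findall(text)

print(tokenize_tweet("I hit ignored..neva pick up on da first call..playa rule number 23 lol"))
```

Tokens such as 'neva', 'da', and 'playa' then become features that can be tagged.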

_andBeautyKills: – after tonight, don’t leave your boy roun’ me,umma #true playa fareal.

Local To SF:
Neecy89: This african boy jus started askin me hella questions idk if he was tryin to be nice or tryna kill me lol
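Once tweets are tagged, Bayesian prediction can be as simple as a naive Bayes classifier over word counts. The sketch below assumes add-one smoothing and whitespace tokenization. The NaiveBayes class, the region labels, and the four training sentences are invented for illustration and are not real data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes(object):
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of documents
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.label_counts[label] += 1
        self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = float(sum(self.label_counts.values()))
        best_label, best_score = None, float("-inf")
        for label in self.label_counts:
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[label][w] + 1.0) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy corpus, for illustration only.
nb = NaiveBayes()
nb.train("that show was hella good", "NorCal")
nb.train("hella traffic on the bridge", "NorCal")
nb.train("y'all want a coke", "South")
nb.train("fixin to go y'all", "South")

print(nb.predict("hella questions"))
```

A real corpus would need far more tagged tweets per region, plus the custom tokenization discussed earlier.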

Geographic Indexing
SimpleGeo

    import simplegeo.shared, simplegeo.places
    from simplegeo.shared import Feature

    client = simplegeo.places.Client('your-oauth-token', 'your-oauth-secret')
    properties = {"province": "CA", "city": "San Francisco", "name": "SimpleGeo SF",
                  "country": "US", "phone": "+1 415 626 1375", "address": "41 Decatur St",
                  "postcode": "94103"}
    f = simplegeo.places.Feature((37.772392, -122.405752), properties=properties)
    client.add_feature(f)
    # Output shown on the slide:
    # 'SG_5uZpvipNjVaSbbDv5bvZaa_37.772392_-122.405752@1291847366'

Geographic Indexing
SimpleGeo: Queries

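    import simplegeo.places

    def start(lon, lat):
        # Credentials file: OAuth key on the first line, secret on the second.
        oauth, secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
        client = simplegeo.places.Client(oauth, secret)
        results = client.search(lon, lat)
        return results

    def search(lon, lat, tweet):
        results = start(lon, lat)
        # Print every nearby place whose name equals a word in the tweet.
        for word in tweet.split():
            for i in results:
                data = i.to_dict()
                if word == data['properties']['name']:
                    # The slide printed data['name']; the name is assumed to
                    # live under 'properties', as the test above implies.
                    print('%s %s' % (data['properties']['name'], word))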
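Matching word by word cannot find a multi-word place name such as 'SimpleGeo SF', and it is case-sensitive. A sketch of one alternative, which searches the tweet for each whole name instead. It runs on a plain list of names rather than live API results; match_places and the sample list are hypothetical.

```python
import re

def match_places(tweet, place_names):
    """Return the place names that appear in the tweet as whole phrases, ignoring case."""
    matches = []
    for name in place_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, tweet, re.IGNORECASE):
            matches.append(name)
    return matches

# Hypothetical names, standing in for what a places search might return.
nearby = ["SimpleGeo SF", "Decatur St"]
print(match_places("heading to simplegeo sf for the meetup", nearby))
```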

Potential Solutions: SimpleGeo-Tools

    import simplegeo.places
    import simplegeo.context

    class SimpleGeoAuth(object):
        def __init__(self):
            # Credentials file: OAuth key on the first line, secret on the second.
            self.oauth, self.secret = open('/home/michael/.simplegeo', 'r').read().strip().split('\n')
            self.places_client = simplegeo.places.Client(self.oauth, self.secret)
            self.context_client = simplegeo.context.Client(self.oauth, self.secret)

        def SimpleGeoContextualQuery(self, lat, lon, text):
            # Return the first nearby place whose name equals a word in the text.
            geo_results = self.places_client.search(lat, lon)
            for word in text.split():
                for geo_result in geo_results:
                    data = geo_result.to_dict()
                    if word == data['properties']['name']:
                        # The slide returned data['name']; the name is assumed
                        # to live under 'properties', as the test above implies.
                        return data['properties']['name'], word

        def SimpleGeoContextQuery(self, lat, lon):
            context_results = self.context_client.get_context(lat, lon)
            return context_results

Potential Solutions: Connect the APIs

References

Jacob Perkins: NLTK Master Ninja
Python Text Processing with NLTK 2.0 Cookbook
https://www.packtpub.com/python-text-processing-nltk-20-cookbook/book
http://streamhacker.com/

Eisenstein, J., O'Connor, B., Smith, N., and Xing, E. (2010). A Latent Variable Model for Geographic Lexical Variation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, October 2010.

Cheng, Z., Caverlee, J., and Lee, K. (2010). You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. In CIKM '10: Proceedings of the 19th ACM International Conference on Information and Knowledge Management.

Repustate: Sentiment Analysis API http://repustate.com/

Rapleaf Personalization API https://www.rapleaf.com/

SimpleGeo GIS Solution API http://simplegeo.com/

Michael D. Healy SimpleGeo-Tools

Michael D. Healy mdh@michaeldhealy.com http://michaeldhealy.com @MichaelDHealy

SimpleGeo-Tools https://github.com/michaeldhealy/SimpleGeo-Tools