Lexicon: exploring language trends on Facebook Walls

Roddy Lindsay, Data Team

What’s a Wall?

Walls are semi-public and public forums on profiles, groups, events, etc.

New / Old

Numbers

▪ Blogs

▪ 1.6 million posts per day (Technorati)

▪ ~18 posts per second

▪ Walls

▪ 12-20 million wall posts per day

▪ ~180 posts per second

▪ 5-9 million unique users per day

▪ 2-2.5 GB of unstructured text per day

Lexicon 101

Brief History of Lexicon

▪ First iteration: “Pulse” (2006)

▪ Interests in profile fields ranked by count

▪ E.g. “Top movies in San Francisco Network”

▪ Pros

▪ Structure through comma delimitation

▪ Cons

▪ Profile information is static (not updated frequently)

▪ Limited to profile field categories (movies, books, interests, TV shows, music)

Brief History of Lexicon

▪ Attempt 2:

▪ Extract terms from public and semi-public conversations between friends (on the Wall)

▪ Anonymize user data to respect privacy

▪ Plot time series data to show usage trends

▪ Pros

▪ Wall conversations are closer to real-life conversations

▪ Topics are constantly changing, giving a strong temporal signal

▪ Cons

▪ No structure

▪ Greater computational requirements

How does Lexicon work?

▪ Count occurrences of each word and bigram that is posted each day

▪ Aggregate by unique user to minimize the effect of spam

▪ Trim the long tail to handle data explosion

▪ Normalize for intraweek and seasonal variance by putting total posts in the denominator

▪ Interactive Flash charts rolled at home (used internally and externally for all Facebook reporting products)

“apple” “apple”
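The steps on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the production code: the tokenizer regex, the `min_users` cutoff, and the function names `terms` and `daily_term_share` are all assumptions made for the example. It counts unique users per (term, day), drops rare terms, and divides by the day's total posts.

```python
import re
from collections import defaultdict

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")

def terms(text):
    """Yield every word and every adjacent-word bigram in a post."""
    words = TOKEN_RE.findall(text.lower())
    yield from words
    for first, second in zip(words, words[1:]):
        yield f"{first} {second}"

def daily_term_share(posts, min_users=2):
    """posts: iterable of (day, user_id, text) tuples.

    Returns {(term, day): unique_users / total_posts_that_day},
    keeping only terms used by at least min_users distinct users.
    """
    users = defaultdict(set)        # (term, day) -> distinct user ids
    total_posts = defaultdict(int)  # day -> number of posts that day
    for day, user_id, text in posts:
        total_posts[day] += 1
        for term in terms(text):
            # A set means repeated posts by one user count once (spam damping).
            users[(term, day)].add(user_id)
    return {
        (term, day): len(ids) / total_posts[day]
        for (term, day), ids in users.items()
        if len(ids) >= min_users    # trim the long tail
    }
```

Because each user is stored in a set, one person posting “apple” a hundred times moves the curve no more than posting it once.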

How does Lexicon work?

▪ More technically...

▪ Use Scribe (distributed log file aggregation service built with Thrift) to collect wall post logs from web servers

▪ Have a 180-node Hadoop cluster that loads the log files into Hive, our homegrown data warehouse sitting on top of Hadoop

▪ Pipeline of Map-Reduce scripts (written in Python) that count the number of unique users for each (term, day) pair and trim the long tail

▪ Load into horizontally partitioned MySQL tier for user queries

▪ PHP front-end

▪ Memcached sits in front to cache common queries

▪ All of these are (or will be) open-source projects

▪ Facebook is an active contributor to most of these projects
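To make the Map-Reduce counting step concrete, here is a small single-process sketch of the map, sort, and reduce phases. It is an illustration under stated assumptions, not the actual pipeline scripts: the names `map_post`, `reduce_sorted`, and `run_local` are hypothetical, whitespace splitting stands in for real tokenization, and bigrams are left out for brevity.

```python
from itertools import groupby

def map_post(day, user_id, text):
    """Map step: emit one ((term, day), user_id) pair per word in the post."""
    for term in text.lower().split():
        yield (term, day), user_id

def reduce_sorted(pairs, min_users=2):
    """Reduce step: pairs must arrive sorted by key, as they would after a shuffle."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        unique_users = len({user_id for _, user_id in group})
        if unique_users >= min_users:   # trim the long tail
            yield key, unique_users

def run_local(posts, min_users=2):
    """Simulate map -> shuffle/sort -> reduce in one process."""
    mapped = [pair for post in posts for pair in map_post(*post)]
    mapped.sort(key=lambda pair: pair[0])
    return dict(reduce_sorted(mapped, min_users))
```

On a real cluster the framework performs the sort between the two steps; `run_local` only mimics that so the logic can be tried on a handful of posts.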

What is Lexicon useful for?

▪ Tracking news

▪ Lexicon shows relative chatter surrounding current events

▪ Can understand which events are of interest to the Facebook audience

“tibet” “died” (Heath Ledger)

▪ Natural language trends

▪ Words and phrases constantly enter and exit the lexicon

▪ Track the popularity of terms that are used in everyday conversation

“lulz” “pwned”

▪ Understanding the Facebook audience

▪ Lexicon trends can yield insights into Facebook demographics, user attitudes towards Facebook products, and how the products are used

“the add”

▪ Brand Mindshare

▪ Brands and commercial products are mentioned in Wall conversations, just as in face-to-face conversations

“verizon” “juno”

▪ Categories that are social in nature yield the strongest signal

▪ Entertainment, Mobile, Automotive, QSR, etc.

“honda”, “toyota”

▪ Measuring the success of sponsored gift campaigns on Facebook

▪ Sponsored gifts: images you can send to friends along with a Wall post

“coors”

Challenges

▪ Term disambiguation

▪ Words are used in a variety of contexts

▪ E.g. my cousin Wendy’s birthday vs. Wendy’s hamburgers

▪ Tracking each different context automatically with machine learning techniques is difficult

▪ Language classifiers, proper tokenization, and smart cleaning of the data can get us part way there

Challenges

▪ Sentiment

▪ Is the mention of a term positive, negative, neutral, something else?

▪ Most challenging aspects: irony, ambiguous sentiment terms, complex grammar

▪ Many top companies use humans to rate a sizable percentage of posts

▪ Numerous Ph.D. candidates have quit graduate school over this problem

▪ Obviously a difficult task...

Challenges

▪ Sentiment

▪ The language on Facebook wall posts is characterized by:

▪ slang, lulz

▪ mispellings

▪ blunt sentences.

▪ superfluous punctuation!!!

▪ absent punctuation for example

▪ emoticons ^_^

▪ acronyms, omg

▪ a big freaking mess

Challenges

▪ Sentiment

▪ Blunt language without complex grammar means that irony and sarcasm aren’t big issues

▪ Synonym identification (figuring out that “hotttt” == “hot”), subjective/objective classification, and tokenization are more troublesome

▪ Something to keep in mind: strong prior probability of a subjective post being positive (80-90% as rated by humans)

▪ Walls are not blogs or movie reviews

▪ Theory: users don’t want to appear to be negative, and so avoid making overtly negative comments for the most part

▪ Sentiment classifier that guesses positive every time gives the least error

▪ Maybe sentiment isn’t as important for us...
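The point about the always-positive classifier is simple arithmetic on the prior. The sketch below uses the 80-90% positive rate quoted above; the function names are hypothetical and the snippet only shows the bar any trained classifier would have to clear.

```python
def always_positive_error(positive_rate):
    """Error rate of a classifier that labels every subjective post positive."""
    return 1.0 - positive_rate

def beats_baseline(model_accuracy, positive_rate):
    """True only if a model is more accurate than always guessing positive."""
    return model_accuracy > positive_rate

# The human-rated prior from the slide: 80-90% of subjective posts are positive.
for rate in (0.80, 0.90):
    print(f"prior {rate:.0%}: baseline error {always_positive_error(rate):.0%}")
```

With a 90% prior, a classifier that is right 85% of the time still does worse than guessing positive on every post.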

Future trends for text analytics

▪ Data visualization

▪ Graph structure/Diffusion analysis

▪ Cloud computing

Thanks!
