Lexicon: exploring language trends on Facebook Walls
Roddy Lindsay, Data Team
What’s a Wall?
Walls are semi-public and public forums on profiles, groups, events, etc.
[Figure: Old vs. New]
Numbers
▪ Blogs
▪ 1.6 million posts per day (Technorati)
▪ ~18 posts per second
▪ Walls
▪ 12-20 million wall posts per day
▪ ~180 posts per second
▪ 5-9 million unique users per day
▪ 2-2.5 GB of unstructured text per day
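The per-second figures above follow from the daily totals. A quick check, assuming posts are spread evenly over a 24-hour day (the function name is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(posts_per_day):
    """Convert a daily post count into an average per-second rate."""
    return posts_per_day / SECONDS_PER_DAY

blog_rate = per_second(1.6e6)   # Technorati blog figure from the slide
wall_low = per_second(12e6)     # low end of the Wall range
wall_high = per_second(20e6)    # high end of the Wall range

print(f"blogs: {blog_rate:.1f}/s, walls: {wall_low:.0f}-{wall_high:.0f}/s")
```

The quoted ~180 Wall posts per second falls inside the computed range, roughly ten times the blog rate.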
Lexicon 101
Brief History of Lexicon
▪ First iteration: “Pulse” (2006)
▪ Interests in profile fields ranked by count
▪ E.g. “Top movies in San Francisco Network”
▪ Pros
▪ Structure through comma delimitation
▪ Cons
▪ Profile information is static (not updated frequently)
▪ Limited to profile field categories (movies, books, interests, TV shows, music)
Brief History of Lexicon
▪ Attempt 2:
▪ Extract terms from public and semi-public conversations between friends (on the Wall)
▪ Anonymize user data to respect privacy
▪ Plot time series data to show usage trends
▪ Pros
▪ Wall conversations are closer to real-life conversations
▪ Topics are constantly changing, giving a strong temporal signal
▪ Cons
▪ No structure
▪ Greater computational requirements
How does Lexicon work?
▪ Count occurrences of each word and bigram that is posted each day
▪ Aggregate by unique user to minimize the effect of spam
▪ Trim the long tail to handle data explosion
▪ Normalize for intraweek and seasonal variance by putting total posts in the denominator
▪ Interactive Flash charts built in-house (used internally and externally for all Facebook reporting products)
“apple” “apple”
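The counting steps above can be sketched in a few lines of Python. This is a minimal illustration, not Lexicon's actual code: the tokenizer, the `min_users` threshold, and the input format of `(day, user_id, text)` tuples are all assumptions.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9']+")  # assumed tokenizer: lowercase word characters

def terms(text):
    """Return the distinct words and bigrams that appear in one post."""
    tokens = TOKEN_RE.findall(text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)

def daily_series(posts, min_users=2):
    """Build {term: {day: unique_users / total_posts}} from (day, user_id, text) tuples."""
    users_by_term_day = defaultdict(set)
    posts_per_day = defaultdict(int)
    for day, user_id, text in posts:
        posts_per_day[day] += 1
        for term in terms(text):
            # A set per (term, day) means one spammy user counts only once.
            users_by_term_day[(term, day)].add(user_id)
    series = defaultdict(dict)
    for (term, day), users in users_by_term_day.items():
        if len(users) >= min_users:  # trim the long tail (threshold is hypothetical)
            series[term][day] = len(users) / posts_per_day[day]
    return dict(series)

posts = [
    ("2008-04-01", 1, "I love my new apple laptop"),
    ("2008-04-01", 1, "apple apple apple"),
    ("2008-04-01", 2, "Apple pie tonight"),
    ("2008-04-01", 3, "going to class"),
]
series = daily_series(posts)
```

In this toy input only “apple” survives the trim: two distinct users mentioned it across four posts, so its value for the day is 0.5, and user 1's repeated mentions do not inflate the count.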
How does Lexicon work?
▪ More technically...
▪ Use Scribe (distributed log file aggregation service built with Thrift) to collect wall post logs from web servers
▪ Have a 180-node Hadoop cluster that loads the log files into Hive, our homegrown data warehouse sitting on top of Hadoop
▪ Pipeline of Map-Reduce scripts (written in Python) that count the number of unique users for each (term, day) pair and trim the long tail
▪ Load into horizontally partitioned MySQL tier for user queries
▪ PHP front-end
▪ Memcached sits in front to cache common queries
▪ All of these are (or will be) open-source projects
▪ Facebook is an active contributor to most of these projects
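The Map-Reduce and partitioning steps above might look roughly like the following. This is a single-process sketch under stated assumptions: the record layout, the `min_users` threshold, and the CRC32-based `shard_for` routing are illustrative choices, not details given in the talk, and bigrams are left out for brevity.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def mapper(records):
    """Emit ((term, day), user_id) for every distinct term in every wall post."""
    for day, user_id, text in records:
        for term in set(text.lower().split()):
            yield (term, day), user_id

def reducer(pairs, min_users=2):
    """Count distinct users per (term, day) key and drop keys below min_users."""
    # sorted() stands in for the shuffle/sort phase a real cluster performs.
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        users = {user_id for _, user_id in group}
        if len(users) >= min_users:  # long-tail trim (threshold is hypothetical)
            yield key, len(users)

def shard_for(term, num_shards=4):
    """Choose a database shard for a term using a stable hash (assumed scheme)."""
    return zlib.crc32(term.encode("utf-8")) % num_shards

records = [
    ("2008-04-01", 1, "tibet protest"),
    ("2008-04-01", 2, "free tibet"),
    ("2008-04-01", 2, "tibet tibet"),
]
counts = dict(reducer(mapper(records)))
```

Keying on the (term, day) pair means each reducer sees every user who mentioned one term on one day, so the unique-user count needs no cross-reducer coordination. A stable hash such as CRC32 keeps a term on the same shard across runs, which Python's built-in `hash()` would not guarantee for strings.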
Demo
What is Lexicon useful for?
▪ Tracking news
▪ Lexicon shows relative chatter surrounding current events
▪ Can understand which events are of interest to the Facebook audience
“tibet” “died” (Heath Ledger)
What is Lexicon useful for?
▪ Natural language trends
▪ Words and phrases constantly enter and exit the lexicon
▪ Track the popularity of terms that are used in everyday conversation
“lulz” “pwned”
What is Lexicon useful for?
▪ Understanding the Facebook audience
▪ Lexicon trends can yield insights into Facebook demographics, user attitudes towards Facebook products, and how the products are used
“the add”
What is Lexicon useful for?
▪ Brand Mindshare
▪ Brands and commercial products are mentioned in Wall conversations, just as in face-to-face conversations
“verizon” “juno”
What is Lexicon useful for?
▪ Categories that are social in nature yield the strongest signal
▪ Entertainment, Mobile, Automotive, QSR (quick-service restaurants), etc.
“honda”, “toyota”
What is Lexicon useful for?
▪ Measuring the success of sponsored gift campaigns on Facebook
▪ Sponsored gifts: images you can send to friends along with a Wall post
“coors”
Challenges
▪ Term disambiguation
▪ Words are used in a variety of contexts
▪ E.g. my cousin Wendy’s birthday vs. Wendy’s hamburgers
▪ Tracking each different context automatically with machine learning techniques is difficult
▪ Language classifiers, proper tokenization, and smart cleaning of the data can get us part way there
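To make the Wendy's example concrete, here is a toy heuristic that combines apostrophe-aware tokenization with a hand-picked list of context words. It is not the approach described in the talk; the cue list `FOOD_CUES` and the function names are hypothetical, and the sketch mainly shows why simple rules only get part of the way.

```python
import re

TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")  # keeps "wendy's" as one token

# Hypothetical context words suggesting the restaurant sense.
FOOD_CUES = {"burger", "burgers", "hamburger", "hamburgers", "fries", "lunch"}

def tokenize(text):
    """Lowercase the text, normalize curly apostrophes, and split into tokens."""
    return TOKEN_RE.findall(text.lower().replace("\u2019", "'"))

def looks_like_brand(text, term="wendy's", cues=FOOD_CUES):
    """Guess the brand sense when the term co-occurs with a food cue word."""
    tokens = tokenize(text)
    return term in tokens and any(tok in cues for tok in tokens)

brand = looks_like_brand("Grabbing Wendy\u2019s hamburgers after class")
person = looks_like_brand("It's my cousin Wendy's birthday!")
```

The first post is flagged as the brand and the second is not, but a post like “Wendy's making burgers tonight” would be misclassified, which is the kind of case that pushes toward learned models.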
Challenges
▪ Sentiment
▪ Is the mention of a term positive, negative, neutral, something else?
▪ Most challenging aspects: irony, ambiguous sentiment terms, complex grammar
▪ Many top companies use humans to rate a sizable percentage of posts
▪ Numerous Ph.D. candidates have quit graduate school over this problem
▪ Obviously a difficult task...
Challenges
▪ Sentiment
▪ The language on Facebook wall posts is characterized by:
▪ slang, lulz
▪ mispellings
▪ blunt sentences.
▪ superfluous punctuation!!!
▪ absent punctuation for example
▪ emoticons ^_^
▪ acronyms, omg
▪ a big freaking mess
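A tokenizer for this kind of text has to keep emoticons while discarding stray punctuation. The sketch below is one possible approach; the emoticon patterns cover only a few common forms and are an assumption, not a list from the talk.

```python
import re

TOKEN_RE = re.compile(
    r"""
    \^_\^                          # the ^_^ emoticon
  | [:;=][-']?[()DPp]              # a few western emoticons such as :) ;-) :D
  | [A-Za-z0-9]+(?:'[A-Za-z]+)?    # words, optionally with an apostrophe suffix
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Split a wall post into lowercase words and intact emoticons."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group(0)
        # Lowercase words only; emoticons such as ":D" are case-sensitive.
        tokens.append(tok.lower() if tok[0].isalnum() else tok)
    return tokens

tokens = tokenize("OMG that was sooo funny!!! ^_^")
```

Anything the pattern does not match, such as the run of exclamation marks, is simply skipped, so superfluous and absent punctuation produce the same token stream.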
Challenges
▪ Sentiment
▪ Blunt language without complex grammar means that irony and sarcasm aren’t big issues
▪ Synonym identification (figuring out that “hotttt” == “hot”), subjective/objective classification, and tokenization are more troublesome
▪ Something to keep in mind: strong prior probability of a subjective post being positive (80-90% as rated by humans)
▪ Walls are not blogs or movie reviews
▪ Theory: users don’t want to appear to be negative, and so avoid making overtly negative comments for the most part
▪ Sentiment classifier that guesses positive every time gives the least error
▪ Maybe sentiment isn’t as important for us...
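Two of the points above can be illustrated briefly: collapsing elongated spellings such as “hotttt”, and the error rate of an always-positive baseline. This is a sketch under assumptions: the tiny `VOCAB` set stands in for a real dictionary, and the 85/15 label split is made-up sample data chosen to sit inside the 80-90% range quoted above.

```python
import re

VOCAB = {"hot", "cool", "good"}  # hypothetical stand-in for a real word list

def squeeze(word, vocab=VOCAB):
    """Map an elongated spelling to a known word by shortening repeated letters."""
    for keep in (2, 1):
        # Shorten every run of 3+ identical characters to `keep` characters.
        candidate = re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, word)
        if candidate in vocab:
            return candidate
    return word  # no known word found, leave the spelling untouched

def always_positive_error(labels):
    """Error rate of a classifier that predicts "positive" for every post."""
    return sum(label != "positive" for label in labels) / len(labels)

labels = ["positive"] * 85 + ["negative"] * 15  # made-up sample data
baseline_error = always_positive_error(labels)
```

Trying two letters before one lets “cooool” resolve to “cool” while “hotttt” still resolves to “hot”. With an 85% positive prior the baseline is wrong on only 15% of posts, so a trained classifier has to beat that figure to be worth deploying.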
Future trends for text analytics
▪ Data visualization
▪ Graph structure/Diffusion analysis
▪ Cloud computing
Thanks!