Bringing Coherence to Chaos
Automated Analysis on Large-Scale Data
for the 2015 World Cup
Catherine Havasi, Luminoso
Text Analytics World, San Francisco, April 1, 2015
Big Data
Topgold, CC-by
Interest in Big Data
We got to do text analytics for the World Cup!
CC-by Berndt Meyer
Big Data – The Vs
• Volume – Scale of the data
• Velocity – Analysis of streaming data
• Variety – Data from different sources
• Veracity – How messy is the data
THINGS WE DON’T KNOW HOW TO HANDLE
But do any of us really have big data?
Will we have big text in the future?
4/1/15 · Confidential · www.luminoso.com
How much text is there now?
• Much of the text is thrown away when it’s received or stored without indexing.
• Most things stored in the back of virtual closets are untouched.
  – The problem here is formatting and data alignment.
• Often people working with big text think of social media.
  – Are blogs and reviews “social media” or just social networks?
“Public Social Media” == Twitter
This is from the last day of the World Cup.
[Chart, scale 0 to 100: Twitter, Open Facebook, Google Plus]
So what does Twitter look like?
• We have collectively sent 300 billion tweets.
  – That’s 30 TB of text on Twitter, ever.
• Now, we send 500 million tweets per day.
  – Some of this is bots. How many is up for much, much debate. (What is a bot?)
  – 40% of Twitter users don’t tweet.
• Twitter is super international.
  – Only 24% is from North America.
  – Many are from Asia (33%), esp. Japan.
That’s not BIG data
Source: XKCD
Twitter can be high volume
Record: 143,199 tweets per second.
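A quick back-of-the-envelope check of the figures on these slides, as a rough sketch. It assumes decimal terabytes (1 TB = 10^12 bytes) and tweets spread evenly across the day; the variable names are illustrative only.

```python
# Figures quoted on the slides.
TOTAL_TWEETS = 300 * 10**9      # 300 billion tweets, ever
TOTAL_BYTES = 30 * 10**12       # 30 TB of text (assumption: 1 TB = 10**12 bytes)
TWEETS_PER_DAY = 500 * 10**6    # 500 million tweets per day
RECORD_PER_SECOND = 143_199     # record burst rate

SECONDS_PER_DAY = 24 * 60 * 60

# Average size of a tweet implied by the archive totals.
bytes_per_tweet = TOTAL_BYTES / TOTAL_TWEETS

# Average rate if tweets were spread evenly over the day (an assumption).
avg_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the record burst sits above that average rate.
spike_ratio = RECORD_PER_SECOND / avg_per_second

print(f"{bytes_per_tweet:.0f} bytes per tweet on average")
print(f"{avg_per_second:,.0f} tweets per second on an average day")
print(f"record burst is about {spike_ratio:.0f}x the average rate")
```

Under these assumptions the whole archive is modest in volume, while the record burst is well over an order of magnitude above the daily average, which is where the "spiky" problem comes from.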
And Spiky. And repetitive.
143,199 SEMANTICALLY IDENTICAL tweets per second.
This can be a problem during the World Cup
And of course, spammy.
But what is spam?
Twitter is getting crowded
James Clifford, CC-by
Why not just count words?
Language is Creative
It was really stuffy.
Smelled really musty. Reminds me of a dusty closet.
Was like a wet dog.
It was like it had been shut away for a long time.
Smells like an old house.
Really stale.
It smelled terrible.
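A minimal sketch of why counting words falls short on the comments above. All seven describe the same thing, yet they share almost no vocabulary. The stopword list is a small hand-picked assumption for this example, not a standard list.

```python
import re
from collections import Counter

# The seven comments from the slide.
comments = [
    "It was really stuffy.",
    "Smelled really musty. Reminds me of a dusty closet.",
    "Was like a wet dog.",
    "It was like it had been shut away for a long time.",
    "Smells like an old house.",
    "Really stale.",
    "It smelled terrible.",
]

# Assumption: a tiny illustrative list of function and filler words.
STOPWORDS = {"it", "was", "a", "an", "of", "me", "for", "had", "been",
             "like", "really"}

def content_words(text):
    """Return the set of lowercase non-stopword tokens in a comment."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS}

# Document frequency: in how many comments does each content word appear?
doc_freq = Counter(w for c in comments for w in content_words(c))
top_word, top_count = doc_freq.most_common(1)[0]

print(f"most widely shared word: {top_word!r}, "
      f"in {top_count} of {len(comments)} comments")
```

No single content word covers more than a small fraction of the comments, so a keyword count would split one complaint into many unrelated-looking ones.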
MIT Media Lab
E14-474 - Digital Intuition
conceptnet5.media.mit.edu
Common Sense Computing
To understand a new word or concept, we first compare it to other concepts we already know. We then make analogies to those other concepts.
How do we understand concepts?
Digital Intuition
• General knowledge is a great starting point.
• It probably doesn’t include everything the computer should know about your domain.
• Luminoso’s approach:
  – Start with ConceptNet
  – Modify it based on the way words are used in your domain
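The idea of starting from general knowledge and adjusting it to a domain can be illustrated with a toy sketch. This is not Luminoso's method or the ConceptNet API; it is a simple stand-in technique that nudges a word's general-purpose vector toward the words it keeps company with in domain text. The 2-D vectors, the `adapt` function, and the `alpha` weight are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def adapt(general, domain_neighbors, alpha=0.5):
    """Blend each word's general vector with the mean of its domain neighbors.

    Words with no observed domain neighbors keep their general vector.
    """
    adapted = {}
    for word, vec in general.items():
        neighbors = [general[n] for n in domain_neighbors.get(word, [])
                     if n in general]
        if not neighbors:
            adapted[word] = list(vec)
            continue
        mean = [sum(col) / len(neighbors) for col in zip(*neighbors)]
        adapted[word] = [(1 - alpha) * g + alpha * m
                         for g, m in zip(vec, mean)]
    return adapted

# Hypothetical general-purpose vectors: "bite" sits near "food".
general = {
    "bite": [1.0, 0.0],
    "food": [0.9, 0.1],
    "foul": [0.0, 1.0],
    "suarez": [0.1, 0.9],
}
# Hypothetical domain usage: in World Cup tweets, "bite" appears
# alongside "foul" and "suarez".
domain_neighbors = {"bite": ["foul", "suarez"]}

adapted = adapt(general, domain_neighbors)
before = cosine(general["bite"], general["foul"])
after = cosine(adapted["bite"], adapted["foul"])
print(f"bite~foul similarity: {before:.2f} before, {after:.2f} after")
```

In this toy setup "bite" moves toward "foul" once domain usage is taken into account, while words with no domain evidence stay where general knowledge put them.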
Luminoso
Sony’s One Stadium
The One Stadium Use Case
LUMINOSO
Removes duplicates
Emphasizes meaningful discussion
Eliminates spam
Clusters conversation
Analyzing Twitter
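As an illustration of the duplicate-removal step only, here is a minimal sketch that collapses retweets and trivially varied copies by normalizing the text before comparing. It is an assumption about one reasonable approach, not a description of how Luminoso does it; the function names and sample tweets are made up.

```python
import re

def normalize(tweet):
    """Reduce a tweet to a canonical form for duplicate detection."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)      # drop links
    text = re.sub(r"^\s*rt\s+@\w+:?\s*", "", text)  # drop a leading "RT @user:"
    text = re.sub(r"[^a-z0-9#@\s]", " ", text)     # drop punctuation
    return " ".join(text.split())                  # collapse whitespace

def dedupe(tweets):
    """Keep the first tweet seen for each normalized form, in stream order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

# Made-up sample stream.
stream = [
    "GOOOAL!!! #USA http://t.co/abc",
    "RT @fan1: GOOOAL!!! #USA",
    "What a save by Howard",
]
unique = dedupe(stream)
print(unique)
```

Exact matching on a normalized form only catches near-verbatim copies. Tweets that say the same thing in different words need a semantic comparison instead.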
LUMINOSO ONE STADIUM LIVE
Powering One Stadium Live
[Chart: Correlation Score (0% to 80%) over match time (PRE GAME, 0', 15', 30', 45', HALF, 55', 75', 90', POST GAME). Series: #IBelieveWeWillWin, Positive Sentiment, Negative Sentiment, Ghana & Portugal, US Defense.]
Pregame conversation was driven by usage of the #IBelieveWeWillWin hashtag, as users made predictions about the outcome of the game.
Midway through the first half #IBelieveWeWillWin discussion diminished rapidly, as criticism of the American defense increased. Michael Bradley and Jermaine Jones were among the most criticized players on the American team.
Negative sentiment surged as users reacted to Germany's goal. Conversation began to drift as discussion about the Ghana/Portugal game and US advancement scenarios increased.
Discussion about advancing in the tournament, along with the use of the #IBelieveWeWillWin hashtag resurfaced after the Ghana & Portugal game came to a close.
What were fans talking about on social media?
Topical messages vs. outliers
We cluster incoming messages dynamically to surface topics as they appear in the stream.
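One simple way to cluster a stream one message at a time is "leader" clustering: each message joins the most similar existing cluster if it is similar enough, and otherwise starts a new one. The sketch below uses word-overlap (Jaccard) similarity purely for illustration. The talk does not say which algorithm or similarity measure the real system uses, and the class name, threshold, and messages are hypothetical.

```python
import re

def tokens(message):
    """Lowercase word and hashtag tokens of a message, as a set."""
    return set(re.findall(r"[#\w']+", message.lower()))

def jaccard(a, b):
    """Share of tokens two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class StreamClusterer:
    """Assign each incoming message to the closest cluster leader."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.clusters = []  # each: {"leader": token set, "members": [messages]}

    def add(self, message):
        toks = tokens(message)
        best, best_sim = None, 0.0
        for i, cluster in enumerate(self.clusters):
            sim = jaccard(toks, cluster["leader"])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.clusters[best]["members"].append(message)
            return best
        # Nothing close enough: this message starts a new cluster.
        self.clusters.append({"leader": toks, "members": [message]})
        return len(self.clusters) - 1

# Made-up sample stream.
clusterer = StreamClusterer()
labels = [clusterer.add(m) for m in [
    "Suarez bite on Chiellini",
    "Did Suarez really bite Chiellini",
    "Balotelli yellow card",
    "Yellow card for Balotelli",
    "I love pizza",
]]
topics = [c for c in clusterer.clusters if len(c["members"]) > 1]
outliers = [c["members"][0] for c in clusterer.clusters
            if len(c["members"]) == 1]
print(labels, outliers)
```

Clusters that attract several messages surface as topics, while messages left alone in their own cluster are the outliers.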
[Chart: Italy vs. Uruguay. Correlation Score (0% to 100%) over match time. Series: Positive Sentiment, Negative Sentiment, Marchisio, Balotelli, Chiellini, Suarez.]
22': Users react to the yellow card given to Italy's Balotelli.
59': Marchisio conversation spikes as he is carded in the 59th minute. Italian fans discuss the possibility of losing Balotelli and Marchisio for the subsequent match.
78': Discussion about Chiellini and Suarez increases significantly after the Suarez bite. Negative sentiment spikes, driven by disgusted and disappointed fans.
What were fans talking about on social media?
[Figure: keywords: Uruguay, #URU, Suarez, bite, Chiellini, Italy, #ITA, foul]
Birdfeeder: Listening dynamically
Because words associated with a topic can change, we learn to automatically add new search keywords.
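A rough sketch of how automatic keyword expansion could work: start from seed keywords, look at the messages they match, and promote terms that co-occur often enough. This illustrates the general idea only, not Birdfeeder's actual logic. The function name, stopword list, threshold, and messages are assumptions.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list.
STOPWORDS = {"on", "the", "a", "of", "to", "and"}

def tokens(message):
    """Lowercase word and hashtag tokens of a message, as a set."""
    return set(re.findall(r"[#\w']+", message.lower()))

def expand_keywords(keywords, messages, min_count=2):
    """Add terms seen in at least `min_count` messages matching a keyword."""
    keywords = set(keywords)
    matched = [toks for toks in map(tokens, messages) if toks & keywords]
    counts = Counter(t for toks in matched
                     for t in toks - keywords - STOPWORDS)
    return keywords | {t for t, n in counts.items() if n >= min_count}

# Made-up sample messages.
messages = [
    "Suarez bite on Chiellini #URU",
    "Cannot believe the Suarez bite",
    "Chiellini shows the bite mark #ITA",
    "Lovely weather today",
]

first_pass = expand_keywords({"#uru", "#ita"}, messages)
second_pass = expand_keywords(first_pass, messages)
print(sorted(first_pass))
print(sorted(second_pass))
```

Each pass widens the net. Here the hashtags first pull in "bite" and "chiellini", and those in turn catch a tweet with no hashtag at all, which brings in "suarez". A real system would need a guard against drifting onto unrelated terms.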
[Chart: Correlation Score (0% to 100%) over match time (PRE GAME, 0', 15', 30', HALF, 55', 75', 90', EXTRA TIME). Series: #IBelieveWeWillWin, #IBelieveThatWeWillWin, Positive Sentiment, Negative Sentiment, Howard, USA, Altidore, Belgium.]
Prior to the start of the match, discussion about the US team significantly overshadowed conversation surrounding the Belgian team. The #IBelieveWeWillWin hashtag also peaked as fans expressed confidence in their team.
Discussion about substituting Jozy Altidore in spiked after Fabian Johnson suffered a hamstring injury in the 29th minute of play.
Conversation in the second half was driven almost entirely by users praising Tim Howard's performance. Positive sentiment spiked alongside discussion about the goaltender, as many described it as the best goaltending performance they'd ever seen.
What were fans talking about on social media?
Following any topic of discussion
Here is a screenshot of the system following discussion about Apple products.
More Thoughts on World Cup Data
• Text data, while big, both necessitates and allows you to use more complex techniques to attack it.
• Roughly, you have to collect ten tweets for each useful tweet.
Compass as a product
Questions? [email protected]