Bringing Coherence to Chaos - Text Analytics World - Catherine Havasi, Luminoso...

Bringing Coherence to Chaos

Automated Analysis on Large-Scale Data

for the 2014 World Cup

Catherine Havasi, Luminoso

Text Analytics World, San Francisco, April 1, 2015

Big Data

Topgold, CC-BY

Interest in Big Data

We got to do text analytics for the World Cup!

CC-BY Berndt Meyer

Big Data – The Vs

• Volume – Scale of the data

• Velocity – Analysis of streaming data

• Variety – Data from different sources

• Veracity – How messy is the data

THINGS WE DON’T KNOW HOW TO HANDLE

But do any of us really have big data?

Will we have big text in the future?

4/1/15 Confidential www.luminoso.com

How much text is there now?

• Much of the text is thrown away when it’s received or stored without indexing.

• Most things stored in the back of virtual closets are untouched.
  – The problem here is formatting and data alignment.

• Often people working with big text think of Social Media.
  – Are blogs and reviews “social media” or just social networks?

“Public Social Media” == Twitter

This is from the last day of the World Cup.

[Bar chart, axis 0 to 100: Twitter, Open Facebook, Google Plus]

So what does Twitter look like?

• We have collectively sent 300 billion tweets.
  – That’s 30 TB of text on Twitter, ever.

• Now, we send 500 million tweets per day.
  – Some of this is bots. How many is up for much, much debate. (What is a bot?)
  – 40% of Twitter users don’t tweet.

• Twitter is super international.
  – Only 24% is from North America.
  – Many are from Asia (33%), esp. Japan.

That’s not BIG data

Source: XKCD

Twitter can be high volume

Record: 143,199 tweets per second.
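A quick back-of-the-envelope check of the figures quoted on these slides, sketched in Python. The only assumption added here is that 1 TB means 10^12 bytes; every other number comes from the slides.

```python
# Figures quoted on the slides.
TOTAL_TWEETS = 300e9        # tweets sent, ever
TOTAL_TEXT_BYTES = 30e12    # 30 TB of text (assumes 1 TB = 10**12 bytes)
TWEETS_PER_DAY = 500e6      # current daily volume
RECORD_PER_SECOND = 143199  # record tweets per second

SECONDS_PER_DAY = 24 * 60 * 60

# Average size of one tweet's text.
bytes_per_tweet = TOTAL_TEXT_BYTES / TOTAL_TWEETS

# Average rate implied by the daily volume.
avg_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the record spike sits above the average rate.
spike_ratio = RECORD_PER_SECOND / avg_per_second

# Text volume per day at the average tweet size.
daily_text_gb = TWEETS_PER_DAY * bytes_per_tweet / 1e9

print(f"{bytes_per_tweet:.0f} bytes per tweet")
print(f"{avg_per_second:.0f} tweets per second on average")
print(f"record is about {spike_ratio:.1f}x the average rate")
print(f"{daily_text_gb:.0f} GB of text per day")
```

Under these figures a tweet averages about 100 bytes, the average rate is just under 5,800 tweets per second, and the record is roughly 25 times that average, which is what "spiky" means in practice.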

And Spiky. And repetitive.

143,199 SEMANTICALLY IDENTICAL tweets per second.

This can be a problem during the World Cup
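One crude way to collapse a spike of near-identical tweets is to normalize each tweet and count how many land on the same key. This is only a sketch: it catches surface variation (retweets, links, elongated words), not true semantic identity, and it is not the method described in this talk. The helper names `normalize` and `collapse` and the sample tweets are made up for illustration.

```python
import re
from collections import Counter


def normalize(tweet: str) -> str:
    """Reduce a tweet to a rough canonical form for duplicate detection."""
    text = tweet.lower().strip()
    text = re.sub(r"https?://\S+", " ", text)       # drop links
    text = re.sub(r"^rt\s+@\w+:?", " ", text)       # drop a leading retweet prefix
    text = re.sub(r"@\w+", " ", text)               # drop remaining mentions
    text = re.sub(r"[^\w#\s]", " ", text)           # drop punctuation, keep hashtags
    text = re.sub(r"([a-z])\1{2,}", r"\1", text)    # "goooal" -> "goal"
    return " ".join(text.split())                   # squeeze whitespace


def collapse(tweets):
    """Count how many tweets share each normalized form."""
    return Counter(normalize(t) for t in tweets)


# Made-up sample stream.
sample = [
    "GOOOOAL!!! #GER",
    "RT @fan1: goooal #GER",
    "Goal! #ger http://t.co/abc",
    "What a save by Howard",
]
groups = collapse(sample)
for text, count in groups.most_common():
    print(count, text)
```

Here the first three tweets collapse to the single key `goal #ger`, so a downstream step only has to analyze one representative plus a count.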

And of course, spammy.

But what is spam?

Twitter is getting crowded

James Clifford, CC-BY

Why not just count words?

   

Language is Creative

It was really stuffy.

Smelled really musty. Reminds me of a dusty closet.

Was like a wet dog.

It was like it had been shut away for a long time.

Smells like an old house.

Really stale.

It smelled terrible.
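Counting words over the responses above shows why plain counts miss the shared complaint. This is a small illustration, not part of the original talk; the list of descriptors is picked by hand.

```python
import re
from collections import Counter

# The responses quoted on the slide.
responses = [
    "It was really stuffy.",
    "Smelled really musty. Reminds me of a dusty closet.",
    "Was like a wet dog.",
    "It was like it had been shut away for a long time.",
    "Smells like an old house.",
    "Really stale.",
    "It smelled terrible.",
]

# Lowercase and split into alphabetic tokens.
tokens = [w for r in responses for w in re.findall(r"[a-z]+", r.lower())]
counts = Counter(tokens)

# Hand-picked words that actually carry the "bad smell" meaning.
descriptors = ["stuffy", "musty", "dusty", "wet", "old", "stale", "terrible"]

print("most frequent token:", counts.most_common(1))
print("descriptor counts:", {w: counts[w] for w in descriptors})
```

The most frequent token is "it" (4 occurrences), while every descriptor occurs exactly once. All seven responses describe the same problem, yet no content word repeats often enough for a count to reveal it.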

MIT Media Lab

E14-474 - Digital Intuition

conceptnet5.media.mit.edu

Common Sense Computing

How do we understand concepts?

To understand a new word or concept, we first compare it to other concepts we already know. We then make analogies to those other concepts.

Digital Intuition

• General knowledge is a great starting point

• It probably doesn’t include everything the computer should know about your domain

• Luminoso’s approach:
  – Start with ConceptNet
  – Modify it based on the way words are used in your domain
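The idea of starting from general knowledge and adjusting it to a domain can be illustrated with a toy example. This is a sketch under strong assumptions, not Luminoso's actual method and not real ConceptNet data: words are two-dimensional vectors with made-up values (first axis "food", second axis "football"), and adaptation is a simple weighted average controlled by a hypothetical parameter `alpha`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adapt(general, domain, alpha=0.7):
    """Blend general-purpose vectors with domain-derived ones.

    Words with no domain evidence keep their general vector.
    """
    adapted = {}
    for word, g in general.items():
        d = domain.get(word, g)
        adapted[word] = [(1 - alpha) * gi + alpha * di for gi, di in zip(g, d)]
    return adapted


# Made-up "general knowledge": bite is mostly about food.
general = {
    "bite": [0.9, 0.1],
    "sandwich": [1.0, 0.0],
    "foul": [0.0, 1.0],
}
# Made-up domain evidence: in World Cup tweets, bite is used like a foul.
domain = {"bite": [0.1, 0.9]}

adapted = adapt(general, domain)

print("general: bite~sandwich", round(cosine(general["bite"], general["sandwich"]), 2))
print("general: bite~foul    ", round(cosine(general["bite"], general["foul"]), 2))
print("adapted: bite~sandwich", round(cosine(adapted["bite"], adapted["sandwich"]), 2))
print("adapted: bite~foul    ", round(cosine(adapted["bite"], adapted["foul"]), 2))
```

In the general vectors "bite" sits next to "sandwich"; after blending in the domain evidence it sits closer to "foul", while words without domain evidence are unchanged.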

Luminoso

Sony’s One Stadium

The One Stadium Use Case

LUMINOSO

Removes duplicates

Emphasizes meaningful discussion

Eliminates spam

Clusters conversation

Analyzing Twitter

LUMINOSO ONE STADIUM LIVE

www.luminoso.com Luminoso Technologies Inc.

Powering One Stadium Live

[Chart: Correlation Score (0% to 80%) over match time (PRE GAME, 0', 15', 30', 45', HALF, 55', 75', 90', POST GAME). Series: #IBelieveWeWillWin, Positive Sentiment, Negative Sentiment, Ghana & Portugal, US Defense]

Pregame conversation was driven by usage of the #IBelieveWeWillWin hashtag, as users made predictions about the outcome of the game.

Midway through the first half #IBelieveWeWillWin discussion diminished rapidly, as criticism of the American defense increased. Michael Bradley and Jermaine Jones were among the most criticized players on the American team.

Negative sentiment surged as users reacted to Germany's goal. Conversation began to drift as discussion about the Ghana/Portugal game and US advancement scenarios increased.

Discussion about advancing in the tournament, along with the use of the #IBelieveWeWillWin hashtag, resurfaced after the Ghana & Portugal game came to a close.

What were fans talking about on social media?

Topical messages vs. outliers

We cluster incoming messages dynamically to surface topics as they appear in the stream.
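A minimal sketch of clustering a stream one message at a time, using single-pass "leader" clustering over bag-of-words vectors. This is a stand-in technique chosen for illustration, not the system described in the talk; the class name `StreamClusterer`, the similarity threshold, and the sample messages are all assumptions.

```python
import math
import re
from collections import Counter


def bag(text):
    """Bag-of-words vector: lowercase word and hashtag counts."""
    return Counter(re.findall(r"[#\w]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class StreamClusterer:
    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.centroids = []  # summed term counts per cluster
        self.members = []    # messages assigned to each cluster

    def add(self, text):
        """Assign one incoming message to a cluster and return its index."""
        vec = bag(text)
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # Nothing similar enough yet: start a new topic.
            self.centroids.append(Counter(vec))
            self.members.append([text])
            return len(self.centroids) - 1
        # Similar enough: join the cluster and update its centroid.
        self.centroids[best].update(vec)
        self.members[best].append(text)
        return best


# Made-up stream.
stream = [
    "Suarez bite on Chiellini",
    "did Suarez just bite Chiellini",
    "yellow card for Balotelli",
    "Balotelli yellow card already",
    "I need coffee",
]
clusterer = StreamClusterer()
labels = [clusterer.add(m) for m in stream]
outliers = [m[0] for m in clusterer.members if len(m) == 1]
print(labels)
print(outliers)
```

Clusters that keep growing correspond to topics; messages that never attract company (here, the coffee tweet) are the outliers.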

[Chart: Italy vs. Uruguay. Correlation Score (0% to 100%) over match time (15', 45', 60', 80'). Series: Italy, Uruguay, Positive Sentiment, Negative Sentiment, Marchisio, Balotelli, Chiellini, Suarez]

22': Users react to the yellow card given to Italy's Balotelli.

59': Marchisio conversation spikes as he is carded in the 59th minute. Italian fans discuss the possibility of losing Balotelli and Marchisio for the subsequent match.

78': Discussion about Chiellini and Suarez increases significantly after the Suarez bite. Negative sentiment spikes, driven by disgusted and disappointed fans.

What were fans talking about on social media?

[Diagram terms: Uruguay, #URU, Suarez, bite, Chiellini, Italy, #ITA, foul]

Birdfeeder: Listening dynamically

Because words associated with a topic can change, we learn to automatically add new search keywords.

[Chart: Correlation Score (0% to 100%) over match time (PRE GAME, 0', 15', 30', HALF, 55', 75', 90', EXTRA TIME). Series: #IBelieveWeWillWin, #IBelieveThatWeWillWin, Positive Sentiment, Negative Sentiment, Howard, USA, Altidore, Belgium]

Prior to the start of the match, discussion about the US team significantly overshadowed conversation surrounding the Belgian team. The #IBelieveWeWillWin hashtag also peaked as fans expressed confidence in their team.

Discussion about substituting Jozy Altidore in spiked after Fabian Johnson suffered a hamstring injury in the 29th minute of play.

Conversation in the second half was driven almost entirely by users praising Tim Howard's performance. Positive sentiment spiked alongside discussion about the goaltender, as many described it as the best goaltending performance they'd ever seen.

What were fans talking about on social media?

Following any topic of discussion

Here is a screenshot of the system following discussion about Apple products.

More Thoughts on World Cup Data

• Text data, while big, both necessitates and allows you to use more complex techniques to attack it.

• Roughly, you have to collect ten tweets for each useful tweet.

Compass as a product

Questions? havasi@luminoso.com
