Bringing Coherence to Chaos: Automated Analysis on Large-Scale Data for the 2015 World Cup. Catherine Havasi, Luminoso. Text Analytics World, San Francisco, April 1, 2015.

Page 1

Bringing Coherence to Chaos

Automated Analysis on Large-Scale Data for the 2015 World Cup

Catherine Havasi, Luminoso

Text Analytics World, San Francisco, April 1, 2015

Page 2

Big Data

Image credit: Topgold, CC-BY

Page 3

Interest in Big Data

Page 4

We got to do text analytics for the World Cup!

Image credit: Berndt Meyer, CC-BY

Page 5

Page 6

Big Data – The Vs

• Volume
  – Scale of the data
• Velocity
  – Analysis of streaming data
• Variety
  – Data from different sources
• Veracity
  – How messy is the data

Page 7

Big Data – The Vs

• Volume
  – Scale of the data
• Velocity
  – Analysis of streaming data
• Variety
  – Data from different sources
• Veracity
  – How messy is the data

THINGS WE DON’T KNOW HOW TO HANDLE

Page 8

But do any of us really have big data?

Page 9

Will we have big text in the future?

Page 10

How much text is there now?

• Much of the text is thrown away when it’s received or stored without indexing.
• Most things stored in the back of virtual closets are untouched.
  – The problem here is formatting and data alignment.
• Often people working with big text think of social media.
  – Are blogs and reviews “social media” or just social networks?

Page 11

“Public Social Media” == Twitter

This is from the last day of the World Cup.

[Bar chart, scale 0 to 100, comparing Twitter, open Facebook, and Google Plus]

Page 12

So what does Twitter look like?

• We have collectively sent 300 billion tweets.
  – That’s 30 TB of text on Twitter, ever.
• Now, we send 500 million tweets per day.
  – Some of this is bots. How many is up to much, much debate. (What is a bot?)
  – 40% of Twitter users don’t tweet.
• Twitter is super international.
  – Only 24% is from North America.
  – Many are from Asia (33%), esp. Japan.
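The totals on this slide imply a rough average tweet size and an average posting rate. The following is a back-of-envelope check that uses only the slide's figures and assumes decimal units (1 TB = 10^12 bytes); the variable names are illustrative.

```python
# Back-of-envelope check of the slide's figures (decimal units assumed).
TOTAL_TWEETS = 300e9        # "300 billion tweets"
TOTAL_BYTES = 30e12         # "30 TB of text", taking 1 TB = 10**12 bytes
TWEETS_PER_DAY = 500e6      # "500 million tweets per day"
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_tweet = TOTAL_BYTES / TOTAL_TWEETS
avg_tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
daily_text_gb = TWEETS_PER_DAY * bytes_per_tweet / 1e9

print(f"implied average tweet size: {bytes_per_tweet:.0f} bytes")
print(f"average rate: {avg_tweets_per_second:.0f} tweets/second")
print(f"new text per day: {daily_text_gb:.0f} GB")
```

Under these assumptions the figures work out to about 100 bytes per tweet, roughly 5,800 tweets per second on average, and about 50 GB of new text per day. The record of 143,199 tweets per second cited later in the deck is therefore a spike of roughly 25 times the average rate.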

Page 13

That’s not BIG data

Source: XKCD

Page 14

Twitter can be high volume

Record: 143,199 tweets per second.

Page 15

And Spiky. And repetitive.

143,199 SEMANTICALLY IDENTICAL tweets per second.

Page 16

This can be a problem during the World Cup

Page 17

And of course, spammy.

Page 18

But what is spam?

Page 19

Twitter is getting crowded

Image credit: James Clifford, CC-BY

Page 20

Why not just count words?

Page 21

Language is Creative

It was really stuffy.

Smelled really musty. Reminds me of a dusty closet.

Was like a wet dog.

It was like it had been shut away for a long time.

Smells like an old house.

Really stale.

It smelled terrible.
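One way to see why counting words falls short: the seven descriptions above express the same complaint yet share almost no content words. The following is a minimal sketch that measures word overlap between every pair; the stopword list is a small hand-picked assumption for this illustration, not a standard list.

```python
import re
from itertools import combinations

descriptions = [
    "It was really stuffy.",
    "Smelled really musty. Reminds me of a dusty closet.",
    "Was like a wet dog.",
    "It was like it had been shut away for a long time.",
    "Smells like an old house.",
    "Really stale.",
    "It smelled terrible.",
]

# Hypothetical, hand-picked stopword list for this illustration only.
STOPWORDS = {"it", "was", "a", "an", "of", "me", "for", "had", "been", "like", "really"}

def content_words(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

word_sets = [content_words(d) for d in descriptions]
pairs = list(combinations(range(len(descriptions)), 2))
overlapping = [(i, j) for i, j in pairs if jaccard(word_sets[i], word_sets[j]) > 0]

print(f"{len(overlapping)} of {len(pairs)} pairs share a content word")
```

Only one of the 21 pairs overlaps, through the word "smelled". Even "smells" and "smelled" fail to match without stemming, so a keyword counter would treat these responses as unrelated.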

Page 22

MIT Media Lab

E14-474 - Digital Intuition

conceptnet5.media.mit.edu

Page 23

Common Sense Computing

Page 24

How do we understand concepts?

To understand a new word or concept, we first compare it to other concepts we already know. We then make analogies to those other concepts.

Page 25

Digital Intuition

• General knowledge is a great starting point
• It probably doesn’t include everything the computer should know about your domain
• Luminoso’s approach:
  – Start with ConceptNet
  – Modify it based on the way words are used in your domain
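The idea of starting from general knowledge and adjusting it to a domain can be illustrated with a toy vector model. This is only a sketch under stated assumptions: the three-dimensional vectors are invented, no ConceptNet data is loaded, and the weighted average in `blend` is a stand-in chosen for simplicity, not a description of Luminoso's actual method.

```python
import math

def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def blend(general, domain, weight):
    """Weighted average of a general-knowledge vector and a domain vector."""
    g, d = normalize(general), normalize(domain)
    return normalize([(1 - weight) * a + weight * b for a, b in zip(g, d)])

# Toy vectors invented for illustration; axes are roughly (music, sports, food).
general = {
    "pitch": [0.9, 0.3, 0.0],
    "song": [1.0, 0.0, 0.0],
    "field": [0.0, 1.0, 0.0],
}
# Hypothetical vector for "pitch" as it might be used in soccer tweets.
domain_pitch = [0.0, 1.0, 0.0]

adapted_pitch = blend(general["pitch"], domain_pitch, weight=0.7)

# In the general space "pitch" sits closer to "song" than to "field".
print(cosine(general["pitch"], general["song"]) > cosine(general["pitch"], general["field"]))
# After blending in domain usage, "pitch" sits closer to "field".
print(cosine(adapted_pitch, general["field"]) > cosine(adapted_pitch, general["song"]))
```

Both comparisons print `True`: the same word moves toward its domain sense once domain usage is mixed in, while the general-purpose vectors for other words stay available as a starting point.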

Page 26

Luminoso

Page 27

Sony’s One Stadium

Page 28

The One Stadium Use Case

Page 29

Analyzing Twitter

LUMINOSO

• Removes duplicates
• Emphasizes meaningful discussion
• Eliminates spam
• Clusters conversation
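The duplicate-removal step named on this slide can be approximated, at its simplest, by normalizing each tweet and keeping the first tweet per normalized form. The sketch below assumes that approach; the function names and example tweets are made up, and it only catches surface-level repeats such as retweets and link variants, not tweets that say the same thing in different words.

```python
import re

def normalize_tweet(text):
    """Reduce a tweet to a canonical key: lowercase, no retweet prefix, URLs, mentions, or punctuation."""
    text = text.lower()
    text = re.sub(r"^rt\s+@\w+:?\s*", "", text)   # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)      # URLs
    text = re.sub(r"@\w+", "", text)              # mentions
    text = re.sub(r"[^\w#\s]", "", text)          # punctuation, keeping hashtags
    return " ".join(text.split())

def drop_duplicates(tweets):
    """Keep the first tweet seen for each normalized key, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize_tweet(tweet)
        if key and key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique

# Made-up example tweets.
tweets = [
    "GOOOAL!!! #USA",
    "RT @fan123: GOOOAL!!! #USA",
    "goooal #usa http://example.com/pic",
    "What a save by Howard",
]
print(drop_duplicates(tweets))
```

The first three tweets collapse to the same key, so only the first of them and the Howard tweet survive. Handling semantically identical tweets with different wording would need a similarity measure in a semantic space rather than string normalization.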

Page 30

Powering One Stadium Live

LUMINOSO ONE STADIUM LIVE

Page 31

What were fans talking about on social media?

[Chart: correlation score (0% to 80%) over match time (PRE GAME, 0', 15', 30', 45', HALF, 55', 75', 90', POST GAME). Series: #IBelieveWeWillWin, Positive Sentiment, Negative Sentiment, Ghana & Portugal, US Defense.]

Pregame conversation was driven by usage of the #IBelieveWeWillWin hashtag, as users made predictions about the outcome of the game.

Midway through the first half #IBelieveWeWillWin discussion diminished rapidly, as criticism of the American defense increased. Michael Bradley and Jermaine Jones were among the most criticized players on the American team.

Negative sentiment surged as users reacted to Germany's goal. Conversation began to drift as discussion about the Ghana/Portugal game and US advancement scenarios increased.

Discussion about advancing in the tournament, along with the use of the #IBelieveWeWillWin hashtag, resurfaced after the Ghana & Portugal game came to a close.

Page 32

Topical messages vs. outliers

We cluster incoming messages dynamically to surface topics as they appear in the stream.
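Dynamic clustering of a message stream can be sketched with a simple single-pass scheme: each incoming message joins the most similar existing cluster, or starts a new one if nothing is similar enough. This is an illustrative assumption, not the system described in the talk; the class name, the bag-of-words cosine similarity, the threshold value, and the example messages are all invented for the sketch.

```python
import re
from collections import Counter

def tokens(text):
    """Bag of lowercase word and hashtag counts."""
    return Counter(re.findall(r"[#\w]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class StreamClusterer:
    """Single-pass clustering: join the best cluster above a threshold, else start a new one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.clusters = []  # each cluster: {"centroid": Counter, "messages": [str, ...]}

    def add(self, message):
        vec = tokens(message)
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= self.threshold:
            best["centroid"].update(vec)      # accumulate word counts
            best["messages"].append(message)
        else:
            best = {"centroid": vec, "messages": [message]}
            self.clusters.append(best)
        return best

# Made-up example stream.
clusterer = StreamClusterer()
for msg in [
    "Suarez bite on Chiellini",
    "did Suarez just bite Chiellini",
    "Balotelli yellow card",
    "yellow card for Balotelli again",
    "my cat is asleep",
]:
    clusterer.add(msg)

sizes = [len(c["messages"]) for c in clusterer.clusters]
print(sizes)
```

In this toy stream the two Suarez messages and the two Balotelli messages each form a topic, while the unrelated message stays alone in its own cluster and can be treated as an outlier.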

Page 33

What were fans talking about on social media?

Italy vs. Uruguay

[Chart: correlation score (0% to 100%) over match time (marks at 15', 45', 60', 80'). Series: Positive Sentiment, Negative Sentiment, Marchisio, Balotelli, Chiellini, Suarez.]

22': Users react to the yellow card given to Italy's Balotelli.

59': Marchisio conversation spikes as he is carded in the 59th minute. Italian fans discuss the possibility of losing Balotelli and Marchisio for the subsequent match.

78': Discussion about Chiellini and Suarez increases significantly after the Suarez bite. Negative sentiment spikes, driven by disgusted and disappointed fans.

Page 34

Birdfeeder: Listening dynamically

Because words associated with a topic can change, we learn to automatically add new search keywords.

[Diagram keywords: Uruguay, #URU, Suarez, bite, Chiellini, Italy, #ITA, foul]
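Automatically adding search keywords can be sketched as a co-occurrence rule: among tweets that match the current keywords, promote any term that appears in a large enough share of them. The talk does not describe how Birdfeeder works internally, so everything here is an assumption for illustration: the function names, the stopword list, the `min_share` threshold, and the example tweets.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stopword list for this illustration only.
STOPWORDS = {"the", "a", "on", "just", "did", "what", "for", "is", "in", "that", "was"}

def terms(text):
    """Set of lowercase words and hashtags in a tweet, minus stopwords."""
    return {t for t in re.findall(r"[#\w]+", text.lower()) if t not in STOPWORDS}

def expand_keywords(keywords, tweets, min_share=0.5):
    """Return keywords plus any term found in at least min_share of the matching tweets."""
    keywords = set(keywords)
    matching = [terms(t) for t in tweets if terms(t) & keywords]
    if not matching:
        return keywords
    counts = Counter(term for tweet_terms in matching for term in tweet_terms)
    new = {term for term, n in counts.items() if n / len(matching) >= min_share}
    return keywords | new

# Made-up example tweets.
tweets = [
    "Suarez bite on Chiellini #URU #ITA",
    "did Suarez just bite Chiellini",
    "what a foul, that bite was on Chiellini #ITA",
    "Uruguay corner kick #URU",
    "great weather in Boston",
]
seed = {"#uru", "#ita"}
first_pass = expand_keywords(seed, tweets)
second_pass = expand_keywords(first_pass, tweets)
print(sorted(first_pass))
print(sorted(second_pass))
```

Starting from the two team hashtags, the first pass adds "bite" and "chiellini". With those in place, the second pass also matches the tweet that carries no hashtag and picks up "suarez". A real system would need safeguards against drifting onto unrelated terms, which this sketch does not attempt.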

Page 35

What were fans talking about on social media?

[Chart: correlation score (0% to 100%) over match time (PRE GAME, 0', 15', 30', HALF, 55', 75', 90', EXTRA TIME). Series: #IBelieveWeWillWin, Positive Sentiment, Negative Sentiment, Howard, USA, Altidore. Annotations: #IBelieveThatWeWillWin, Belgium.]

Prior to the start of the match, discussion about the US team significantly overshadowed conversation surrounding the Belgian team. The #IBelieveWeWillWin hashtag also peaked as fans expressed confidence in their team.

Discussion about substituting Jozy Altidore in spiked after Fabian Johnson suffered a hamstring injury in the 29th minute of play.

Conversation in the second half was driven almost entirely by users praising Tim Howard's performance. Positive sentiment spiked alongside discussion about the goaltender, as many described it as the best goaltending performance they'd ever seen.

Page 36

Following any topic of discussion

Here is a screenshot of the system following discussion about Apple products.

Page 37

More Thoughts on World Cup Data

• Text data, while big, both necessitates and allows you to use more complex techniques to attack it.
• Roughly, you have to collect ten tweets for each useful tweet.

Page 38

Compass as a product

Page 39

Questions? [email protected]