GDA Presentation - Quito Ecuador - 20 Sept 2013
Mining Social Web Data Like a Pro: Four Steps to Success
Presented by Matthew A. Russell
"Data Journalism and Interactivity" - GDA Seminar
Quito, Ecuador - 20 September 2013
1
Hello
2
Trained as a Computer Scientist
CTO @ Digital Reasoning Systems
Data Mining, Machine Learning
Principal @ Zaffra
Boutique Consulting
Author @ O'Reilly Media
5 published books on technology
3
Transform Curiosity Into Insight
4
An open source project
http://bit.ly/MiningTheSocialWeb2E
Inherently accessible
Virtual machine & IPython Notebook UX
Turn-key code templates for bootstrapping data science experiments
Think of the book as "premium" support for the OSS project
Why not Spanish?
5
Investigative Journalist
6
"A person whose profession it is to
discover the truth and to identify lapses from
it in whatever media may be available."
Data Science
7
Data => Actionable Information
Highly interdisciplinary
Nascent
Necessary
http://wikipedia.org/wiki/Data_science
Digital Signal Explosion
A model for the world: signals and sinks
Growth in data exhaust is accelerating
Digital fingerprints
Software is eating the world
Data mining opportunities galore...
8
Digital Data Stats
100 terabytes of data uploaded daily to Facebook.
Brands and organizations on Facebook receive 34,722 Likes every minute of the day.
According to Twitter’s own research in early 2012, it sees roughly 175 million tweets every day
30 billion pieces of content shared on Facebook every month.
Data production will be 44 times greater in 2020 than it was in 2009
According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years.
9
See http://wikibon.org/blog/big-data-statistics
Social Media Is All the Rage
World population: ~7B people
Facebook: 1.15B users
Twitter: 500M users
Google+: 343M users
LinkedIn: 238M users
~200M+ blogs (conservative estimate)
10
But Why Is It All the Rage?
It satisfies fundamental human desires
We want to be heard
We want to satisfy our curiosity
We want it easy
We want it now
11
12
Diagram: Roberto, Mercedes, Jorge, Ana, Nina
Social Network Mechanics
Interest Graph Mechanics
13
Diagram: Roberto, Mercedes, Jorge, Ana, Nina, with interests U2 and Juan Luis Guerra
A (Social) Interest Graph
14
Diagram: Roberto, Mercedes, Jorge, Ana, Nina, with interests U2 and Juan Luis Guerra
A (Political) Interest Graph
15
Diagram: Roberto, Mercedes, Jorge, Ana, Nina, with interests Johnny Araya and Rodolfo Hernández
Social Media Dimensions
16
Facebook:
Account Types: People & Pages
Mutual Connections
"Likes"
"Shares"
"Comments"
Extensive Privacy Controls
Twitter:
Account Types: "Anything"
"Following" Relationships
Favorites
Retweets
Replies
(Almost) No Privacy Controls
Why Does This Matter?
"If you can measure it, you can improve it"
Modeling Behavior
Predictive Analysis
Recommending Content
Swaying political situations might just be the ultimate value proposition for social media
17
Social Media Analysis Framework
Four Steps To Success
Aspire
Acquire
Analyze
Summarize
Let's step through a trivial example...
18
(1) Aspire
Let's frame a trivial hypothesis to illustrate the four steps...
Frame a hypothesis about some real world phenomenon
For example: "Johnny Araya is a more popular candidate than Rodolfo Hernández"
Let's use social media as a basis of investigation
19
(2) Acquire
Collect the data that you need to test the hypothesis
How?
Use Facebook and Twitter APIs to harvest data about each candidate
Go after low-hanging fruit before attempting something more complex
You don't even need to write code to do this (yet)
20
They're both on Facebook
21
http://facebook.com/ElDoctor2014
http://facebook.com/JohnnyArayaMonge
They're both on Twitter
22
@Johnny_Araya
@ElDoctor2014
(3) Analyze
Count, Filter, and Rank the Data
Johnny Araya:
~50k Facebook likes
~14k Twitter followers
Rodolfo Hernández:
~37k Facebook likes
745 Twitter followers
Johnny Araya is indeed more popular in social media
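The count-and-rank step above can be sketched in a few lines of Python. This is a minimal illustration using the approximate figures from this slide; the dictionary layout and the `rank_by` helper are assumptions made for the example, not part of the original analysis.

```python
# Approximate audience sizes quoted on the slide (September 2013).
popularity = {
    "Johnny Araya": {"facebook_likes": 50000, "twitter_followers": 14000},
    "Rodolfo Hernández": {"facebook_likes": 37000, "twitter_followers": 745},
}


def rank_by(metric, data):
    """Return candidate names sorted from largest to smallest value of metric.

    Hypothetical helper written for this example.
    """
    return sorted(data, key=lambda name: data[name][metric], reverse=True)


for metric in ("facebook_likes", "twitter_followers"):
    ranking = rank_by(metric, popularity)
    print(metric, "->", ", ".join(ranking))
```

Ranking on each metric separately keeps the comparison honest: a candidate can lead on one network and trail on another.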
23
(4) Summarize
Present the data in a concise and easily understood manner
Charts
Tables
Simple visualizations
Some examples...
24
25
Social Media Popularity: Araya vs Hernández
Pie charts: Twitter Popularity and Facebook Popularity, each split between Araya % and Hernández %
26
Social Media Popularity: Araya vs Hernández
Bar chart, linear axis from 0 to 60,000: Twitter followers and Facebook fans for Araya and Hernández
27
Social Media Popularity: Araya vs Hernández
Bar chart, logarithmic axis from 1 to 100,000: Twitter followers and Facebook fans for Araya and Hernández
Twitter Popularity
28
Facebook Popularity
29
Facebook Likes for Costa Rican Presidential Candidates
JohnnyArayaMonge: 35%
o0oguevaraguth: 17%
luisguillermosolisr: 3%
villaltaJM: 19%
ElDoctor2014: 26%
Recall the previous hypothesis:
"Johnny Araya is a more popular candidate than Rodolfo Hernández"
What do we know now that we didn't before?
The current state of each candidate's Twitter and Facebook popularity
Let's explore a slightly more complex hypothesis...
30
Reflect and Refine...
(1) Aspire
Redefine the hypothesis:
For example: "Johnny Araya has a more effective social media strategy than Rodolfo Hernández"
Presumably because of his superior social media status at the moment
31
(2) Acquire
Collect the data that you need to test the hypothesis
How? Use APIs to harvest data about each candidate
Let's consider any Facebook posts for 2013
32
33
import json
import requests

for candidate in ['JohnnyArayaMonge', 'ElDoctor2014']:
    # Get the data (format the whole URL so {0} is actually replaced)
    url = ('https://graph.facebook.com/{0}?'
           'fields=posts.limit(500)&access_token=XXX').format(candidate)
    content = requests.get(url).json()

    # Save the data
    with open(candidate + '.json', 'w') as f:
        f.write(json.dumps(content))
Python Source Code
(3) Analyze
34
Count, Filter, and Rank the Data
Some more Python source code to crunch the numbers
Extract Facebook likes and shares this year
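One way the counting and filtering could look in Python is sketched below. The post structure (`created_time`, `likes`, `shares`, and `type` keys) is a simplified assumption for illustration and may not match what the Graph API actually returns; `summarize_posts` and the sample posts are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical, simplified posts; real API responses may be shaped differently.
sample_posts = [
    {"created_time": "2013-03-15T00:40:21+0000", "likes": 120, "shares": 10, "type": "photo"},
    {"created_time": "2013-06-02T14:05:00+0000", "likes": 80, "shares": 4, "type": "link"},
    {"created_time": "2012-12-24T09:30:00+0000", "likes": 50, "shares": 1, "type": "photo"},
]


def summarize_posts(posts, since):
    """Count, filter, and rank posts created on or after `since`."""
    recent = [
        p for p in posts
        # Compare only the date part (first 10 characters, YYYY-MM-DD).
        if datetime.strptime(p["created_time"][:10], "%Y-%m-%d") >= since
    ]
    total_likes = sum(p["likes"] for p in recent)
    total_shares = sum(p["shares"] for p in recent)
    return {
        "num_posts": len(recent),
        "num_prior_posts": len(posts) - len(recent),
        "total_likes": total_likes,
        "total_shares": total_shares,
        "avg_likes": total_likes / len(recent) if recent else 0.0,
        # Post types ranked from most to least frequent.
        "post_types": Counter(p["type"] for p in recent).most_common(),
    }


summary = summarize_posts(sample_posts, datetime(2013, 1, 1))
print(summary)
```

Keeping the date filter separate from the counting makes it easy to rerun the same summary for a different time window.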
Facebook Vitals
35
ElDoctor2014
Total Likes  37495
Num Posts since Jan 1, 2013 (of 500 possible)  436
Total Post Likes  155473
Total Post Shares  9684
Oldest Post in Batch  2013-03-15T00:40:21+0000
Num posts prior to Jan 1, 2013  0
Avg likes/post  356.589449541 (0.951032003044%)
Avg shares/post  22.2110091743 (0.059237256099%)
Post Types  [(u'photo', 286), (u'link', 77), (u'status', 40), (u'video', 32), (u'swf', 1)]

JohnnyArayaMonge
Total Likes  50301
Num Posts since Jan 1, 2013 (of 500 possible)  205
Total Post Likes  176161
Total Post Shares  7542
Oldest Post in Batch  2013-01-01T07:18:43+0000
Num posts prior to Jan 1, 2013  190
Avg likes/post  859.32195122 (1.70835957778%)
Avg shares/post  36.7902439024 (0.0731401838978%)
Post Types  [(u'photo', 149), (u'status', 38), (u'link', 13), (u'video', 5)]
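The averages and percentages in these vitals can be reproduced from the totals. The sketch below assumes the percentage in parentheses is the per-post average expressed as a share of the page's total likes, which is consistent with the reported figures; the `per_post` helper is hypothetical.

```python
def per_post(total, num_posts, page_likes):
    """Return (average per post, that average as a percentage of page likes)."""
    average = total / num_posts
    return average, 100.0 * average / page_likes


# Totals reported in the vitals for each page.
vitals = {
    "ElDoctor2014": {
        "page_likes": 37495, "posts": 436,
        "post_likes": 155473, "post_shares": 9684,
    },
    "JohnnyArayaMonge": {
        "page_likes": 50301, "posts": 205,
        "post_likes": 176161, "post_shares": 7542,
    },
}

for page, v in vitals.items():
    avg_likes, pct_likes = per_post(v["post_likes"], v["posts"], v["page_likes"])
    avg_shares, pct_shares = per_post(v["post_shares"], v["posts"], v["page_likes"])
    print(page)
    print("  avg likes/post  %.2f (%.3f%%)" % (avg_likes, pct_likes))
    print("  avg shares/post %.2f (%.3f%%)" % (avg_shares, pct_shares))
```

Normalizing by page likes gives a rough engagement rate, which makes pages of different sizes easier to compare than raw averages.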
(4) Summarize
Present the data in a concise and easily understood manner
Like a table...
36
37
Metric                        Araya         Hernández
Total Likes                   50,301        37,495
Posts since 1 Jan 13          205           436
Num Prior Posts               190+          0
Earliest Post                 1 Jan 2013    15 March 2013
Post Likes since 1 Jan 13     176,161       155,473
Post Shares since 1 Jan 13    7,542         9,684
Avg Likes per Post            859           356
Avg Shares per Post           36            22
38
Recall the hypothesis:
"Johnny Araya has a more effective social media strategy than Rodolfo Hernández because he has more Facebook and Twitter popularity"
What do we know now?
Hernández has Facebook vitals that are quite competitive with Araya
However, Hernández only joined Facebook ~6 months ago!
It would appear that Hernández has the more effective strategy
What is he doing to rise in popularity so quickly?
39
Reflect and Refine...
40
Comparison of Facebook Content
Other Candidates
41
Johnny Araya FB Posts
42
Rodolfo Hernández FB Posts
43
44
Past ~2 Months on Facebook
45
Candidate                             Aug 2013 FB Likes   Sept 2013 FB Likes   % Change
Johnny Araya                          50,301              53,809               6.97%
Otto Guevara Guth                     24,146              27,675               14.62%
José María Villalta Florez-Estrada    27,262              35,169               29.00%
Dr. Rodolfo Hernández                 37,495              38,298               2.14%
Luis Guillermo Solís Rivera           5,334               6,763                26.79%
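The percentage changes in the table above follow from the two snapshots of page likes. A minimal sketch of that arithmetic, with a hypothetical `percent_change` helper, ranks the candidates by growth:

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage."""
    return 100.0 * (new - old) / old


# Facebook page likes (August 2013, September 2013) from the table.
facebook_likes = {
    "Johnny Araya": (50301, 53809),
    "Otto Guevara Guth": (24146, 27675),
    "José María Villalta Florez-Estrada": (27262, 35169),
    "Dr. Rodolfo Hernández": (37495, 38298),
    "Luis Guillermo Solís Rivera": (5334, 6763),
}

growth = {
    name: percent_change(old, new)
    for name, (old, new) in facebook_likes.items()
}

# Rank candidates from fastest to slowest growth.
for name in sorted(growth, key=growth.get, reverse=True):
    print("%-36s %6.2f%%" % (name, growth[name]))
```

Sorting by growth rather than by absolute likes surfaces candidates who are gaining ground quickly from a smaller base.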
Past ~3 Months on Twitter
46
Candidate                             Aug 2013   Sept 2013   % Change
Johnny Araya                          14,573     15,506      6.40%
Otto Guevara Guth                     114        159         39.47%
José María Villalta Florez-Estrada    8,160      8,990       10.17%
Dr. Rodolfo Hernández                 745        858         15.17%
Luis Guillermo Solís Rivera           1,192      1,487       24.75%
Facebook and Twitter Compared
47
Candidate                             % FB Change   % Twitter Change
Johnny Araya                          6.97%         6.40%
Otto Guevara Guth                     14.62%        39.47%
José María Villalta Florez-Estrada    29.00%        10.17%
Dr. Rodolfo Hernández                 2.14%         15.17%
Luis Guillermo Solís Rivera           26.79%        24.75%
Your Imagination Is the Only Limit
Analyze the comments that people are leaving on Facebook pages
Try to ascertain common Facebook fans or Twitter followers amongst candidates
Deduce demographics from social media by synthesizing public data
Theorize about potential "reach" or "influence" using social media
Analyze data in realtime
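As one illustration of the ideas above, finding followers that two candidates have in common reduces to a set intersection. The follower IDs below are made up for the example; real IDs would have to be collected from the Twitter API first.

```python
# Hypothetical follower IDs, invented for this sketch.
followers = {
    "Johnny_Araya": {101, 102, 103, 104, 105},
    "ElDoctor2014": {104, 105, 106},
}

common = followers["Johnny_Araya"] & followers["ElDoctor2014"]
either = followers["Johnny_Araya"] | followers["ElDoctor2014"]

# Jaccard similarity: shared followers as a fraction of all distinct followers.
jaccard = len(common) / len(either)

print("Common followers:", sorted(common))
print("Jaccard similarity: %.2f" % jaccard)
```

A similarity near 0 suggests two largely separate audiences, while a value near 1 suggests the candidates are competing for the same followers.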
48
Thinking about Reach
49
Think about "liking" and "following" as opt-ins to feeds
Remember: Interest Graphs
Arriving at effective metrics is trickier than it initially seems
Potential Twitter Influence
50
Metric                 Araya    Hernández
Followers              ~14k     ~750
Theoretical Reach      ~40M     ~550k
Reach (10)             490      673
Reach (100)            289      702
Reach (1000)           2782     X
Reach (10,000)         2832     X
"Suspect" Followers    3,246    94
See also http://wp.me/p3QiJd-2a
Potential Influence
51
Who are Candidates Following?
52
What are Candidates Tweeting?
53
Realtime Analysis
54
Monitor Twitter's firehose for realtime data using filters such as #Syria
Keep in mind the sheer volume of data can be considerable
Analysis at MiningTheSocialWeb.com
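The filtering idea can be sketched without touching the network. The example below does not connect to Twitter; it applies a hashtag filter to an in-memory list of made-up tweets, and the `filter_stream` helper and the `text` key are assumptions for illustration. A generator is used so that tweets are processed one at a time, which matters when the volume is large.

```python
def filter_stream(tweets, hashtag):
    """Yield tweets whose text contains the hashtag (case-insensitive)."""
    needle = hashtag.lower()
    for tweet in tweets:
        words = tweet["text"].lower().split()
        # Strip trailing punctuation so "#syria:" still matches "#syria".
        if any(w.strip(".,!?:;") == needle for w in words):
            yield tweet


# Made-up tweets standing in for a live stream.
sample_stream = [
    {"text": "Breaking news from #Syria today"},
    {"text": "Lunch in Quito"},
    {"text": "Updates on #syria: talks continue"},
    {"text": "#SyriaTalks is a different hashtag"},
]

matches = list(filter_stream(sample_stream, "#Syria"))
for tweet in matches:
    print(tweet["text"])
```

Matching whole words rather than substrings keeps unrelated hashtags such as #SyriaTalks out of the results.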
Mapping #Syria Tweets
55
See http://wp.me/p3QiJd-1t
Temporal Analysis on #Syria
56
Analyzing #Syria Tweet Entities
57
Closing Remarks
Software is the gift that keeps on giving
Code it up once, run it ad infinitum...
Code designed for one account will work for other accounts
Analysis is all about knowing what to count
Coding it up is just the dirty work
Start somewhere and then iteratively explore...then exploit
58
Aspire to Do Great Things
Predicting demographic data such as age or gender is possible for some languages
Time and space are fundamentals for grounding online discussions in reality.
Twitter is about as good as it gets for realtime topical analysis
Think of the world as signal producers and signal collectors
Monitoring breaking news events like #Syria
59
The Tip of the Iceberg
60
Stay in Touch
Website: http://MiningTheSocialWeb.com
Twitter: @ptwobrussell
FB: http://facebook.com/MiningTheSocialWeb
LinkedIn: http://linkedin.com/in/ptwobrussell
Email: [email protected]
61