Lessons learned from TwapperKeeper prototype.
We are losing our tweets!
An analysis, a prototype, lessons learned, and proposed third party solution to the problem
John O’Brien III (@jobrieniii)
http://www.linkedin.com/in/jobrieniii
Twitter “Primer”
Social network / micro-blogging site
Send / read 140-character messages
You can follow anyone, and they can follow you
Sent messages are delivered to all your followers
Sent messages are also publicly indexed and searchable
Permissions can be established to restrict delivery, but this is not the norm
Problem
As the usage of Twitter has exploded, Twitter’s ability to provide long-term access to tweets that mention key events (typically #hashtag’ed) has eroded
First, who cares?
Individuals
Bloggers
Conference Attendees / Leaders
Academia / “Web” Ecologists
Media Outlets
Companies
Government
So let’s dive into the problem...
[Diagram: a sent tweet flows both to followers and into search]
Search UI / API Constraints
Limited to keywords, #hashtags, or @mentions within the 140-char body of a tweet
100 tweets x 15 pages = 1500 per search term
For a given keyword, a tweet exists in search for “around 1.5 weeks but is dynamic and subject to shrink as the number of tweets per day continues to grow.”
– Twitter website
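The 1500-tweet ceiling above follows directly from the paging limits. A minimal sketch of harvesting one search term under those limits (Python for illustration; `fetch_page` is a hypothetical stand-in for an HTTP call to the old v1 /search endpoint, which served at most 100 results per page and 15 pages):

```python
# Paging cap of the old /search API: 100 tweets/page, 15 pages max.
RPP = 100       # results per page (API maximum)
MAX_PAGES = 15  # deepest page the API would serve

def harvest(fetch_page):
    """Collect every tweet reachable for one search term."""
    tweets = []
    for page in range(1, MAX_PAGES + 1):
        batch = fetch_page(page, RPP)
        tweets.extend(batch)
        if len(batch) < RPP:  # a short page means we've hit the end
            break
    return tweets

# With a busy hashtag every page comes back full, so the hard ceiling is:
assert RPP * MAX_PAGES == 1500
```

For a fast-moving hashtag, anything that scrolled past those 15 pages before the next sweep was simply unreachable.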
Hmmmm….
No other ‘in the cloud’ sites were found back in June, only client-side applications and ‘hacked’ custom scripts
RSS feeds were considered but initially dismissed because they typically require an end user client
Decision was to “build our own” and see if we can solve the problem
A little bit about my thoughts on the SDLC process…
“Minimally Viable” PROTOTYPE
**FOCUS ON LEARNING**
“Minimally Viable” Micro App
What if we could get ahead of the problem and store the data before Twitter “loses” it?
Functional Requirements
Ability for user to define #hashtags of importance
Create a background script that leverages the Twitter /search REST API to keep an eye on each hashtag and store data in a local database
**Sweep, grab, and record…**
Must be running at all times and publicly available
Technical Specs
Build on LAMP stack, put into the cloud, running 24/7/365
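The “sweep, grab, and record” loop can be sketched as follows (Python rather than the PHP actually used on the LAMP stack; `search_api` and `db.insert` are hypothetical stand-ins for the REST call and the local database, not TwapperKeeper’s real code):

```python
import time

def sweep_once(hashtags, search_api, db, since):
    """One pass: poll the search API for each tracked hashtag and store
    anything newer than the last tweet id recorded for that tag."""
    for tag in hashtags:
        for tweet in search_api(tag, since_id=since.get(tag, 0)):
            db.insert(tag, tweet)                             # record
            since[tag] = max(since.get(tag, 0), tweet["id"])  # advance cursor

def run(hashtags, search_api, db, interval=60):
    """Background archiver: sweep forever, pausing between passes."""
    since = {}  # highest tweet id seen per hashtag
    while True:
        sweep_once(hashtags, search_api, db, since)
        time.sleep(interval)  # wait before the next sweep
```

Keeping a per-hashtag cursor means each pass only pulls tweets newer than the last sweep, which is what lets a single always-on script stay ahead of the search window.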
“Minimally Viable” Micro App
[Architecture diagram: a PHP script on the internet queries the Twitter /search API for each #hashtag and writes the results into our database]
TwapperKeeper.com “BETA” was born on Saturday and released to the public on Sunday…
And we started to grow and get customer feedback…
And we lived through a key world event…
http://mashable.com/2009/09/16/white-house-records/
So what did we learn?
We need to be whitelisted
People often don’t start the archiving until after they start using #hashtags
Thus, a point-forward solution is not enough; we need to reach back as well
While hashtags are the norm, some people would just like to track keywords
Velocity of tweets can be a major issue
What if a hashtag’s results are greater than 1500 tweets per minute?
Hashtags of archive interest typically spike in velocity and die off in traffic. However, some archives get VERY, VERY big!
And more learning…
URL shortening services are a long-time concern to users and the archiving community
Twitter /search REST API periodically is unresponsive
Twitter /search REST API sometimes glitches and returns duplicate data
People want not only output in HTML, but raw exports for publication, analysis, and real-time consumption (txt, csv, xml, json, etc.)
Twitter engineers contacted us and recommended also incorporating the newly released real-time streams /track, /sample, /firehose
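The duplicate-data glitch is the kind of thing a check before each insert catches. A minimal sketch (Python; `dedupe` is a hypothetical helper, and tweet ids are assumed unique within an archive):

```python
def dedupe(tweets, seen_ids):
    """Filter out tweets whose id has already been archived.
    seen_ids is the set of ids already inserted for this archive;
    it is updated in place as new tweets pass through."""
    fresh = []
    for tweet in tweets:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            fresh.append(tweet)
    return fresh
```

In practice the seen-id set would live in the database (e.g. a unique key on tweet id), but the guard is the same: duplicates from a glitching API pass are silently dropped instead of inflating the archive.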
Recommended “out-of-beta” V2.0
Anticipate #hashtags to archive based upon Twitter trending stats and autocreate archives
Hybrid approach of using /search and /track (real time stream) APIs to handle velocity issues
Check for duplicates “before” inserts
Implement monitoring and “self healing” services
Shortened URLs should be resolved into fully qualified URLs and stored separately for reference (at time of capture)
Create TwapperKeeper API by modularizing the archiving engine into a SOA architecture (/create, /info, /get) for internal and external consumption
Include additional output formats to be provided for download
“Extracts” of large archives should be automatically generated on a daily basis and made available for download
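A daily extract job for those raw exports could be sketched like this (Python; the archive is assumed to arrive as a list of dicts pulled from the database, the format names come from the slides, and the field layout is hypothetical):

```python
import csv
import io
import json

def extract(tweets, fmt):
    """Render an archive's tweets as a downloadable raw export."""
    if fmt == "json":
        return json.dumps(tweets)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=["id", "from_user", "text"])
        writer.writeheader()
        writer.writerows(tweets)
        return buf.getvalue()
    if fmt == "txt":
        return "\n".join(t["text"] for t in tweets)
    raise ValueError(f"unsupported format: {fmt}")
```

Generating these once per day and caching the files keeps large archives downloadable without re-querying the database for every export request.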
VERSION 2.0
Recommended “out-of-beta” V2.0
[V2.0 architecture diagram: a hybrid PHP/curl script archives per #hashtag via the Twitter /search and /track APIs; the Twitter /trends API feeds autocreation of trending archives; short-URL lookup, a file extractor, and monitor/self-heal services run alongside; an API (/create, /info, /get) exposes our database to external sites]
Questions?