We are losing our tweets! An analysis, a prototype, lessons learned, and a proposed third-party solution to the problem. John O’Brien III | @jobrieniii | http://www.linkedin.com/in/jobrieniii


Description: Lessons learned from the TwapperKeeper prototype.


Page 1: We are losing our tweets!

We are losing our tweets!

An analysis, a prototype, lessons learned, and proposed third party solution to the problem

John O’Brien III@jobrieniii

http://www.linkedin.com/in/jobrieniii

Page 2:

Twitter “Primer”

Social network / micro-blogging site

Send / read 140-character messages

You can follow anyone, and they can follow you

Sent messages are delivered to all your followers

Sent messages are also publicly indexed and searchable

Permissions can be established to restrict delivery, but this is not the norm

Page 3:

Problem

As the usage of Twitter has exploded, Twitter’s ability to provide long-term access to tweets that mention key events (typically #hashtagged) has eroded

Page 4:

First, who cares?

Individuals

Bloggers

Conference Attendees / Leaders

Academia / “Web” Ecologists

Media Outlets

Companies

Government

Page 5:

So let’s dive into the problem...

(Diagram: Followers / Search)

Page 6:

Search UI / API Constraints

Limited to keywords, #hashtags, or @mentions within the 140-character body of a tweet

100 tweets × 15 pages = 1,500 tweets per search term

A given keyword exists in search for “around 1.5 weeks but is dynamic and subject to shrink as the number of tweets per day continues to grow.”

– Twitter website
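The 100 × 15 cap above can be illustrated with a short paging sketch (Python here for brevity; the prototype itself was PHP). `fake_search` is a stand-in stub for the real /search endpoint, not Twitter’s actual API:

```python
# Illustrative sketch: paging a search API that returns at most 100 results
# per page and serves at most 15 pages, so any single term can yield at most
# 100 * 15 = 1,500 tweets even if more exist.

MAX_PER_PAGE = 100
MAX_PAGES = 15

def collect_all(search_fn, term):
    """Page through search results until the API stops returning tweets."""
    tweets = []
    for page in range(1, MAX_PAGES + 1):
        batch = search_fn(term, page=page, per_page=MAX_PER_PAGE)
        if not batch:
            break
        tweets.extend(batch)
    return tweets

# Stub standing in for the real /search endpoint: pretend 2,000 tweets exist.
corpus = [{"id": i, "text": f"tweet {i} #demo"} for i in range(2000)]

def fake_search(term, page, per_page):
    start = (page - 1) * per_page
    return corpus[start:start + per_page]

got = collect_all(fake_search, "#demo")  # len(got) == 1500, not 2000
```

The missing 500 tweets are simply unreachable through the search UI/API, which is the core of the archiving problem.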

Page 7:

Hmmmm…

No other “in the cloud” sites were found back in June, only client-side applications and “hacked” custom scripts

RSS feeds were considered but initially dismissed because they typically require an end-user client

The decision was to “build our own” and see if we could solve the problem

Page 8:

A little bit about my thoughts on the SDLC process…

“Minimally Viable” PROTOTYPE

**FOCUS** ON LEARNING

Page 9:

“Minimally Viable” Micro App

What if we could get ahead of the problem and store the data before Twitter “loses” it?

Functional Requirements

Ability for the user to define #hashtags of importance

Create a background script that leverages the Twitter /search REST API to keep an eye on each hashtag and store data in a local database

**Sweep, grab, and record…**

Must be running at all times and publically available

Technical Specs

Built on a LAMP stack, deployed to the cloud, running 24/7/365
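The requirements above amount to a polling loop. A minimal sketch, assuming a stubbed search function in place of the Twitter /search REST API (Python/sqlite here for illustration; the actual prototype was PHP/MySQL):

```python
# "Sweep, grab, and record": for each tracked #hashtag, query search
# (a stub below) and store the results in a local database.
import sqlite3

def init_db(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "  tweet_id INTEGER,"
        "  hashtag  TEXT,"
        "  body     TEXT)"
    )

def sweep(conn, hashtags, search_fn):
    """One pass: grab current results for every tracked tag and record them."""
    for tag in hashtags:
        for tweet in search_fn(tag):
            conn.execute(
                "INSERT INTO tweets (tweet_id, hashtag, body) VALUES (?, ?, ?)",
                (tweet["id"], tag, tweet["text"]),
            )
    conn.commit()

def fake_search(tag):  # stand-in for the Twitter /search REST API
    return [{"id": 1, "text": f"hello {tag}"}, {"id": 2, "text": f"bye {tag}"}]

conn = sqlite3.connect(":memory:")
init_db(conn)
sweep(conn, ["#iranelection"], fake_search)
count = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

In the running service this `sweep` would be invoked on a schedule (e.g. cron) so the archive stays ahead of Twitter’s search window.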

Page 10:

“Minimally Viable” Micro App

(Diagram: a php script queries the Twitter /search API over the internet for each #hashtag and stores the results in our database)

Page 11:

TwapperKeeper.com “BETA” was born on a Saturday and released to the public on Sunday…

Page 12:

And we started to grow and get customer feedback…

Page 13:

And we lived through a key world event…

http://mashable.com/2009/09/16/white-house-records/

Page 14:

So what did we learn?

We need to be whitelisted

People often don’t start archiving until after they start using #hashtags; thus, a point-forward solution is not enough, we also need to reach back

While hashtags are the norm, some people would just like to track keywords

Velocity of tweets can be a major issue: what if a hashtag’s results exceed 1,500 tweets per minute?

Hashtags of archive interest typically spike in velocity and then die off in traffic. However, some archives get VERY, VERY big!

Page 15:

And more learning…

URL-shortening services are a long-standing concern to users and the archiving community

The Twitter /search REST API is periodically unresponsive

The Twitter /search REST API sometimes glitches and returns duplicate data

People want not only HTML output, but raw exports for publication, analysis, and real-time consumption (txt, csv, xml, json, etc.)

Twitter engineers contacted us and recommended also incorporating the newly released real-time streams: /track, /sample, /firehose
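The raw-export request is straightforward to sketch. Below, the same illustrative rows are rendered as CSV and JSON (txt/xml would follow the same pattern); the field names are assumptions, not TwapperKeeper’s real schema:

```python
# Render archived tweet rows in multiple export formats from one source of truth.
import csv, io, json

ROWS = [
    {"tweet_id": 1, "from_user": "alice", "text": "hello #demo"},
    {"tweet_id": 2, "from_user": "bob", "text": "bye #demo"},
]

def to_csv(rows):
    """Serialize rows as CSV with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["tweet_id", "from_user", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Serialize rows as pretty-printed JSON."""
    return json.dumps(rows, indent=2)

csv_export = to_csv(ROWS)
json_export = to_json(ROWS)
```

Keeping the export writers as thin views over the same rows means adding a new format never touches the archiving engine itself.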

Page 16:

Recommended “out-of-beta” V2.0

Anticipate #hashtags to archive based upon Twitter trending stats and autocreate archives

Hybrid approach of using /search and /track (real time stream) APIs to handle velocity issues

Check for duplicates “before” inserts

Implement monitoring and “self healing” services

Shortened URLs should be resolved into fully qualified URLs and stored separately for reference (at time of capture)

Create TwapperKeeper API by modularizing the archiving engine into a SOA architecture (/create, /info, /get) for internal and external consumption

Include additional output formats to be provided for download

“Extracts” of large archives should be automatically generated on a daily basis and made available for download

VERSION 2.0
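The duplicate-check recommendation above can be sketched as a lookup before each insert (Python/sqlite for illustration; the production stack was LAMP). A unique index plus an ignore-on-conflict insert would achieve the same effect at the database layer:

```python
# "Check for duplicates before inserts": look the tweet id up first and skip
# the insert if it is already archived, guarding against /search API glitches
# that return the same tweet twice.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (tweet_id INTEGER, body TEXT)")

def insert_if_new(conn, tweet_id, body):
    """Return True if inserted, False if the tweet was already archived."""
    seen = conn.execute(
        "SELECT 1 FROM archive WHERE tweet_id = ?", (tweet_id,)
    ).fetchone()
    if seen:
        return False
    conn.execute("INSERT INTO archive (tweet_id, body) VALUES (?, ?)",
                 (tweet_id, body))
    return True

first = insert_if_new(conn, 42, "glitchy tweet")
second = insert_if_new(conn, 42, "glitchy tweet")  # duplicate from an API glitch
count = conn.execute("SELECT COUNT(*) FROM archive").fetchone()[0]  # still 1
```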

Page 17:

Recommended “out-of-beta” V2.0

(Diagram: a hybrid php/curl script archives per #hashtag from the Twitter /search, /trends, and /track APIs into our database; supporting components: auto-create from trends, short-URL lookup, a TwapperKeeper api (/create, /info, /get) consumed by external sites, a file extractor, and a monitor-health / self-heal service)

Page 18:

Questions?