Cascade Project

A look at the challenges involved in creating a big data product in the context of the Cascade Project (https://www.cascadeproject.com/)

Real Time-Big Data-Social Network-Data Science-Gamified!

a.k.a. The Cascade Project (Okay … that last part of the title isn’t true)

Jason Capehart, 12/12/12

1. Visualization

2. Data

3. Analysis

Show Me!

The Good, The Bad, The Ugly

Store       Examples
Key-Value   Hadoop, Memcached, Redis
Document    MongoDB, CouchDB
Graph       Neo4j, Giraph, Titan
Real Time   Storm, Impala

Surely, You Must Be Joking.

Citation: Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International World Wide Web (WWW) Conference (pp. 591-600). Raleigh, NC: ACM.

ln p(x) = -α ln x + C

Citation: A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data" SIAM Review 51(4), 661-703 (2009). (arXiv:0706.1062, doi:10.1137/070710111)
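The exponent α of a power law p(x) ∝ x^(-α) is usually estimated by maximum likelihood rather than by fitting a straight line to a log-log plot. The sketch below is a minimal illustration using the continuous-case estimator from Clauset et al., α̂ = 1 + n / Σ ln(x_i / xmin). It is not the analysis behind these slides: the synthetic data, the seed, and xmin = 1.0 are assumptions chosen for the example, and the slides fit a discrete power law, for which this formula is only an approximation.

```python
import math
import random


def sample_power_law(n, alpha, xmin, rng):
    """Draw n samples from a continuous power law p(x) ~ x^(-alpha), x >= xmin."""
    # Inverse-transform sampling: x = xmin * (1 - u)^(-1 / (alpha - 1))
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(data, xmin):
    """Continuous maximum-likelihood estimate of alpha for the tail x >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    alpha_hat = 1.0 + n / sum(math.log(x / xmin) for x in tail)
    std_err = (alpha_hat - 1.0) / math.sqrt(n)  # asymptotic standard error
    return alpha_hat, std_err


# Synthetic data only; alpha = 2.19 is borrowed from the slides for illustration.
rng = random.Random(42)
samples = sample_power_law(50_000, alpha=2.19, xmin=1.0, rng=rng)
alpha_hat, std_err = fit_alpha(samples, xmin=1.0)
print(f"alpha_hat = {alpha_hat:.3f} +/- {std_err:.3f}")
```

On real data, xmin is not known in advance and has to be chosen as well, for example by minimizing the Kolmogorov-Smirnov distance as described in the cited paper.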

800,000,000 (that’s a lot of users)

(cost = 200k for fire hose)

Sampled

Not Sampled

Citation: Stumpf, M. P., Wiuf, C., & May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 102(12), 4221-4224.

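# Pseudo Code

from random import randint

# `api` is assumed to be an authenticated Twitter API client
# Repeat until tired or rate limited
while True:
    id_guess = randint(0, 10**9)
    user = api.get_user(id=id_guess)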
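The random-ID idea can be tried end to end without touching a real service. The sketch below is a minimal, self-contained version: `FakeApi`, its `rate_limit` argument, and the tiny ID space are all hypothetical stand-ins invented for this example, not part of any real client library. The loop guesses IDs uniformly at random, keeps the ones that resolve to a user, and stops when the stub reports that the rate limit is exhausted.

```python
import random


class FakeApi:
    """Hypothetical stand-in for a rate-limited user-lookup client."""

    def __init__(self, valid_ids, rate_limit):
        self.valid_ids = set(valid_ids)
        self.calls_left = rate_limit

    def get_user(self, id):
        if self.calls_left <= 0:
            raise RuntimeError("rate limited")
        self.calls_left -= 1
        # Return a user record if the ID exists, otherwise None
        return {"id": id} if id in self.valid_ids else None


def sample_users(api, max_id, rng):
    """Guess IDs uniformly at random and keep the ones that resolve to a user."""
    found = []
    while True:
        id_guess = rng.randint(0, max_id)
        try:
            user = api.get_user(id=id_guess)
        except RuntimeError:
            break  # rate limited: stop sampling
        if user is not None:
            found.append(user)
    return found


rng = random.Random(0)
api = FakeApi(valid_ids=range(0, 1000, 3), rate_limit=500)  # every third ID is "taken"
users = sample_users(api, max_id=999, rng=rng)
print(len(users), "users found in 500 lookups")
```

Because every ID is equally likely to be guessed, the users that come back are a uniform sample of the accounts in the ID range, which is the property the sampling discussion relies on. The same ID can be drawn more than once, so a real collector would also deduplicate.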

Discrete Power Law vs. Lognormal

Loglikelihood Ratio      89.46
Vuong’s Test Statistic   7.14
p-val (1-sided)          >0.99
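The quantities in this comparison can be computed from the pointwise log-likelihoods of two fitted models. The sketch below is a minimal illustration, not the code behind the numbers above: it uses synthetic continuous power-law data and compares a fitted power law against a fitted shifted exponential (chosen because both densities are simple to normalize on x ≥ xmin), whereas the slides compare a discrete power law with a lognormal. The sign and one-sided p-value conventions are assumptions stated in the comments and may differ from those used on the slide.

```python
import math
import random


def vuong_test(loglik_a, loglik_b):
    """Return (log-likelihood ratio, normalized statistic, one-sided p-value)."""
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    ratio = sum(diffs)  # positive values favor model A
    mean = ratio / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    stat = ratio / (sigma * math.sqrt(n))
    # Assumed convention: p = P(Z >= stat) for a standard normal Z
    p_one_sided = 0.5 * math.erfc(stat / math.sqrt(2.0))
    return ratio, stat, p_one_sided


# Synthetic power-law data (assumed parameters, for illustration only)
rng = random.Random(7)
xmin, alpha_true = 1.0, 2.5
data = [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0)) for _ in range(5000)]

# Maximum-likelihood fits of both candidate models on x >= xmin
alpha_hat = 1.0 + len(data) / sum(math.log(x / xmin) for x in data)
lam_hat = len(data) / sum(x - xmin for x in data)

# Pointwise log-likelihoods under each fitted model
ll_power = [math.log((alpha_hat - 1.0) / xmin) - alpha_hat * math.log(x / xmin) for x in data]
ll_expon = [math.log(lam_hat) - lam_hat * (x - xmin) for x in data]

ratio, stat, p = vuong_test(ll_power, ll_expon)
print(f"ratio = {ratio:.2f}, statistic = {stat:.2f}, one-sided p = {p:.3g}")
```

A large positive statistic says the first model fits significantly better than the second. A statistic near zero means the data cannot distinguish the two, which is a common outcome when comparing a power law with a lognormal.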

[Figure legend: Power Law (xmin = 281, α = 2.19), Lognormal]

[Figure legend: Power Law (xmin = 222, α = 2.33), Lognormal, Stretched Exponential]

• Conclusions = None!
  - All work is in progress

• Discussion
  - Cascade uses open source
  - Opportunities to give back?

References

1. A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data" SIAM Review 51(4), 661-703 (2009). (arXiv:0706.1062, doi:10.1137/070710111). Code: http://tuvalu.santafe.edu/~aaronc/powerlaws/

2. Newman, M. (2005, September-October). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5), 323-351.

3. Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International World Wide Web (WWW) Conference (pp. 591-600). Raleigh, NC: ACM.

4. Stumpf, M. P., Wiuf, C., & May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 102(12), 4221-4224.