Real Time-Big Data-Social Network-Data Science-Gamified! a.k.a. The Cascade Project (Okay … that last part of the title isn’t true) Jason Capehart 12/12/12

Cascade Project


DESCRIPTION

A look at the challenges involved in creating a big data product in the context of the Cascade Project (https://www.cascadeproject.com/)


Page 1: Cascade Project

Real Time-Big Data-Social Network-Data Science-Gamified!

a.k.a. The Cascade Project (Okay … that last part of the title isn’t true)

Jason Capehart 12/12/12

Page 2: Cascade Project

1. Visualization

2. Data

3. Analysis

Page 3: Cascade Project
Page 4: Cascade Project
Page 5: Cascade Project

Show Me!

Page 6: Cascade Project
Page 7: Cascade Project
Page 8: Cascade Project

The Good, The Bad, The Ugly

Page 9: Cascade Project

Store        Examples
Key-Value    Hadoop, Memcached, Redis
Document     MongoDB, CouchDB
Graph        Neo4j, Giraph, Titan
Real Time    Storm, Impala

Surely, You Must Be Joking.

Page 10: Cascade Project
Page 11: Cascade Project

Citation: Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International World Wide Web (WWW) Conference (pp. 591-600). Raleigh, NC: ACM.

Page 12: Cascade Project

$\ln p(x) = -\alpha \ln x + C$

Citation: A. Clauset, C.R. Shalizi, and M.E.J. Newman, "Power-law distributions in empirical data," SIAM Review 51(4), 661-703 (2009). (arXiv:0706.1062, doi:10.1137/070710111)
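A power law p(x) ∝ x^(-α) is a straight line on log-log axes, and its exponent can be estimated by maximum likelihood rather than by fitting that line. The sketch below uses the continuous estimator from Clauset et al., α̂ = 1 + n / Σ ln(x_i / x_min). It is a minimal illustration on synthetic data, not the deck's actual fitting code. The function names, the seed, and the values α = 2.2 and x_min = 1.0 are assumptions made for the example, and follower counts are discrete, so a real fit would use the discrete variant.

```python
import math
import random


def sample_power_law(n, alpha, x_min, rng):
    """Draw n continuous power-law samples by inverse-transform sampling."""
    # For p(x) proportional to x^(-alpha) on x >= x_min, the CDF inverts to
    # x = x_min * (1 - u)^(-1 / (alpha - 1)) with u uniform on [0, 1).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def fit_alpha(xs, x_min):
    """Continuous maximum-likelihood estimate of alpha for the tail x >= x_min."""
    tail = [x for x in xs if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = sample_power_law(20000, alpha=2.2, x_min=1.0, rng=rng)
alpha_hat = fit_alpha(samples, x_min=1.0)
print(round(alpha_hat, 2))  # should land close to the true value of 2.2
```

With n samples in the tail, the standard error of this estimate is roughly (α - 1) / sqrt(n), so 20,000 points pin the exponent down to about two decimal places.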

Page 13: Cascade Project

800,000,000 (that’s a lot of users)

(cost = 200k for fire hose)

Page 14: Cascade Project

Sampled

Not Sampled

Citation: Stumpf, M. P., Wiuf, C., & May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 4221-4224.

Page 15: Cascade Project
Page 16: Cascade Project

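# Pseudo Code (Python-style); `api` stands for an authenticated API client
from random import randint

while True:  # repeat until tired or rate limited
    id_guess = randint(0, 10**9)  # 10**9, because 10^9 is XOR in Python
    user = api.get_user(id=id_guess)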
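The guessing loop above can be made concrete without touching a real API. The following is a minimal sketch in which `fake_lookup` is a hypothetical stand-in for the API call (it pretends every third ID is a live account), and `max_requests` plays the role of "until tired or rate limited". All names here are assumptions for illustration. Because IDs are drawn uniformly from the ID space, every existing account is equally likely to be hit, which is the reason to sample this way instead of crawling follower links.

```python
import random


def sample_valid_ids(lookup, id_space, max_requests, rng):
    """Guess IDs uniformly at random and keep the ones that resolve to a user."""
    found = []
    for _ in range(max_requests):  # stand-in for "until tired or rate limited"
        id_guess = rng.randint(0, id_space)
        user = lookup(id_guess)  # returns None when no such account exists
        if user is not None:
            found.append(user)
    return found


# Hypothetical stand-in for a real API client: every third ID is an account.
def fake_lookup(user_id):
    return {"id": user_id} if user_id % 3 == 0 else None


rng = random.Random(42)  # fixed seed so the sketch is repeatable
users = sample_valid_ids(fake_lookup, id_space=10**9, max_requests=3000, rng=rng)
print(len(users), "accounts found in 3000 guesses")
```

Guesses are drawn with replacement, so a long run can return the same account twice; deduplicating on the ID is cheap if that matters.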

Page 17: Cascade Project
Page 18: Cascade Project

Discrete Power Law vs. Lognormal

Statistic                 Value
Loglikelihood Ratio       89.46
Vuong’s Test Statistic    7.14
p-val (1-sided)           >0.99

Plot legend: Power Law (xmin = 281, α = 2.19), Lognormal
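The comparison above rests on a likelihood ratio test: sum the per-observation log-likelihood differences between the two fitted models, then normalize by their spread. The sketch below shows that calculation under stated assumptions. The input numbers are made up, the function names are hypothetical, and the p-value uses the two-sided normal convention from Clauset et al., so it is not expected to reproduce the one-sided figure on the slide.

```python
import math


def vuong_statistic(loglik_a, loglik_b):
    """Compare two fitted models from their per-observation log-likelihoods.

    Returns (R, v): the log-likelihood ratio and its normalized form.
    Positive values favor model A, negative values favor model B.
    """
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    ratio = sum(diffs)
    mean = ratio / n
    # Standard deviation of the pointwise differences.
    sigma = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    return ratio, ratio / (sigma * math.sqrt(n))


def two_sided_p(v):
    """Probability of a normalized ratio at least this extreme under the null."""
    return math.erfc(abs(v) / math.sqrt(2.0))


# Made-up per-point log-likelihoods, purely for illustration.
ll_a = [-1.0, -1.2, -0.8, -1.1, -0.9]
ll_b = [-1.3, -1.1, -1.4, -1.2, -1.5]
ratio, v = vuong_statistic(ll_a, ll_b)
print(ratio, v, two_sided_p(v))
```

A small p-value means the sign of the ratio is unlikely to be a fluctuation; a large one means the data cannot tell the two models apart.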

Page 19: Cascade Project
Page 20: Cascade Project

Plot legend: Power Law (xmin = 222, α = 2.33), Lognormal, Stretched Exponential

Page 21: Cascade Project

• Conclusions = None!
  – All work is in progress

• Discussion
  – Cascade uses open source
  – Opportunities to give back?

Page 22: Cascade Project

References

1. Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2009). Power-law distributions in empirical data. SIAM Review, 51(4), 661-703. (arXiv:0706.1062, doi:10.1137/070710111)
   – Code: http://tuvalu.santafe.edu/~aaronc/powerlaws/

2. Newman, M. (2005, September-October). Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5), 323-351.

3. Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a Social Network or a News Media? Proceedings of the 19th International World Wide Web (WWW) Conference (pp. 591-600). Raleigh, NC: ACM.

4. Stumpf, M. P., Wiuf, C., & May, R. M. (2005). Subnets of scale-free networks are not scale-free: Sampling properties of networks. Proceedings of the National Academy of Sciences, 4221-4224.