Distilling Data Exhaust: How to Surface Insights & Build Data Products. Feb 2, 2011. Peter Skomoroch, LinkedIn, @peteskomoroch

O'Reilly Strata: Distilling Data Exhaust


DESCRIPTION

Talk from the first O'Reilly Strata, Feb 2011. Learn how to leverage data exhaust, the digital byproduct of our online activities, to solve problems and discover insights about the world around you. We will walk through a real-world example that combines several datasets and statistical techniques to discover insights and make predictions about attendees at O'Reilly Strata. Includes a preview of some of the technology behind LinkedIn Skills, which I launched in a keynote with DJ Patil the following day. Video: http://blip.tv/oreilly-promos/distilling-data-exhaust-4780870


Page 1: O'Reilly Strata: Distilling Data Exhaust

Distilling Data Exhaust

How to Surface Insights & Build Data Products

Feb 2, 2011

Peter Skomoroch

LinkedIn

@peteskomoroch

Page 2: O'Reilly Strata: Distilling Data Exhaust

What is Data Exhaust?

Page 3: O'Reilly Strata: Distilling Data Exhaust

What is Data Exhaust?

My Delicious Tags

Page 4: O'Reilly Strata: Distilling Data Exhaust

What is Data Exhaust?

Words I use on Twitter

Page 5: O'Reilly Strata: Distilling Data Exhaust

What can you do with it?

•Data has value

•I’ll share some lessons I’ve learned about how to extract that value

•We’ll go through a case study

Page 6: O'Reilly Strata: Distilling Data Exhaust

Part 1) 10 Lessons Learned

Page 7: O'Reilly Strata: Distilling Data Exhaust

1) Choose a meaningful problem

http://www.flickr.com/photos/aloshbennett/

•Find pain points

•Work on stuff that matters

•Look for underutilized data

Page 8: O'Reilly Strata: Distilling Data Exhaust

2) Find or collect relevant data

•DataWrangling

•InfoChimps

•Pete Warden

•Factual, SimpleGeo

•Mechanical Turk

Page 9: O'Reilly Strata: Distilling Data Exhaust

3) Raw is better than processed

•Normalization could be incorrect

•Data might be lost or corrupted

•Good approach: public.resource.org

http://www.flickr.com/photos/nedraggett/347280918/

Page 10: O'Reilly Strata: Distilling Data Exhaust

4) Guide user input when you can

•Auto suggest

•Validate inputs

•Collect tags, votes

•Makes data scrubbing easier
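
The auto-suggest and validation bullets above can be sketched roughly as follows. This is a minimal illustration, not the approach used in the talk: the vocabulary `KNOWN_SKILLS` and the helpers `suggest` and `validate` are hypothetical names, and a real system would likely rank suggestions by popularity rather than alphabetically.

```python
import bisect

# Hypothetical canonical vocabulary, stored lowercase and sorted so that
# prefix lookup can use binary search.
KNOWN_SKILLS = sorted(["hadoop", "machine learning", "mapreduce", "pig", "python", "r"])

def suggest(prefix, limit=5):
    """Return up to `limit` known skills that start with `prefix`."""
    p = prefix.strip().lower()
    start = bisect.bisect_left(KNOWN_SKILLS, p)
    out = []
    for skill in KNOWN_SKILLS[start:]:
        if not skill.startswith(p) or len(out) >= limit:
            break
        out.append(skill)
    return out

def validate(raw):
    """Normalize free text to a canonical skill, or return None if unknown."""
    cleaned = " ".join(raw.lower().split())
    i = bisect.bisect_left(KNOWN_SKILLS, cleaned)
    if i < len(KNOWN_SKILLS) and KNOWN_SKILLS[i] == cleaned:
        return cleaned
    return None

print(suggest("ma"))
print(validate("  Machine   Learning "))
```

Steering users toward canonical values at entry time means there are fewer spelling variants to reconcile later, which is the sense in which guided input makes scrubbing easier.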

Page 11: O'Reilly Strata: Distilling Data Exhaust

5) Solve easier problems first

http://where2conf.com/where2010/public/schedule/detail/12400

Page 12: O'Reilly Strata: Distilling Data Exhaust

6) Build a baseline model quickly

•Iterate rapidly after baseline is done

•Measure accuracy on hold out test set
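
One way to read the two bullets above: start with the simplest possible model and score it on data it never saw. The sketch below assumes a made-up labeling task and hypothetical helper names (`train_test_split`, `fit_majority_baseline`, `accuracy`); the "model" simply predicts the most common training label, which gives later iterations a number to beat.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out the last `test_fraction` for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_majority_baseline(train):
    """The baseline 'model' is just the most common label in the training set."""
    return Counter(label for _, label in train).most_common(1)[0][0]

def accuracy(predicted_label, test):
    """Fraction of held-out examples whose label matches the constant prediction."""
    return sum(1 for _, label in test if label == predicted_label) / len(test)

# Made-up (features, label) rows, purely for illustration.
rows = [({"id": i}, "engineer" if i % 4 else "analyst") for i in range(100)]
train, test = train_test_split(rows)
baseline = fit_majority_baseline(train)
print(baseline, accuracy(baseline, test))
```

Any more elaborate model should be measured against the same held-out split so that improvements are comparable.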

Page 13: O'Reilly Strata: Distilling Data Exhaust

7) Test code on sample data

•Build logical sample data
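
As an illustration of testing on sample data, the sketch below checks a small counting function against a hand-built sample whose correct answer is known in advance. The function `count_skills` and the `SAMPLE` rows are hypothetical; the point is that the sample is designed to hit edge cases (duplicates, casing, blanks, an empty record) before the code ever runs on the full dataset.

```python
from collections import Counter

def count_skills(profiles):
    """Count how many profiles mention each skill (once per profile per skill)."""
    counts = Counter()
    for skills in profiles:
        counts.update(set(s.strip().lower() for s in skills if s.strip()))
    return counts

# Hand-built sample chosen to exercise edge cases.
SAMPLE = [
    ["Hadoop", "hadoop", "Pig"],  # duplicate skill with different casing
    ["pig ", ""],                 # trailing whitespace and a blank entry
    [],                           # empty profile
]

def test_count_skills():
    counts = count_skills(SAMPLE)
    assert counts["hadoop"] == 1  # duplicate within one profile counts once
    assert counts["pig"] == 2     # casing and whitespace are normalized
    assert "" not in counts       # blank entries are dropped

test_count_skills()
```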

Page 14: O'Reilly Strata: Distilling Data Exhaust

8) Use Continuous Integration

Page 15: O'Reilly Strata: Distilling Data Exhaust

8) Use Continuous Integration

https://github.com/matthayes/azkaban

Page 16: O'Reilly Strata: Distilling Data Exhaust

9) Pick the right tool for the job

Page 17: O'Reilly Strata: Distilling Data Exhaust

10) Developer productivity is key

•Fast Iterations: Python, Ruby, Pig

•Convention over configuration

•Embrace GitHub, DevOps, & EC2

•Currently using JRuby & Sinatra

Page 18: O'Reilly Strata: Distilling Data Exhaust

SNA Team: sna-projects.com

Page 19: O'Reilly Strata: Distilling Data Exhaust

Part 2) Case Study: Strata

Page 20: O'Reilly Strata: Distilling Data Exhaust

Conference Insights

•I’d like to understand the audience at Strata

•What companies do we work for?

•What are the top skills at Strata?

•Do attendees cluster together based on skill?

Page 21: O'Reilly Strata: Distilling Data Exhaust

Round up a Data Viz team

Page 22: O'Reilly Strata: Distilling Data Exhaust

Use the right tools

•Data Crunching: Hadoop, Pig

•Statistical Work: Python, NumPy

•Visualization: Gephi

Page 23: O'Reilly Strata: Distilling Data Exhaust

Find Some Data: Attendees

Page 24: O'Reilly Strata: Distilling Data Exhaust

Add LinkedIn data

Page 27: O'Reilly Strata: Distilling Data Exhaust

Extract Skills from Profiles

What are skills?

Extract
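
The slides do not describe how skills are extracted, so the following is only a rough sketch of one simple approach: look up word n-grams from the profile text in a known vocabulary. The vocabulary `SKILL_PHRASES` and the function `extract_skills` are hypothetical, and this is not presented as the method behind LinkedIn Skills.

```python
import re

# Hypothetical skill vocabulary; a real one would be far larger.
SKILL_PHRASES = {"hadoop", "machine learning", "data mining", "python"}
MAX_WORDS = max(len(p.split()) for p in SKILL_PHRASES)

def extract_skills(text):
    """Return the set of known skill phrases found in free text via n-gram lookup."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    found = set()
    for n in range(1, MAX_WORDS + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in SKILL_PHRASES:
                found.add(phrase)
    return found

profile = "Built Machine Learning pipelines on Hadoop; some data-mining in Python."
print(sorted(extract_skills(profile)))
```

Tokenizing on letters and digits means hyphenated variants such as "data-mining" still match the two-word phrase in the vocabulary.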

Page 28: O'Reilly Strata: Distilling Data Exhaust

Build Hadoop Skill Graph

Discover

Core Talent Graph for “Hadoop”

Igor Perisic

Page 29: O'Reilly Strata: Distilling Data Exhaust

The Talent Graph

Page 40: O'Reilly Strata: Distilling Data Exhaust

We can combine skills with the attendee directory to better understand Strata

Page 41: O'Reilly Strata: Distilling Data Exhaust

What are skills @Strata?

Page 42: O'Reilly Strata: Distilling Data Exhaust

Extract skills for attendees

Page 43: O'Reilly Strata: Distilling Data Exhaust

Top Skills @Strata

Page 44: O'Reilly Strata: Distilling Data Exhaust

Information Overload

Page 45: O'Reilly Strata: Distilling Data Exhaust

Relevance Measures

Jaccard Similarity

TFIDF
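
The slides name these two measures without giving formulas, so the sketch below uses textbook definitions (TF-IDF has several variants; this is one common form). The skill lists are made up. The idea is that a skill listed by nearly everyone, such as a generic one, scores low, while a skill that is distinctive for the group scores high.

```python
import math

def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 if both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def tf_idf(term, doc, corpus):
    """Term frequency in `doc` times log(N / document frequency) over `corpus`."""
    tf = doc.count(term) / len(doc) if doc else 0.0
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

# Made-up skill lists: one for the conference, two for a background population.
strata = ["hadoop", "statistics", "management"]
corpus = [strata, ["management", "sales"], ["management", "java"]]

print(jaccard(["hadoop", "pig"], ["hadoop", "python"]))
print(tf_idf("management", strata, corpus))  # appears everywhere, so it scores zero
print(tf_idf("hadoop", strata, corpus))      # distinctive, so it scores above zero
```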

Page 46: O'Reilly Strata: Distilling Data Exhaust

Relevant Skills @Strata

Page 47: O'Reilly Strata: Distilling Data Exhaust

Do attendees cluster together based on skills?

Page 48: O'Reilly Strata: Distilling Data Exhaust

•Compute similarity of attendees based on skill vector distance

•Cluster similarities in Gephi
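
A minimal sketch of the first step, under stated assumptions: attendees are represented as sparse skill vectors, and cosine similarity stands in for the "skill vector distance" (the slides do not say which measure was used). Pairs above a hypothetical threshold are written as an edge list with Source, Target, and Weight columns, a layout Gephi's CSV import can read as an edge table; the clustering and layout would then be done inside Gephi. All names and data here are made up.

```python
import csv
import io
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {skill: weight} dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def edge_list_csv(vectors, threshold=0.3):
    """Build a Source,Target,Weight CSV of attendee pairs at or above `threshold`."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine(vectors[a], vectors[b])
        if sim >= threshold:
            writer.writerow([a, b, round(sim, 3)])
    return buf.getvalue()

# Made-up attendees with binary skill vectors.
attendees = {
    "alice": {"hadoop": 1.0, "pig": 1.0},
    "bob": {"hadoop": 1.0, "python": 1.0},
    "carol": {"design": 1.0},
}
print(edge_list_csv(attendees))
```

Thresholding keeps the graph sparse; without it every pair with any overlap becomes an edge and the visualization is hard to read.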

Page 50: O'Reilly Strata: Distilling Data Exhaust

More analysis on the way

•DJ Patil has a session tomorrow

•We’ll blog about additional Strata insights soon

Page 51: O'Reilly Strata: Distilling Data Exhaust

Questions?

Peter Skomoroch

LinkedIn

@peteskomoroch

http://linkedin.com/in/peterskomoroch

Blog: DataWrangling.com

Page 52: O'Reilly Strata: Distilling Data Exhaust

Appendix
