O'Reilly Strata: Distilling Data Exhaust

DESCRIPTION

Talk from the first O'Reilly Strata, Feb 2011. Learn how to use data exhaust, the digital byproduct of our online activities, to solve problems and discover insights about the world around you. We walk through a real-world example that combines several datasets and statistical techniques to discover insights and make predictions about attendees at O'Reilly Strata. Includes a preview of some of the technology behind LinkedIn Skills, which I launched in a keynote with DJ Patil the following day. Video: http://blip.tv/oreilly-promos/distilling-data-exhaust-4780870

Distilling Data Exhaust

How to Surface Insights & Build Data Products

Feb 2, 2011

Peter Skomoroch

LinkedIn

@peteskomoroch

What is Data Exhaust?

My Delicious Tags

What is Data Exhaust?

Words I use on Twitter

What can you do with it?

•Data has value

•I’ll share some lessons I’ve learned about how to extract that value

•We’ll go through a case study

Part 1) 10 Lessons Learned

1) Choose a meaningful problem

http://www.flickr.com/photos/aloshbennett/

•Find pain points

•Work on stuff that matters

•Look for underutilized data

2) Find or collect relevant data

•DataWrangling

•InfoChimps

•Pete Warden

•Factual, SimpleGeo

•Mechanical Turk

3) Raw is better than processed

•Normalization could be incorrect

•Data might be lost or corrupted

•Good approach: public.resource.org

http://www.flickr.com/photos/nedraggett/347280918/

4) Guide user input when you can

•Auto suggest

•Validate inputs

•Collect tags, votes

•Makes data scrubbing easier
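As a rough illustration of the auto-suggest and validation bullets above, here is a minimal sketch in Python. The vocabulary `KNOWN_SKILLS` and the function names `suggest` and `validate` are hypothetical and not from the talk; the idea is only that steering input toward a canonical vocabulary leaves less to scrub later.

```python
from bisect import bisect_left

# Hypothetical vocabulary of canonical skill names (illustrative only).
KNOWN_SKILLS = sorted(["hadoop", "machine learning", "mapreduce", "pig", "python"])

def suggest(prefix, vocabulary=KNOWN_SKILLS, limit=5):
    """Return up to `limit` entries of a sorted vocabulary that start with `prefix`."""
    prefix = prefix.strip().lower()
    if not prefix:
        return []
    # Binary search finds the first entry that could match the prefix.
    start = bisect_left(vocabulary, prefix)
    matches = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
        if len(matches) == limit:
            break
    return matches

def validate(entry, vocabulary=KNOWN_SKILLS):
    """Normalize case and whitespace; return the canonical term or None if unknown."""
    normalized = " ".join(entry.lower().split())
    return normalized if normalized in vocabulary else None

print(suggest("ma"))
print(validate("  Machine   Learning "))
```

Entries that fail `validate` could be routed to a review queue instead of being stored as-is.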

5) Solve easier problems first

http://where2conf.com/where2010/public/schedule/detail/12400

6) Build a baseline model quickly

•Iterate rapidly after baseline is done

•Measure accuracy on a held-out test set
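A minimal sketch of what a quick baseline with a held-out test set might look like. The majority-class predictor and the synthetic rows are assumptions chosen for brevity, not the model used in the talk; any later model should beat this number on the same split.

```python
import random
from collections import Counter

def split_holdout(rows, test_fraction=0.2, seed=0):
    """Shuffle rows deterministically and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_majority_baseline(train):
    """Return the most common label in the training rows."""
    labels = Counter(label for _, label in train)
    return labels.most_common(1)[0][0]

def accuracy(predicted_label, test):
    """Fraction of held-out rows whose label equals the constant prediction."""
    correct = sum(1 for _, label in test if label == predicted_label)
    return correct / len(test)

# Synthetic (features, label) rows, made up for illustration.
rows = [({"id": i}, "a" if i % 4 else "b") for i in range(100)]
train, test = split_holdout(rows)
baseline = fit_majority_baseline(train)
print(baseline, accuracy(baseline, test))
```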

7) Test code on sample data

•Build logical sample data
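One way to read "logical sample data" is a tiny hand-built input whose correct output is known in advance, including edge cases. The sketch below assumes a hypothetical `count_companies` step and made-up records; it shows the pattern, not code from the talk.

```python
from collections import Counter

def count_companies(attendees):
    """Count attendees per company, ignoring case, padding, and blank values."""
    counts = Counter()
    for person in attendees:
        company = (person.get("company") or "").strip().lower()
        if company:
            counts[company] += 1
    return counts

# Hand-built sample: a normal case, a case/whitespace variant,
# an empty value, and a missing field.
SAMPLE = [
    {"name": "A", "company": "LinkedIn"},
    {"name": "B", "company": " linkedin "},
    {"name": "C", "company": ""},
    {"name": "D"},
    {"name": "E", "company": "O'Reilly"},
]

def test_count_companies():
    counts = count_companies(SAMPLE)
    assert dict(counts) == {"linkedin": 2, "o'reilly": 1}

test_count_companies()
```

Running the same function on the small sample before launching a full job catches logic errors in seconds rather than after a long cluster run.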

8) Use Continuous Integration

https://github.com/matthayes/azkaban

9) Pick the right tool for the job

10) Developer productivity is key

•Fast Iterations: Python, Ruby, Pig

•Convention over configuration

•Embrace Github, DevOps, & EC2

•Currently using JRuby & Sinatra

SNA Team: sna-projects.com

Part 2) Case Study: Strata

Conference Insights

•I’d like to understand the audience at Strata

•What companies do we work for?

•What are the top skills at Strata?

•Do attendees cluster together based on skill?

Round up a Data Viz team

Use the right tools

•Data Crunching: Hadoop, Pig

•Statistical Work: Python, NumPy

•Visualization: Gephi

Find Some Data: Attendees

Add LinkedIn data

Extract Skills from Profiles

What are skills?

Extract

Build Hadoop Skill Graph

Discover

Core Talent Graph for “Hadoop”

Igor Perisic

The Talent Graph

We can combine skills with the attendee directory to better understand Strata

What are skills @Strata?

Extract skills for attendees

Top Skills @Strata
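The slides do not show how skills were extracted or counted. As a hedged sketch, one simple approach is to match a dictionary of known skill phrases against profile text and count how many attendees mention each. The `SKILLS` list and `PROFILES` text below are hypothetical, not real attendee data.

```python
import re
from collections import Counter

# Hypothetical skill dictionary and profile snippets (illustrative only).
SKILLS = ["hadoop", "machine learning", "python", "data mining"]
PROFILES = {
    "attendee_1": "Built Hadoop pipelines; machine learning in Python.",
    "attendee_2": "Data mining and machine learning researcher.",
    "attendee_3": "Python developer working on Hadoop.",
}

def extract_skills(text, skills=SKILLS):
    """Return the set of known skill phrases that appear as whole words in text."""
    lowered = text.lower()
    found = set()
    for skill in skills:
        if re.search(r"\b" + re.escape(skill) + r"\b", lowered):
            found.add(skill)
    return found

def top_skills(profiles, n=3):
    """Count how many profiles mention each skill; return the n most common."""
    counts = Counter()
    for text in profiles.values():
        counts.update(extract_skills(text))
    return counts.most_common(n)

print(top_skills(PROFILES))
```

Raw counts like these tend to favor skills that are common everywhere, which is the information-overload problem the next slides address.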

Information Overload

Relevance Measures

Jaccard Similarity
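For reference, Jaccard similarity between two sets is the size of their intersection divided by the size of their union. A minimal sketch follows; treating the empty-versus-empty case as 0.0 is a convention chosen here, and the example sets are made up.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # Convention chosen here: two empty sets score 0.
    return len(a & b) / len(union)

# Example: two skill sets sharing 2 of 4 distinct skills.
print(jaccard({"hadoop", "pig", "python"}, {"python", "pig", "r"}))  # 0.5
```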

TFIDF
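TF-IDF weights a term by how often it appears in one document, discounted by how many documents contain it. The sketch below uses one common variant (relative term frequency times the natural log of N over document frequency); the talk does not say which variant was used, and the groups and terms are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score terms per document. documents: {name: [term, ...]} -> {name: {term: score}}."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))
    scores = {}
    for name, terms in documents.items():
        counts = Counter(terms)
        total = len(terms)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical groups of people, each reduced to a bag of skill terms.
GROUPS = {
    "strata": ["hadoop", "hadoop", "management", "python"],
    "other_a": ["management", "sales"],
    "other_b": ["management", "marketing"],
}
scores = tf_idf(GROUPS)
print(sorted(scores["strata"].items(), key=lambda kv: -kv[1]))
```

In this toy input, "management" appears in every group and scores zero, while "hadoop" stands out for the "strata" group.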

Relevant Skills @Strata

Do attendees cluster together based on skills?

•Compute similarity of attendees based on skill vector distance

•Cluster similarities in Gephi
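The two steps above could be sketched as follows. The slides say "skill vector distance" without naming a measure, so cosine similarity is an assumption here, as are the `threshold` value and the sample vectors. The output is a CSV edge list with `Source`, `Target`, and `Weight` columns, a layout Gephi's spreadsheet import accepts; the clustering itself would then happen inside Gephi.

```python
import csv
import io
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {skill: weight}."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_edges(vectors, threshold=0.3):
    """Yield (a, b, similarity) for attendee pairs at or above the threshold."""
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine_similarity(vectors[a], vectors[b])
        if sim >= threshold:
            yield a, b, round(sim, 4)

def to_gephi_csv(edges):
    """Render edges as CSV text with Source,Target,Weight columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()

# Hypothetical attendee skill vectors (illustrative only).
VECTORS = {
    "attendee_1": {"hadoop": 1.0, "pig": 1.0},
    "attendee_2": {"hadoop": 1.0, "python": 1.0},
    "attendee_3": {"design": 1.0},
}
print(to_gephi_csv(similarity_edges(VECTORS)))
```

Comparing all pairs is quadratic in the number of attendees, which is fine for a conference-sized list but would need a different strategy at larger scale.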

More analysis on the way

•DJ Patil has a session tomorrow

•We’ll blog about additional Strata insights soon

Questions?

Peter Skomoroch

LinkedIn

@peteskomoroch

http://linkedin.com/in/peterskomoroch

Blog: DataWrangling.com

Appendix