O'Reilly Strata: Distilling Data Exhaust

Distilling Data Exhaust

How to Surface Insights & Build Data Products

Feb 2, 2011
Peter Skomoroch
LinkedIn
@peteskomoroch

What is Data Exhaust?

My Delicious Tags

What is Data Exhaust?

Words I use on Twitter

What can you do with it?

•Data has value

•I’ll share some lessons I’ve learned about how to extract that value

•We’ll go through a case study

Part 1) 10 Lessons Learned

1) Choose a meaningful problem

http://www.flickr.com/photos/aloshbennett/

•Find pain points

•Work on stuff that matters

•Look for underutilized data

2) Find or collect relevant data

•DataWrangling

•InfoChimps

•Pete Warden

•Factual, SimpleGeo

•Mechanical Turk

3) Raw is better than processed

•Normalization could be incorrect

•Data might be lost or corrupted

•Good approach: public.resource.org

http://www.flickr.com/photos/nedraggett/347280918/

4) Guide user input when you can

•Auto suggest

•Validate inputs

•Collect tags, votes

•Makes data scrubbing easier
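
A minimal sketch of auto-suggest and input validation, assuming a small canonical vocabulary. The names `KNOWN_SKILLS`, `suggest`, and `validate` are hypothetical and do not come from the talk; a real system would load its vocabulary from its own data.

```python
# Hypothetical canonical vocabulary, for illustration only.
KNOWN_SKILLS = ["Hadoop", "Pig", "Python", "NumPy", "Gephi"]


def suggest(prefix, vocabulary=KNOWN_SKILLS, limit=5):
    """Return up to `limit` vocabulary entries starting with `prefix` (case-insensitive)."""
    p = prefix.strip().lower()
    if not p:
        return []
    return [term for term in vocabulary if term.lower().startswith(p)][:limit]


def validate(raw, vocabulary=KNOWN_SKILLS):
    """Map free-text input to its canonical spelling, or None if it is not recognized."""
    lookup = {term.lower(): term for term in vocabulary}
    return lookup.get(raw.strip().lower())


print(suggest("p"))          # candidates offered while the user types
print(validate(" hadoop "))  # canonical form stored instead of the raw text
```

Storing the canonical form at input time is what makes later scrubbing easier: variants such as " hadoop " and "Hadoop" never reach the dataset as separate values.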

5) Solve easier problems first

http://where2conf.com/where2010/public/schedule/detail/12400

6) Build a baseline model quickly

•Iterate rapidly after baseline is done

•Measure accuracy on a held-out test set
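
One way to get a baseline quickly is a majority-class predictor scored on a held-out split. The sketch below assumes toy `(feature, label)` rows; the data and function names are hypothetical, and the talk does not say which baseline was used.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of `rows` and hold out the last `test_fraction` for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[:-n_test], shuffled[-n_test:]


def fit_majority_baseline(train):
    """Return the most common label in the training rows."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predicted_label, test):
    """Fraction of held-out rows whose label equals the constant prediction."""
    return sum(1 for _, label in test if label == predicted_label) / len(test)


# Hypothetical toy data: 75 rows labeled "data", 25 labeled "other".
rows = [(i, "data" if i % 4 else "other") for i in range(100)]
train, test = train_test_split(rows)
baseline = fit_majority_baseline(train)
print(baseline, accuracy(baseline, test))
```

Any later model has to beat this number on the same held-out rows before it is worth iterating on.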

7) Test code on sample data

•Build logical sample data

8) Use Continuous Integration

https://github.com/matthayes/azkaban

9) Pick the right tool for the job

10) Developer productivity is key

•Fast Iterations: Python, Ruby, Pig

•Convention over configuration

•Embrace Github, DevOps, & EC2

•Currently using JRuby & Sinatra

SNA Team: sna-projects.com

Part 2) Case Study: Strata

Conference Insights

•I’d like to understand the audience at Strata

•What companies do we work for?

•What are the top skills at Strata?

•Do attendees cluster together based on skill?

Round up a Data Viz team

Use the right tools

•Data Crunching: Hadoop, Pig

•Statistical Work: Python, NumPy

•Visualization: Gephi

Find Some Data: Attendees

Add LinkedIn data

Extract Skills from Profiles

What are skills?

Extract

Build Hadoop Skill Graph

Discover

Core Talent Graph for “Hadoop”
Igor Perisic

The Talent Graph

We can combine skills with the attendee directory to better understand Strata

What are skills @Strata?

Extract skills for attendees

Top Skills @Strata
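
Once each attendee has a set of extracted skills, ranking the top skills is a counting job. A minimal sketch, assuming a hypothetical in-memory mapping `attendee_skills`; at scale the same count would be done in Hadoop or Pig.

```python
from collections import Counter

# Hypothetical attendee -> skills mapping, for illustration only.
attendee_skills = {
    "attendee_1": {"Hadoop", "Pig", "Python"},
    "attendee_2": {"Python", "NumPy"},
    "attendee_3": {"Hadoop", "Python", "Gephi"},
}


def top_skills(skills_by_attendee, n=10):
    """Return the `n` most common skills as (skill, attendee_count) pairs."""
    counts = Counter()
    for skills in skills_by_attendee.values():
        counts.update(skills)  # skills are sets, so each attendee counts once per skill
    # Sort by count (descending), then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


print(top_skills(attendee_skills, n=2))
```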

Information Overload

Relevance Measures

Jaccard Similarity
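
Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The slides do not show how it was applied to skills, so the sketch below only illustrates the measure on two hypothetical skill sets.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# One shared skill out of four distinct skills.
print(jaccard({"Hadoop", "Pig", "Python"}, {"Python", "NumPy"}))
```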

Relevant Skills @Strata

Do attendees cluster together based on skills?

•Compute similarity of attendees based on skill vector distance

•Cluster similarities in Gephi
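
A minimal sketch of the two steps above, under stated assumptions: skills are encoded as binary vectors, cosine similarity stands in for the unspecified "skill vector distance", and the result is written as a Source/Target/Weight edge list, a CSV layout Gephi can import. The attendee names and helper functions are hypothetical.

```python
import csv
import io
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def skill_vectors(skills_by_attendee):
    """Encode each attendee as a binary vector over the sorted skill vocabulary."""
    vocab = sorted(set().union(*skills_by_attendee.values()))
    return {
        name: [1 if skill in skills else 0 for skill in vocab]
        for name, skills in skills_by_attendee.items()
    }


def similarity_edges(skills_by_attendee, threshold=0.0):
    """Return (source, target, weight) for attendee pairs with similarity above `threshold`."""
    vectors = skill_vectors(skills_by_attendee)
    edges = []
    for a, b in combinations(sorted(vectors), 2):
        weight = cosine(vectors[a], vectors[b])
        if weight > threshold:
            edges.append((a, b, round(weight, 3)))
    return edges


def to_edge_csv(edges):
    """Serialize edges as CSV text with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


# Hypothetical attendees, for illustration only.
sample = {"a": {"Hadoop", "Pig"}, "b": {"Hadoop", "Python"}, "c": {"Gephi"}}
print(to_edge_csv(similarity_edges(sample)))
```

Dropping zero-weight pairs keeps the graph sparse, which matters because the number of attendee pairs grows quadratically.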

More analysis on the way

•DJ Patil has a session tomorrow

•We’ll blog about additional Strata insights soon

Questions?
Peter Skomoroch
LinkedIn
@peteskomoroch
http://linkedin.com/in/peterskomoroch
Blog: DataWrangling.com

Appendix
