Strata-Hadoop World 2013
Building better analytics workflows
www.datapad.io
Wes McKinney
• Former quant @ AQR (a hedge fund)
• Creator of Pandas project for Python
• Author of Python for Data Analysis — O’Reilly
• Founder and CEO of DataPad
@wesmckinn
• > 20k copies since Oct 2012
• Bringing many new people to Python and data analysis with code
Why are so many people learning to program?
• Increasing data scale
• More and more data munging/integration
• Need for statistics and predictive analytics
• Building complex data visualizations
• Inadequacy of Excel or other UI-driven data tools
The Analytics Workflow
Acquisition → Preparation → Visualization → Analysis → Sharing
What do we care about?
•Minimize time to answer
•Ask more questions
•Reduce friction between tools and processes
•Team productivity
What can go wrong?
•Inefficient workflows lead to lower quality analysis
•Results may not be actionable in a reasonable time-frame
Three types of problems
•Tooling
•Workflow management
•Collaboration
For programmers, luckily it’s not 2005 anymore
•R: Hadley Wickham’s packages
•Python: pandas
•Hadoop: Pig
Data preparation with visual tools
•OpenRefine (formerly Google Refine)
•Google Fusion Tables
•Microsoft Excel
•Data Wrangler
Some new startups building data preparation tools
Business Intelligence: essential for doing business
BI macro-trends
•Self Service BI
•Visual Discovery
•SQL on Hadoop
Predictive analytics pitfalls
•Signal vs. Noise
• Identify the right patterns
•Uncertain ROI
Some analytics workflow problems still need work
Friction between tools: a typical scenario
•Excel and SQL for data wrangling
•Tableau for visualization
•SPSS/R for modeling
Data workflows as dependency graphs?
[Figure: dependency graph with nodes A, B, C, D, E, F]
Data workflows as dependency graphs?
CHRONOS
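A minimal sketch of treating a workflow as a dependency graph, using Python's standard-library graphlib (Python 3.9+). The node names A to F come from the slide, but the edges between them are not recoverable from the figure, so the `dependencies` mapping below is purely hypothetical, as is the helper name `run_order`.

```python
from graphlib import TopologicalSorter

# Hypothetical edges: each step maps to the set of steps it depends on.
# Only the node names A-F come from the slide; the edges are assumed.
dependencies = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
    "E": {"D"},
    "F": {"D"},
}


def run_order(deps):
    """Return one valid execution order in which every step follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
print(order)
```

Any order the sorter returns places a step after everything it depends on. A real scheduler would also handle timing, retries, and failures, which this sketch leaves out.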
Leveraging diverse skill sets
•Within teams, different competencies
•Work together on a data project: sharing code and data, tracking changes
Make an impact
•Getting results into the hands of the people who need them
•Getting models "into production"
Accessible data science...with training wheels
•http://datapad.io
•Founded in 2013, located in SF
• In private beta, join us!
•Hiring for engineering