Building Data Science Teams: A Moneyball Approach

1© Cloudera, Inc. All rights reserved.

Josh Wills | Senior Director of Data Science

About Me

A Team Building Exercise

Data Scientist Supply vs. Data Scientist Demand

Recruiting Techniques

Moneyball and Data Science

Choosing The Right Metrics

1. Analyzing “Unstructured” Data Sources

2. Building Machine Learning Models

3. Turn Static Reports Into Analytical Applications

Answering More Questions in Less Time

How To Answer Questions Like A Data Scientist

The Life of a Data Processing Job

1. Read and deserialize input data.

2. Project/filter input records.

3. Shuffle: serialize the data, send it over the network, deserialize it.

4. Apply aggregation logic.

5. Serialize output data.
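
The five steps of a data processing job can be sketched in a few lines of single-process Python. This is only an illustration, not code from the talk: the event records, the field names (`user`, `action`, `ms`), and the `run_job` function are all hypothetical, and the "network" is simulated by routing JSON strings into in-memory partitions.

```python
import json
import zlib
from collections import defaultdict

# Hypothetical input: one JSON-encoded event per line.
RAW_LINES = [
    '{"user": "a", "action": "search", "ms": 120}',
    '{"user": "b", "action": "click", "ms": 40}',
    '{"user": "a", "action": "search", "ms": 80}',
    '{"user": "b", "action": "search", "ms": 200}',
]


def run_job(lines, num_partitions=2):
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in lines]

    # 2. Project/filter: keep only searches and only the fields we need.
    projected = [(r["user"], r["ms"]) for r in records if r["action"] == "search"]

    # 3. Shuffle: serialize each record and route it to a partition by key.
    #    (A real system would send the payload over the network here.)
    partitions = defaultdict(list)
    for key, value in projected:
        payload = json.dumps([key, value])
        partition_id = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[partition_id].append(payload)

    # 4. Deserialize on the receiving side and apply aggregation logic.
    totals = defaultdict(int)
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload)
            totals[key] += value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "total_ms": v}) for k, v in sorted(totals.items())]


if __name__ == "__main__":
    for line in run_job(RAW_LINES):
        print(line)
```

Even in this toy version, three of the five steps are serialization or deserialization, which is the cost the following slides focus on.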

Handling the Cost of Serialization

The Traditional RDBMS Approach

The Cost of The Traditional RDBMS Approach

Query Scheduling and Exploratory Data Analysis

The Spark Approach

The Cost of the Spark Approach

The MapReduce Approach

MapReduce In The Hands of a Data Scientist

Example: Hive Multi-Insert

Our Goal: Public Transit for Questions

Data Modeling for Data Scientists

Motivating Example: Spelling Correction

Event Series Analytics

A Simple Star Schema for Spell Correction

The Combinatorial Explosion

Designing the Spell Correction Data Product

• What parameters does this model need…
  • during the analysis phase?
  • during deployment?

• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
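
As a rough illustration of the two candidate parameters named on this slide, the sketch below computes the lag time and a string similarity score for each pair of consecutive queries in a search session. The session data and the `candidate_features` function are hypothetical, and `difflib.SequenceMatcher` is just one convenient similarity measure, not necessarily the one used in the talk.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Hypothetical session: (ISO timestamp, query) pairs in time order.
SESSION = [
    ("2015-01-01T10:00:00", "moneybal"),
    ("2015-01-01T10:00:04", "moneyball"),
    ("2015-01-01T10:05:00", "hadoop"),
]


def candidate_features(events):
    """Return (lag_seconds, similarity) for each consecutive pair of queries."""
    features = []
    for (t1, q1), (t2, q2) in zip(events, events[1:]):
        # Lag time between the two events, in seconds.
        lag = (datetime.fromisoformat(t2) - datetime.fromisoformat(t1)).total_seconds()
        # Similarity of the two query strings, between 0.0 and 1.0.
        similarity = SequenceMatcher(None, q1, q2).ratio()
        features.append((lag, similarity))
    return features


if __name__ == "__main__":
    for lag, similarity in candidate_features(SESSION):
        print(f"lag={lag:.0f}s similarity={similarity:.2f}")
```

A short lag combined with high similarity suggests the second query may be a correction of the first; a long lag or low similarity suggests an unrelated search. Any thresholds would have to be tuned on real data.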

A Supernova Schema for Search

Spell Correction in SQL

Exhibit: http://github.com/jwills/exhibit

Querying Nested Types with Impala

Trading Up: From Data Analyst to Data Scientist

• Core Metric: # Outputs / # Jobs
• Measure on both an individual and aggregate level
• Drive the marginal cost of asking one additional question towards zero
• Point business analysts at output tables for interactive analysis with Impala
• Self-serve BI frees up resources (compute + data science time)
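
The core metric above is simple arithmetic, and a small sketch shows how it might be tracked at both the individual and the aggregate level. The job log format, the analyst names, and the `outputs_per_job` function are assumptions made for illustration; the slides do not say how the metric is actually collected.

```python
from collections import defaultdict

# Hypothetical job log: (analyst, job_id, number_of_output_tables).
JOB_LOG = [
    ("alice", "job-1", 1),
    ("alice", "job-2", 4),
    ("bob", "job-3", 1),
    ("bob", "job-4", 1),
]


def outputs_per_job(job_log):
    """Return ({analyst: outputs per job}, aggregate outputs per job)."""
    outputs = defaultdict(int)
    jobs = defaultdict(int)
    for analyst, _job_id, n_outputs in job_log:
        outputs[analyst] += n_outputs
        jobs[analyst] += 1
    # Individual level: each analyst's total outputs divided by their job count.
    per_person = {analyst: outputs[analyst] / jobs[analyst] for analyst in jobs}
    # Aggregate level: all outputs divided by all jobs (0.0 for an empty log).
    total_jobs = sum(jobs.values())
    aggregate = sum(outputs.values()) / total_jobs if total_jobs else 0.0
    return per_person, aggregate


if __name__ == "__main__":
    per_person, aggregate = outputs_per_job(JOB_LOG)
    print(per_person, aggregate)
```

In this made-up log, a job that writes four output tables in one pass raises its author's ratio well above one output per job, which is the behaviour the metric is meant to reward.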

Thanks!

@josh_wills