Training a New Generation of Data Scientists

Josh Wills | Senior Director of Data Science

About Me

What Do Data Scientists Do?

What I Think I Do

What Other People Think I Do

What I Actually Do

The Emergence of Data Science

Data Storage in 2001: Databases

• Structured schemas

• Intensive processing done where data is stored

• Somewhat reliable

• Expensive at scale

Data Storage in 2001: Filers

• No schemas, stores any kind of file

• No data processing capability

• Reliable

• Expensive at scale

And Then, This Happened

Data Economics, Return on Byte

Big Data Economics

• No individual record is particularly valuable

• Having every record is incredibly valuable

• Web index

• Recommendation systems

• Sensor data

• Market basket analysis

• Online advertising

Enter Hadoop

The Hadoop Distributed File System

• Based on the Google File System

• Data stored in large files

• Large block size: 64 MB to 256 MB per block

• Blocks are replicated to multiple nodes in the cluster
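
To make the block and replication arithmetic above concrete, here is a minimal sketch in Python. It is an illustration only, not HDFS code: the function names, the 128 MB block size (chosen from the 64 MB to 256 MB range on the slide), the replication factor of 3, and the round-robin placement are all assumptions for the example. Real HDFS placement is rack-aware and considerably more involved.

```python
MB = 1024 * 1024


def block_count(file_size_bytes, block_size_bytes=128 * MB):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    # Ceiling division using integers only, to avoid float rounding.
    return -(-file_size_bytes // block_size_bytes)


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes.

    Hypothetical round-robin policy, used here only to show that every
    block ends up on several machines.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


if __name__ == "__main__":
    size = 1024 * MB  # a 1 GB file
    blocks = block_count(size)
    layout = place_replicas(blocks, ["node1", "node2", "node3", "node4"])
    print(blocks, "blocks")
    for block, holders in layout.items():
        print(block, holders)
```

With these assumed settings, losing any single node still leaves two copies of every block, which is the property the replication bullet describes.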

Simple, Reliable, Distributed Processing: MapReduce

• Map Stage: Embarrassingly parallel

• Shuffle Stage: Large-scale distributed sort

• Reduce Stage: Process all the values that have the same key in a single step

• Process the data where it is stored

• Write once and you’re done.
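
The three stages above can be sketched in a single Python process. This is a toy word count under stated assumptions, not the Hadoop API: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) are hypothetical, and an in-memory sort stands in for the distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter


def map_stage(record):
    """Emit a (word, 1) pair per word. Each record is handled independently,
    which is what makes this stage embarrassingly parallel."""
    for word in record.lower().split():
        yield (word, 1)


def shuffle_stage(pairs):
    """Sort the pairs by key and group the values of each key together.
    In Hadoop this is a large-scale distributed sort; here it is a local one."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_stage(key, values):
    """Process all the values that share a key in a single step."""
    return key, sum(values)


def run_job(records):
    mapped = [pair for record in records for pair in map_stage(record)]
    return dict(reduce_stage(key, values) for key, values in shuffle_stage(mapped))


if __name__ == "__main__":
    print(run_job(["big data", "big clusters"]))
```

In a real cluster the map calls would run on the nodes that already hold the input blocks, which is the "process the data where it is stored" point on the slide.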

Thinking Like a Data Scientist

Solving Problems vs. Finding Insights

Parallelize Everything

Abundance vs. Scarcity

Building Data Products

Create a Data Science Team

Choose Good Problems

Design the Model

Mind the Gap

Amortize Costs

Measure Everything

Rinse and Repeat

Work Like a Data Scientist

Train Like a Data Scientist

Hadoop Developer Training

Hive and Pig Training

Introduction to Data Science

Introduction to Data Science: Building Recommender Systems

http://university.cloudera.com/

• Submit questions in the Q&A panel

• Watch on-demand video of this webinar at http://cloudera.com

• Follow Josh on Twitter @josh_wills

• Follow Cloudera University @ClouderaU

• Thank you for attending!

Register now for Cloudera training at http://university.cloudera.com

Use discount code DSvideo_10 to save 10% on new enrollments in Cloudera-delivered training classes until June 1

Use discount code 15off2 to save 15% on enrollments in two or more Cloudera-delivered training classes until June 1
