13
© 2016 ID Analytics, Inc. Confidential So You Want to be a Data Scientist Stephen Coggeshall April 21, 2016

R0 R247 R122 R250 R47 G114 G53 G143 G134 B119 B255 So · PDF file · 2017-07-31– Telecomunication – Social networks – Gaming – Government 6 © 2016 ID Analytics, Inc. Confidential

  • Upload
    vanthu

  • View
    216

  • Download
    1

Embed Size (px)

Citation preview

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

So You Want to be a Data Scientist

Stephen Coggeshall April 21, 2016

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

Outline

2

•  What, why big data

•  What is data science

•  What are the necessary skills

•  Description of the interview process

•  Description of the typical work for data scientists

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

WHAT IS BIG DATA Big Data is a term that describes large volumes of high velocity, complex and variable data that require advance techniques and technologies to enable the capture, storage, distribution, management and analysis of the information.

3

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

DATA SIZES ARE GROWING SUBSTANTIALLY

4

0

10,000

20,000

30,000

40,000

50,000

2010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

Global digital data (in Exabytes)

Sources: IDC

By 2020, digital records will be 44 times larger than in 2009

0

10

20

30

40

2009

2010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

The Digital Universe: 50-fold Growth from the Beginning of 2010 to the end of 2020

(000

’s)

(Exabytes)

Sources: IDC’s Digital Universe, sponsored by EMC, December 2012

0

400

800

1,200

1,600

2,000

2005

2006

2007

2008

2009

2010

2011

2012

2013

2014

2015

Enterprise Unstructured DataEnterprise Structured Data

Exabytes

Data growth exhibits “Moore’s Law” doubling every 1 to 2 years

1970 1980 1990 2000 2010Stor

ed D

igita

l Inf

orm

atio

n(E

xaby

tes)

• Digital data is growing at 62% yoy vs. structured data at 22%• 80% of this new data is digital data that is

complex to analyze in its native structure

Web Application Data

Business Transaction Data

78% CAGR 2011–2016

0.6 1.32.4

4.26.9

10.8

0

4

8

12

16

2011 2012 2013 2014 2015 2016

Exabytes per Month

0.24 0.51.2

2.2

3.8

6.3

01234567

2010 2011 2012 2013 2014 2015Tera

byte

s per

mon

th (M

M)

0

10,000

20,000

30,000

40,000

2006

2007

2008

2009

2010

2011

2012

2013

2014

2015

2016

2017

2018

2019

2020

Exab

ytes

Worldwide Corporate Data Growth

Structured data Unstructured Text

IT.com analyses the80% of data growth:

U nstructured text

Sources: IDC, The Digital Universe 2010

Huge Growth of Data

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

5

DATA IS COLLECTED EVERYWHERE

Proliferation of Sensors (IOT)

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

•  Why - exponential growth of data and compute power

•  What is data science? –  The scientific discipline around typically large and difficult data sets

to solve complex business problems

•  It is applied in many industries, including –  Financial services –  Healthcare –  Insurance –  Telecomunication –  Social networks –  Gaming –  Government

Data Science

6

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

Risk Management •  Scores for credit, fraud, bankruptcy, actuarial… Marketing •  Product offers, pricing optimization, customer analysis… Financial Engineering •  Econometrics, derivative pricing, algorithmic trading… Social Networks •  Network analysis, recommender systems, sentiment analysis… Operations Optimization •  Production lines, chemical processes, manufacturing… Transportation •  Traffic flow, smart grid, route/price optimization, logistics… Search •  Online searches, document matching, link analysis… Health •  Bioinformatics, drug discovery, image analysis… …

Examples of Data Science Applications

7

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

Business Analytics Hierarchy

Descriptive BI dashboards, reporting What’s going on?

Explanatory Why are things happening?

BI/BA analysis, slicing data different ways, diagnostics

Predictive What will happen? Forecasts, predictive models, machine learning

Prescriptive Optimization of actions, recommendations

What should I do with my current

businesses?

Strategic What new business opportunities do I

have?

New products & services around existing data, new data to acquire

Valu

e

Question How and what

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

•  Strong math and statistics

•  Strong programming & software best practices

•  Passion for data

•  Machine learning methods

•  Big data technologies (scalable file systems and computing

architectures)

•  Scientific exploration, discovery and testing discipline

•  Curiosity, skepticism, detail oriented

•  Desire to learn

Desired Data Science Background and Skills

9

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

•  Lots of data manipulation, cleaning, exploration, analysis, discovery. Frequently deal with messy data

•  Problem definition: –  Fully understand business problem and goals –  Understand details of the underlying data –  Frame the problem: time series? predictive model?

simulation? structure discovery? –  What data is available? What are the inputs/outputs?

•  Build preliminary models, iterate with business •  Finalize, implement •  Monitor, rebuild as needed

Typical Data Scientist Work

10

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

•  Solid foundation in basics: math, statistics, programming •  Buy a few books. Possibles:

–  Elements of Statistical Learning, Hastie, Tibshirani, Friedman –  Pattern Recognition and Machine Learning, Bishop –  Pattern Classification, Duda, Hart, Stork –  Foundations of Predictive Analytics, Wu, Coggeshall

•  Many excellent courses on Coursera. Chose the technical ones.

•  Internship if possible, but not necessary. •  Now you know the basics and can converse. Interview

for beginner position.

How to Get Into Data Science as a Career

11

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

•  Resumes selected for interviews •  Typically two phone interviews

–  Discuss resume, questions about ML, data, statistics, programming, math…

•  In-house interview –  Usually starts with a presentation, typically dissertation

work –  Series of individual and group interviews, ~45 mins –  Same Q’s as phone interview, but in more detail

•  The group meets next day, go around room, –  Technically adept, smart, experience, culture fit?

Description of the Interview Process

12

© 2016 ID Analytics, Inc. Confidential

Primary IDA Color Palette

R0 G53 B95

Navy Blue

R0

G114 B156

Blue

R122 G134 B142

Light Gray

R102 G49 B144

Purple

Orange

R191 G191 B191

R84 G148 B63

Green

R106 G154 B194

Light Blue

Secondary Color Palette - Medium

R47

G163 B255

R250 G187 B119

R147 G201 B129

R42 G197 B255

R154 G155 B157

R165 G113 B206

R165 G194 B218

Gray

R216 G216 B216

Navy Blue Blue

Light Gray Purple

Orange

Green Light Blue

Gray

R247 G143 B30

•  Sensors everywhere, data collection everywhere, compute power growing exponentially

•  Machine learning algorithms continue to grow in power and complexity –  Rise in deep learning, convolutional neural nets, distributed/

scalable machine learning

•  Businesses relying more and more on data science

•  The world is awash in data. Need people to transform data into intelligence into actions.

Summary: Data Science is a Surging Field!

13