62
Thinking Like A Data Scientist Em Grasmeder ThoughtWorks Data Witch @emilyagras

Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Thinking Like A Data ScientistEm Grasmeder

ThoughtWorks Data Witch@emilyagras

Page 2: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

About Me

● ThoughtWorks consultant

● Graduate research in economics○ I flunked out

● Generalist within data space

Page 3: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

<Data Science Identity Crisis>

Page 4: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

What Does a Data Scientist Do

● Predictions, categorization, clustering

Page 5: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

What Does a Data Scientist Do

● Predictions, categorization, clustering

● Write software

Page 6: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

What Does a Data Scientist Do

● Predictions, categorization, clustering

● Write software● Visualizations

Page 7: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

What Does a Data Scientist Do

● Predictions, categorization, clustering

● Write software● Visualizations ● magically fix your

business model??

Page 8: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

How is a Data Scientist UsefulModelsCode Visualization

??

?

?

??

?Business Value

Page 9: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Controversial opinions about data scientists● They should be good software and API developers● They should be competent at continuous delivery,

making and managing pipelines, and writing infrastructure as code

● They should speak the language of the business and be involved in conversations about KPIsIf not...

● They might not be very useful

Page 10: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

So how do data scientists actually think?

Page 11: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Let’s answer that question with a story about

Cholera!

Page 12: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Cholera Facts (yay!)Deadly bacteria that can kill within hours

The water in your body just comes out from everywhere

Page 13: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Cholera Facts (yay!)

Deadly bacteria that can kill within hours

The water in your body just comes out from everywherePretty much curable (90% of cases) with salty, sugary water that costs $0.10

Used to be a problem in London, is still a problem in some places

Page 14: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

The cause of cholera and how it

spreads was unknown 1854.

They thought it was “miasma”

literally, bad air

Page 15: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

🤔

Page 16: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

The Broad Street Cholera Outbreak of 1854

John SnowThe data

Page 17: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

💡

Page 18: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

London’s cholera outbreak

Page 19: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

<Awkward Switch To Jupyter Notebook>

Page 20: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

London’s cholera outbreak

Page 21: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

London’s cholera outbreak

John SnowThe data

Page 22: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Formally write your hypothesis

● H0 is called the Null Hypothesis● In 1850s England, the Null Hypothesis is “bad airs”

Page 23: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Formally write your hypothesis

● H0: Thing is normally distributedOR

● H0: Thing is uniformly distributed● H1: Thing is distributed differently because reason

* H1 is the same thing as HA

Page 24: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 25: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Refining hypotheses● Assume normal or

uniform distributions across population

● We know population isnot uniformly distributed

● What we might consider: Map infection-per-capita

Page 26: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Formally write your hypothesis● H0: People living in equally

odorous parts of town will have a uniform likelihood of contracting cholera

Page 27: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Preposterous!What if we actually talked to the poor

people?

Page 28: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Fun fact! The word statistics, in 1770 meant the "science dealing with data about the condition of a

state or community"

Page 29: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Collecting more data Workers at brewery were unaffected while their families diedSome children died while their families lived One woman was a complete outlier and the only person in her neighborhood to die

Page 30: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Formally write yourhypothesis● H0: People living in equally

odorous parts of town will have a uniform likelihood of contracting cholera

● HA: People contract cholera because they take water from certain wells

Page 31: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

The greatest value of a picture is when it forces us to notice what we never expected

to see. - Tukey “Exploratory Data Analysis” 1977

Page 32: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

So what are the lessons?Data is good. More data is better

Visualize your data!Do we really need machine learning for this?

Page 33: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Data Exploration: The first step towards using a model

John SnowThe data

Page 34: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Data Exploration

Page 35: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

The data

Your model ->

It’s, you… you’re the scientist in this

metaphor now

Thinking about the business

Page 36: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

1. You have the data, you know its shape, if feels natural in your hands.

2. You know the business value you want to deliver

Page 37: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Let’s talk about models

Page 38: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“In nearly every detective novel... there comes a time when the investigator has collected all the facts he needs for at least some phase of his problem. These facts often seem quite strange, incoherent, and wholly unrelated. The great detective, however, realizes that no further investigation is needed at the moment, and that only pure thinking will lead to a correlation of the facts collected. So he plays his violin, or lounges in his armchair enjoying a pipe, when suddenly, by Jove, he has it! Not only does he have an explanation for the clues at hand but he knows that certain other events must have happened. Since he now knows exactly where to look for it, he may go out, if he likes, to collect further confirmation for his theory.” - Einstein & Infeld, The Evolution of Physics 1938

Page 39: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Let’s talk about models

f(a, b, c, ...) + ε = y

Page 40: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Data is good. More data is betterTry to move as much as possible from the ε into the function

f(a, b, c, ...) + ε = y

Maybe b comes from an external APIMaybe c is too complicated and needs to be split into d and eMaybe g is derived from a function/calculation based on other records or parameters a and b

Page 41: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Data is good. More data is better. Unless it’s not

f(a, b, c, ...) + ε = y

Sometimes b and c are just confusing the algorithmMethods of dimensionality reduction or principle component analysis help extract a signal from noise, and help prevent overfitting

Page 42: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

In any case, you have to understand how you want to get value out of your f(a, b, c, ...) + ε = y

f(a, b, c, ...) + ε = y

Page 43: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Linear Regression Invented“New Methods for Determination of the Orbits of Comets”, - Adrien-Marie Legendre 1805

Page 44: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Double Helix Discovered (1953)

Alan Turing Theorizes about Genetic Programming(1950)

Forsyth writes code that evolves(1981)

Page 45: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Neuron Doctrine (1894)

McCulloch and Pitts lay theory for Neural Networks(1943)

Rosenblatt implements first neural network(1958)

Page 46: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 47: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

- George Box(“one of the great statistical minds of the 20th century”)“Empirical Model Building and Response Surfaces”, 1987

Page 48: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

- George Box(“one of the great statistical minds of the 20th century”)“Empirical Model Building and Response Surfaces”, 1987

Page 49: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

- George Box(“one of the great statistical minds of the 20th century”)“Empirical Model Building and Response Surfaces”, 1987

Page 50: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

- George Box(“one of the great statistical minds of the 20th century”)“Empirical Model Building and Response Surfaces”, 1987

Page 51: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

- George Box(“one of the great statistical minds of the 20th century”)“Empirical Model Building and Response Surfaces”, 1987

Page 52: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

“Remember that all models are wrong; the practical question is how wrong do they have to be to not be useful.”

- George Box(“one of the great statistical minds of the 20th century”)“Empirical Model Building and Response Surfaces”, 1987

Thinking like a data scientist means making

pragmatic choices

Page 53: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Thinking like a data scientist means making

pragmatic choices

Page 54: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 55: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 56: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Thank you!

I’m Em Grasmeder(@emilyagras)

Page 57: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Let’s talk about models

Page 58: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical

Models The first step towards using a model

Page 59: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 60: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 61: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical
Page 62: Thinking Like A Data Scientist - GOTO Conference · how wrong do they have to be to not be useful.” - George Box (“one of the great statistical minds of the 20th century”) “Empirical