SDS PODCAST EPISODE 367: BUILDING DATA PIPELINES FOR COVID-19 MODELING

Kirill Eremenko: This is episode number 367 with Astrophysicist and

Online Data Science Instructor, Sam Hinton.

Kirill Eremenko: Welcome to the SuperDataScience podcast. My name

is Kirill Eremenko, Data Science Coach and Lifestyle

Entrepreneur. Each week, we bring inspiring people

and ideas to help you build your successful career in

data science. Thanks for being here today and now,

let's make the complex simple.

Kirill Eremenko: Welcome back to the SuperDataScience podcast

everybody. Super pumped to have you back here on

the show. Today, we're hosting Sam Hinton, who's

returning for the second time round. The first time he

was on this podcast was in episode number 303 in

October 2019 where we talked about hypothesis

testing and what it means for the world of data science

along with other topics. That episode was hilarious. I

highly recommend checking it out. That's SDS 303.

This one is going to be super fun as well.

Kirill Eremenko: Sam Hinton is always fun to talk to. He's got a great

personality and very outgoing and loves to share

things. I had a lot of laughs and I'm sure you're going

to have a lot of laughs along the way with us. What is

happening in Sam's life? What did we talk about?

Number one, very important, I think you're going to be

very interested in this is that Sam is the lead data

analyst for the COVID Critical Care Consortium, which

is one of the largest studies in the world right now

looking into COVID-19 and what is happening to

people who end up in critical care, things like

ventilation and other factors.

Kirill Eremenko: You will get a lot of interesting thoughts from a data

scientist who's actually working, like spearheading

this direction working with other scientists in over 100

or approximately 100 different countries around the

world. You'll find out what they're looking into. In

addition, Sam will talk about some of the challenges of

data like what are the real world challenges that data

scientists face?

Kirill Eremenko: Right now, he's facing all of this data that's coming in,

that is inaccurate or maybe incomplete. In many

cases, it has to be cleaned up, has to be

normalized. Lots of pre-work on the data has to

happen and you'll find out how he's building this data

pipeline and what it means. We'll be talking quite a bit

about data pipelines. Very, very interesting. I'm sure

everybody can get a lot of value out of this.

Kirill Eremenko: We'll also talk about data modeling, Bayesian statistics

and DataScienceGO Virtual and how Sam will be

joining us to run a workshop there. Make sure to

listen in on that, that's going to be very cool and

maybe that workshop will be right for you. At the end,

we'll talk about astrophysics. You'll find out some cool

things about dark energy and dark matter. Super

exciting, super fun podcast. I can't wait for you to

check it out.

Kirill Eremenko: Without further ado, let's dive straight into it. I bring

to you, Sam Hinton, astrophysicist and online data

science instructor.

Kirill Eremenko: Welcome back to the SuperDataScience podcast

everybody. It's super fun to have you back here on the

show. Today, we got a very special guest, Mr. Sam

Hinton. Sam, welcome back. How are you doing?

Samuel Hinton: Thanks for having me again. It's always a pleasure.

Kirill Eremenko: Fantastic, man. Second time, how long was it the last

time? The last time was like, what was it, eight months

ago or so we chatted?

Samuel Hinton: I have no idea. More than a week which means I barely

remember it.

Kirill Eremenko: How have you been since then? Things going well?

Samuel Hinton: Things have been hectic. I'm sure that there's a lot of

people where they've lost their jobs and things aren't

hectic. A lot of other people in recent days who now

have 10 times the workload and don't know when to

sleep. I guess I'm lucky to be one of the second group.

Kirill Eremenko: Yeah. No, that's good. Why did you not know when to

sleep? What's happening in your world?

Samuel Hinton: I've got my normal job. I'm a postdoc at the University

of Queensland. I'm trying to lead the Dark Energy

Survey, Supernova Cosmology Analysis. Lots of fun,

astrophysics science that I'm supposed to be doing. I

am now also the lead data analyst for the COVID

Critical Care Consortium, which is as of writing right

now, I think, the largest international study on

COVID-19 in the world, specifically looking at things

like ventilation and all this stuff that we know is

quite difficult with COVID.

Kirill Eremenko: Wow, that's pretty cool. First of all, what countries are

in that consortium?

Samuel Hinton: We've got almost 50 countries now. I know the US

signed on weeks ago. Now that all the legal agreements

are in place, their hospitals are coming online. We've

got data from Estonia, from Kuwait, from almost, well,

a whole ton of European countries apart from France.

France has their own study and they're not joining

ours. We don't have any from Russia either. Almost

every other country is signing up. We've hit almost 50

countries, hundreds upon hundreds of different

hospitals sites. Soon, the data should start pouring in,

we hope.

Kirill Eremenko: Okay, but tell me like how did you get this job? Out of

all data scientists in Australia or in the world, I don't

know even, how did you get this?

Samuel Hinton: Mostly, luck. A lot of things in life are luck or just

being in the right place at the right time. It turns out

that in this giant collaboration, the data is hosted at

Oxford. Oxford and the parent company overseeing

this study are sort of like the big players. The

University of Queensland has an agreement with

Oxford. People were looking around at UQ for someone

who could do it. They had all these issues with the

data and they needed essentially someone that could

help out on the machine learning side of things, on the

visualization side of things, on the data pipeline side of

things.

Samuel Hinton: People are just going around saying, "Who has

experience?" One of the project, the head machine

learning investigator, talked to my supervisor and my

supervisor, my postdoc supervisor, said, "Oh, well,

why don't you talk to Sam. He's got previous

experience in all these areas." She talked to me and

then an hour later, she said, "Okay, I want you to lead

this." I said, "Oh boy, this is a lot. Are you sure?" She

said, "Yes." That's been my life for the past couple of

months.

Kirill Eremenko: Yeah, that's really cool. How does this all work? How

big is this whole team? Are you the only lead data

scientist? Are there other lead data scientists, like

other data scientists in different countries, that you're

responsible for Australia? How does this all work?

Samuel Hinton: Yeah. It's a bit of a complicated, I'm not going to say

mess, but it's a very spaghetti-like situation here,

simply because we're dealing with medical data. As

soon as you deal with medical data, there's a whole

bunch of confidentiality and privacy agreement, things

that you have to take into account. Because I was

essentially the first person that got access to the data,

they've now come down and said, "Look, we don't want

anyone else accessing the data."

Samuel Hinton: I'm one of the very few people in the world now that

can get the raw data. One of my jobs is to take this

raw data, run it through a data cleaning

standardization and de-identification pipeline that I

built. Then, distribute those data products to

specifically UQ researchers. We are getting other

universities in. We have, for example, in Brisbane

researchers from QUT. They get added to the UQ

system and then they can access the data.

Samuel Hinton: On top of that, we have companies that have reached

out and said, "Hey, we want to help with the machine

learning. We want to help with this or that." We've had

Amazon and IBM and we're working with both of them

right now and then Fast AI, a whole bunch of

companies. The big issue is that, it's sensitive data. It's

not something that you can just upload onto Kaggle

and have an open source Kaggle competition. You

can't do that.

Samuel Hinton: There's a lot of people that have offered to help and

they're simply unable to because we haven't got data

sharing agreements with them. We would be in very

hot water if we gave them the data. There are data

science ...

Kirill Eremenko: You de-identify the data?

Samuel Hinton: Yes. The issue is, even if it's been de-identified what

you normally have is then a security team comes in,

takes your de-identified data and they want to see if

they can break it, if they can re-identify patients. The

issue is, that step takes a little while to do. We've had

some preliminary groups at UQ say like, "These are the

variables that are quasi identifiers and if you combine

it with social media data, we may be able to re-identify

a person doing this or that."

Samuel Hinton: Until everything's like proven 100% good enough, it's

hard to even share de-identified data. We're moving

towards that section. Obviously, as soon as these

things come in and as soon as there are legal

agreements and everything in the way, it's no longer

just like a one or two-day task. It's back and forth

between legal teams and things slow down.
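
The re-identification concern above can be made concrete with a k-anonymity style check: count how many records share each combination of quasi-identifiers and flag the rare combinations. What follows is only a minimal sketch with toy records. The field names (`age_band`, `sex`, `country`) are hypothetical, and this is not the consortium's actual vetting procedure.

```python
from collections import Counter


def combination_counts(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[field] for field in quasi_identifiers) for record in records
    )


def smallest_group_size(records, quasi_identifiers):
    """Size of the rarest combination (the 'k' in k-anonymity)."""
    return min(combination_counts(records, quasi_identifiers).values())


def risky_combinations(records, quasi_identifiers, k=2):
    """Combinations shared by fewer than k records, i.e. easier to re-identify."""
    counts = combination_counts(records, quasi_identifiers)
    return sorted(combo for combo, n in counts.items() if n < k)


# Toy, made-up records; the field names are hypothetical.
records = [
    {"age_band": "60-69", "sex": "M", "country": "AU"},
    {"age_band": "60-69", "sex": "M", "country": "AU"},
    {"age_band": "30-39", "sex": "F", "country": "EE"},
]
fields = ["age_band", "sex", "country"]
k_value = smallest_group_size(records, fields)
risky = risky_combinations(records, fields, k=2)
```

A unique combination is exactly the kind of record that could be matched against outside data such as social media, which is why a check like this is only a first pass before a proper security review.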

Kirill Eremenko: Have you heard of the Netflix Prize on Kaggle?

Samuel Hinton: There's a Netflix Prize? I haven't.

Kirill Eremenko: There was like years ago, years ago, when Netflix,

which was I think it's like 2000. Oh, my God, I don't

even remember like I don't remember that. Maybe,

years ago basically. Netflix went on Kaggle. They

posted like their de-identified data for people to have

this competition. The prize was a million dollars to

build a recommender system. Or the prize pool was a

million dollars to build a recommender system that

predicts the best way possible what movie you want to

watch next, what show. It was successful. They're

going to launch it.

Kirill Eremenko: I think in 2015, they were going to launch the Netflix

Prize number two, but then somebody wrote a

research paper in the US showing that he could

identify the people from Netflix Prize data by

combining parameters in certain ways. A lot of people,

I think, wanted or either did launch a class action

lawsuit against Netflix for that. How crazy is that?

Yeah.

Samuel Hinton: Yeah. It's definitely a place you don't want to take the

risk.

Kirill Eremenko: Yeah. Hope you're enjoying this amazing episode. I've

got a cool announcement for you and we'll get straight

back to it. Virtual Data Science Conference. Curious?

You've probably heard of DataScienceGO, the

conference that we've been running for the past three

years in Southern California. Maybe you've attended, if

so, it was super cool to have you there. Maybe you

weren't able to attend for the reason of being in a

completely different country, or the flights were too

long, or the timing wasn't perfect. There could be

plenty of reasons why you weren't able to attend. Now,

we're bringing DataScienceGO to you.

Kirill Eremenko: This June, we're hosting DataScienceGO virtually and

you can attend and get an amazing experience there.

Guess what, the best part is that it's absolutely free.

Just head on over to datasciencego.com and get your

tickets today. This will be our very first time running a

virtual event. Nevertheless, we're still going to combine

the three key pillars of fun, amazing talks and

networking into this event. You'll hear from speakers

like John Krohn, Sam Hinton, Hadelin de Ponteves,

Stephen Welch, and many others.

Kirill Eremenko: Plus, you'll be able to network with your peers. This

event is going to be epic on all fronts and we'd love to

see you there. Head on over to

datasciencego.com/virtual and get your ticket today.

The number of seats is limited. We'd love to have

everybody there. For our very first event, we're limiting

the number of seats to make it more manageable.

Make sure to get your tickets today, if you want to be

part of this. On that note, I look forward to seeing you

there. Now, let's get back to this amazing episode.

Kirill Eremenko: Yeah, okay. All right. You get all this data. What is this

data science pipeline? Tell us about it. Of course, by

the way, for everybody listening, none of this is

medical advice. We're going to as much as possible

avoid ... we are going to avoid sharing sensitive

information that Sam cannot share on this podcast.

Most importantly, none of this is medical advice. If you

hear anything like the coronavirus, it's opinions only.

With that caveat, what's this data science pipeline?

What goes into this process of building one? Why do

you need one in this specific case?

Samuel Hinton: Right. There are a few things to take into account when

we're talking about this specific study. The first is that

the database in the system where the data is gathered

was written by a very smart and very talented UQ

researcher. I won't give you the name because I'm sure

... Respect. I want to respect his privacy and people

will end up emailing everyone over everything. He ...

Kirill Eremenko: I'm sorry, just to add. UQ is University of Queensland

in Australia ...

Samuel Hinton: Yes.

Kirill Eremenko: One of the top universities in Australia.

Samuel Hinton: Yes, good clarification. I tend not to define my

acronyms straight, that comes from my astrophysics

roots. He's made this great database and it's used for a

whole bunch of different medical studies, including the

one that I'm working with which is the COVID Critical

Care Consortium. It was originally named

ECMOCARD, if that rings a bell to anyone listening.

Samuel Hinton: What it means is, it was a very generic way and the

doctors go and they get CRFs. Essentially, they print

out a whole bunch of sheets of paper where they write

down the details, and then someone goes through it

and uploads it into this database. The issue is that

there are very few checks done on the database. It was

written to be general by a person by himself essentially

and a long time ago too.

Samuel Hinton: It wasn't written this year with all the modern

frameworks. It's a fairly old system. That means that

when the data comes back, we have very little

guarantee on what the data should look like. Dates

don't have to be dates. Numeric values can come

through filled with strings or letters. That's the easy

part to identify because at least we know things

should be numbers. If it comes through, we have for

example, 107, but the 0 in the 107 is actually the degree symbol

from degrees centigrade.
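
A minimal sketch of the kind of numeric check described above, using only the Python standard library. It parses a raw field strictly and flags anything suspicious, such as a degree sign hiding inside a number, rather than guessing. Treating a single comma between digits as a decimal comma is an assumption made here for illustration, and the function name is hypothetical.

```python
import re

# Strict pattern: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"[+-]?\d+(?:\.\d+)?")


def parse_numeric(raw):
    """Return (value, issue): value is a float or None, issue is None or a reason."""
    text = "" if raw is None else str(raw).strip()
    if not text:
        return None, "missing"
    # Assumption: a single comma between digits is a decimal comma ("36,6").
    candidate = re.sub(r"(?<=\d),(?=\d)", ".", text, count=1)
    if _NUMBER.fullmatch(candidate):
        return float(candidate), None
    # Letters, degree signs and stray symbols are flagged, not silently repaired.
    return None, f"not numeric: {text!r}"


# Made-up raw entries; "1\u00b07" contains a degree sign between two digits.
raw_values = ["98.6", "36,6", "1\u00b07", "n/a", ""]
parsed = [parse_numeric(value) for value in raw_values]
issues = [(raw, issue) for raw, (_, issue) in zip(raw_values, parsed) if issue]
```

Returning the issue alongside the value keeps a list of problems that can be sent back to the sites instead of dropping the records.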

Samuel Hinton: There's a lot of weird issues like that simply because

we're mixing European keyboards and non-European

keyboards. Then, even when you get that down, there

are things that haven't been validated. Like, we want

to take patient records every day so that we can track

the evolution. Sometimes, we have two or three

records for the same day. It's like, I've entered the date

row, but there's no validation on that.
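
The missing validation on daily records can be approximated after the fact by counting how often each patient and date pair appears. A minimal sketch, assuming hypothetical column names `patient_id` and `date` and toy rows:

```python
from collections import Counter


def duplicate_days(rows):
    """Return sorted (patient_id, date) pairs that occur more than once."""
    counts = Counter((row["patient_id"], row["date"]) for row in rows)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Toy rows with made-up values; the column names are hypothetical.
rows = [
    {"patient_id": "P001", "date": "2020-04-01", "value": 1.0},
    {"patient_id": "P001", "date": "2020-04-01", "value": 2.0},
    {"patient_id": "P001", "date": "2020-04-02", "value": 3.0},
    {"patient_id": "P002", "date": "2020-04-01", "value": 4.0},
]
duplicates = duplicate_days(rows)
```

How to resolve a duplicate (keep the latest entry, average, or query the site) is a separate decision that this sketch deliberately leaves open.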

Samuel Hinton: Then, even if you do all of that, you now have

hundreds of hospitals from dozens of countries and

they use different units for everything. You get a

whole range of numbers coming in. For a lot of the

cases, you know what the units are and you just do

basic unit conversion. Some of the fields don't have

the unit as an input. You have to try and infer from the

ranges what the unit should be. That's tricky to do

because in medical data, things like lymphocyte

counts can span four orders of magnitude in a living

patient.
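
Inferring a unit from the range of a value can be sketched as follows, under the assumption that each candidate unit has a plausible range. The temperature ranges below are illustrative placeholders, not clinical guidance. The function refuses to guess when zero or several units fit, which is the situation described above for quantities that span several orders of magnitude.

```python
# Illustrative, assumed plausibility ranges; not clinical reference values.
TEMPERATURE_RANGES = {"celsius": (25.0, 45.0), "fahrenheit": (77.0, 113.0)}


def infer_unit(value, plausible_ranges):
    """Return the only unit whose range contains value, or None if ambiguous."""
    matches = [
        unit for unit, (low, high) in plausible_ranges.items() if low <= value <= high
    ]
    return matches[0] if len(matches) == 1 else None


def to_celsius(value, unit):
    """Convert a temperature to degrees Celsius."""
    if unit == "celsius":
        return value
    if unit == "fahrenheit":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown unit: {unit!r}")


unit_a = infer_unit(37.2, TEMPERATURE_RANGES)
unit_b = infer_unit(98.6, TEMPERATURE_RANGES)
unit_c = infer_unit(60.0, TEMPERATURE_RANGES)  # fits neither range
converted = to_celsius(98.6, unit_b)
```

When the candidate ranges overlap, `infer_unit` returns `None` and the value has to be resolved some other way, for example by looking at the other values from the same hospital.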

Samuel Hinton: How are you supposed to deal with that? Then, on top

of all of that, because the data is filled in from

someone writing down off a piece of paper, it's highly

incomplete, around 80%. If you just turn this into a 2D

data frame, around 80% is missing. That's a huge

problem for imputation especially because in this

current, like right now, when I'm talking, we don't

have that many records.
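
The "around 80% is missing" figure corresponds to the fraction of empty cells once the records are laid out as a 2D table. A minimal sketch of that calculation on toy rows, with hypothetical column names and made-up values; treating `None` and the empty string as missing is an assumption:

```python
def is_missing(value):
    """Assumption: None and the empty string both count as missing."""
    return value is None or value == ""


def missing_fraction(rows, columns):
    """Fraction of all cells (rows x columns) that are missing."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    missing = sum(1 for row in rows for col in columns if is_missing(row.get(col)))
    return missing / total


def missing_by_column(rows, columns):
    """Per-column missing fraction, useful for deciding what can be imputed."""
    if not rows:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if is_missing(row.get(col))) / len(rows)
        for col in columns
    }


columns = ["age", "ph", "lymphocytes", "temperature", "smoker"]
rows = [
    {"age": 64, "ph": None, "lymphocytes": None, "temperature": "", "smoker": None},
    {"age": None, "ph": 7.31, "lymphocytes": None, "temperature": None, "smoker": None},
]
overall = missing_fraction(rows, columns)
per_column = missing_by_column(rows, columns)
```

The per-column view matters for imputation: a column that is missing everywhere carries nothing to impute from, whatever strategy is chosen.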

Samuel Hinton: We have hundreds of patients and less than 100 if we

count just those where we know whether they

survived and were discharged alive, or whether they didn't

survive and succumbed to coronavirus. With so little

data, how are you supposed to do effective imputation?

We have to have multiple strategies that we then need

to try and vet. All of that needs to happen every day.

We download the data, clean the data, standardize it,

and try out a bunch of things every single day so that

we can go back to the ICUs, back to the hospitals

when we need and say, "Hey, I think you've put this in

wrong here or maybe this is a really cool, really

interesting novel result."

Samuel Hinton: All of this needs to happen in a very quick, a very

automated fashion to make sure that we can get the

results back as quickly as possible.

Kirill Eremenko: Wow. I can just imagine the doctors, it's like a

battlefield for them. They're running around trying to

save people's lives. The last thing they care about or

last thing that's on their mind is to sit down and

properly, carefully input all the data for Mr. Sam

Hinton in Australia.

Samuel Hinton: Yeah. It's a difficult sell, isn't it? Because especially

writing things down on a piece of paper that someone

then has to copy in. It's just not an efficient way of

doing it. It's one of the things that Amazon reached out

and we said, "Hey, you're well-suited to this, gathering

data and using it." Obviously, there's a whole bunch of

privacy concerns when you decide to bring in a large

corporation.

Samuel Hinton: There's all the legal issues there where we have to be

very careful about whether or not they can actually

have the data at the end. The answer is, no, right?

They will help us gather the data and then the idea is

they don't get access to it. Then, it's okay, do we

develop an app? Do we try and set up Alexa so that the

doctors can simply read out the values into their

phone and it will populate the form for them using

natural language processing?

Samuel Hinton: There's a whole bunch of concerns there but even

then, the doctors don't have time to go through and

even just read out what the values are. Countries like

Germany, those that haven't been massively afflicted

yet as in those that haven't broken through their

capacity in the hospital system, are doing things like

getting student doctors and med students to go out to

the hospitals. They just drive from hospital to hospital.

They take the paper and they enter it into the

databases.

Samuel Hinton: They go around and pick out the values and record

them and enter them. In many countries, there's simply

absolutely no chance of getting the people that are on

the frontlines trying to keep people alive to take a little

break from that to do some data entry. It just doesn't

make sense.

Kirill Eremenko: Yeah. Yeah. Okay. All right. Once you have all this

data processed and cleaned, what you're saying is, you

have to do this every day and there's no way for you to

automate it completely that all these checks happen

automatically?

Samuel Hinton: Yeah. At the moment, every day we regenerate our

data products. Every day, we regenerate a new list of

issues that we then go back to the clinicians with and

say, "Hey, by the way, this value from this day looks a

bit funky, can you please double check that or is that

a legit value?" Obviously, that gets sent back, not

every day. We don't want to overwhelm the sites but

the more egregious errors, the things that we can't fix

ourselves, get sent back. Essentially, the only time we

do send them back is when we're losing data.

Samuel Hinton: There are some fields that we simply need. For

example, when was the patient admitted to ICU? We

want to track their evolution over their stay in the ICU,

which means, if we don't have when they were

admitted, we don't know at what point they fall in the

timeline, and we can't use that data. It seems a

shame, like all we need is this one variable for you to

fill out and then, we can use the 400 other variables

that you put in for this patient. Then, we go back to

them. Then, the other thing that they want is every

day, they don't just want issues. No one wants just

bad news every day.

Samuel Hinton: We generate daily reports for them. This simply comes

down to some Jupyter Notebooks essentially that are

automatically generated and converted to HTML

documents and they will have interactive plots, where

we can show them basic statistics and demographics

of their patients. We know that COVID-19 affects more

men than women and it affects older people worse

than it affects younger people. We want to keep up-to-date

the statistics that they have, saying, okay,

what are the risk factors? Is arterial hypertension, like

high blood pressure, a large risk factor or not? How

about smoking? How about obesity, diabetes, all of

these different things?

Samuel Hinton: Then obviously, treatment. One of the big questions

is, what is the difference in outcome when you

look at different treatments? Who are on antibiotics?

Who are on antivirals? Which antivirals are they on?

Which antivirals have different ratios of success versus

failure? All of that is data that we try and generate

every day into an HTML document that we can then

send back to the clinicians.
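
One common way to turn a Jupyter Notebook into an HTML report is the `jupyter nbconvert` command line tool. The sketch below only builds and, if asked, runs that command. The notebook and directory names are hypothetical, and whether the consortium's pipeline invokes nbconvert this way is an assumption.

```python
import subprocess
from pathlib import Path


def build_report_command(notebook, output_dir):
    """Build the nbconvert command that executes a notebook and exports HTML."""
    return [
        "jupyter", "nbconvert",
        "--to", "html",        # export format
        "--execute",           # re-run every cell against the latest data
        "--no-input",          # hide code cells so readers only see results
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def generate_report(notebook, output_dir):
    """Run the conversion; requires Jupyter and nbconvert to be installed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_report_command(notebook, output_dir), check=True)


# Hypothetical file names, for illustration only.
command = build_report_command(Path("daily_report.ipynb"), Path("reports"))
```

Scheduling `generate_report` once a day (for example with cron) would give the regenerate-every-day behaviour described above.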

Kirill Eremenko: Interesting. You said you don't have that many, like

you already have hundreds of patient records that are

complete and you know the full story, wouldn't those

insights be statistically not significant? Like, if you're

inferring that?

Samuel Hinton: Yeah. It's a big problem, which is that in some cases,

we have the numbers. If you're not looking at the

outcome, if you just want to say, "Hey, what are the

demographics of people that are being admitted?" You

don't know the outcome. We can say some things there

but then, we can't say a lot. This is one of the reasons

why we have to be very careful about what we give to

clinicians and medical doctors because we don't want

to mislead anyone. We don't want to cause any

unnecessary harm.

Samuel Hinton: We have, for example, some clear trends in some, let's

say, pH. Your blood pH levels were broken down into

those that survived and those that didn't. At the

moment, the trends look very different but we can't

give that to the clinicians because if you look under

the hood, at the number of patients that are being

used to generate those trends, it's a tiny, tiny number.

If we give that away, we aren't confident that we've

accounted for the difference in country and ethnicity

and all these factors that differ across the patients

because we don't have a representative sample.

Samuel Hinton: It's something that we always have to keep in mind.

There are currently a whole bunch of data products

and things that we are simply hiding so that we can

see them when we're developing products like the

dashboard, like the daily report. We can't make them

public because the chance that they would mislead

people is simply too high because without the

knowledge of statistics that would help inform the

validity and the confidence of those trends, it's very

easy to make a mistake.

Kirill Eremenko: Got you, man. I'm so glad that you're doing this. Out

of all the people in the world because I remember in

our first podcast, you stressed very strongly that given

the 95% rule for frequentist statistics, the P value of 0.05

is not sufficient. That means, like ... this was your

quote and I've used it many times, 1 out of 20 research

papers out there is wrong. Every 20th research paper

is incorrect simply because we agree that a 5%

significance level is sufficient.

Kirill Eremenko: I can just imagine how rigorous you are about not

misleading people and misleading doctors here would

cost people's lives. You have to be very careful.

Samuel Hinton: Yeah. It's so easy to happen because I think we have

around 450 variables. Imagine if 1 in 20 of them are

wrong, and we've drawn conclusions, that's almost 20

different hypotheses that we could incorrectly give if

we just decided, "Hey, P value 0.05, good enough, ship

it out." You'll notice that if you look through all the

papers that are currently being published on COVID-

19 and especially some of the early ones, they're done

on cohort studies of three, four or five people.
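
The arithmetic behind the multiple-testing worry is short enough to write down. A minimal sketch, assuming every null hypothesis is true and the tests are independent, both simplifications; the Bonferroni correction is shown as one standard remedy, not necessarily the one used in this study.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_variables = 450   # figure quoted in the conversation
alpha = 0.05

expected = expected_false_positives(n_variables, alpha)
family_wise = prob_at_least_one_false_positive(n_variables, alpha)
threshold = bonferroni_threshold(n_variables, alpha)
```

With 450 variables tested at 0.05, the expected count of spurious findings is about 22.5, in line with the "almost 20" quoted above, and at least one false positive is all but guaranteed unless the per-test threshold is tightened.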

Kirill Eremenko: Five people.

Samuel Hinton: Yeah. There was a study in The Lancet with a patient

count of five. It's like, okay, well, it's good. You got to

get these things out. There's no time to sort of dilly-dally

on it but at the same point, like, can we trust it?

You don't know.

Kirill Eremenko: Yeah. Yeah. Interesting. Okay. What happens next?

You do these reports back and forth, how's the

workflow? Do you guys like have meetings? Do you ... I

don't know. Do you have like a vision? Is there some

leadership?

Samuel Hinton: I have had around six hours of meetings today. Yes,

there are meetings. There are meetings between the

different companies we're trying to help out. There are

weekly meetings with the PIs and myself to try and

determine directions.

Kirill Eremenko: What's a PI?

Samuel Hinton: The project investigator.

Kirill Eremenko: Okay.

Samuel Hinton: One of the people leading the project. There are

meetings every Thursday with the clinicians. There are

meetings every Friday with the UQ researchers, who

are trying to apply models onto it. In terms of where to

go in the future, hopefully less meetings. I can't see

that actually happening. There will always be too

many meetings. What we want to do is, once we get

more data, once we can actually be a little bit more

confident in the results that we're getting, hopefully,

we can do some interesting things with it.

Samuel Hinton: At the moment, we've been doing things like

generalized linear models and Cox regression and a

bunch of other little statistical tests to try and answer

some of the queries that the clinicians have. The

other thing we want to do is use unsupervised learning

to see if we can cluster the patients because one of the

current questions with COVID-19 is, are there

separate phenotypes? Are there multiple variants of the

virus going around? Do they present differently?

Kirill Eremenko: Like mutations?

Samuel Hinton: Yeah, essentially. Yeah. There's been some marginal

evidence so far published in papers that yeah, it looks

like there might be multiple phenotypes. We want to

see, can we cluster our results? Do our results

indicate that there might be subgroups? Again, hard to

do with only a few hundred records. Then, we also

want to figure out things like causal modeling. This is

obviously a big issue especially in the medical field

which is, let's say, you notice a trend in some sort of

variable. We don't know is that trend because of

COVID? Is that trend because of the medication? Is

that trend because of any one of 400 billion different

things? You're not quite sure.

Samuel Hinton: You want to see if you can construct causal nets to

determine exactly how the conditional probabilities in

your model lie to see what is actually driving these

trends. Of course, it's extremely complicated especially

in medicine where each patient gets treated

individually. They get treated based on how they're

presenting. It's not like you have a control group that

just gets run through with the same treatments in the

same way. If someone presents differently, they get

treated differently and it's so hard to standardize the

results.

Kirill Eremenko: How are you going to do that? That's a very important

question not just in this application, but in other areas

of life whether it's business or marketing or product

supply chains, like whether even, there's always going

to be these external factors. As we know, correlation

doesn't imply causation. Do you have any tricks you

can share that you think might work?

Samuel Hinton: The trick that we're trying to rely on right now is one

that isn't applicable to anyone else, which is we're

going back to the clinicians. There are obviously

hundreds of years of medical advice and medical

studies out there that we can try and make use of to

say in other different ... if you don't take COVID, if you

take flu or SARS-like viruses, how do they normally

present? What are the non-causal features more so in

those different pathogens or viruses?

Samuel Hinton: On top of that, I can't think of a good way to explain it.

It mostly just comes down to being a very thorny issue

that we haven't fully solved. There is no ideal solution.

Obviously, you can use clinicians to help inform that

but you can also just use a bunch of different models.

One of the things is, we want consistency in our model

outputs across models. You don't just want to run

some stupid random forest to get a result and just

ship it out. You want a whole bunch of different

models to agree so that you have confidence in it.

Samuel Hinton: Then, you want to use explainability and

interpretability techniques so that for every model that

you've done, you can actually identify why that model

is saying the things that it does. This is something like

Shapley values or just looking at the weights of each

decision tree in a random forest. What is contributing

to the final answers? So that you can try and hopefully

get a consistent idea of the causal effects in all of your

models. You hope that they agree.
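
To make the Shapley-value idea concrete, here is a minimal Python sketch that computes exact Shapley values for a toy three-feature model by averaging each feature's marginal contribution over every ordering of the features. The model `toy_model`, the input `x` and the `baseline` are hypothetical and are not taken from the COVID-19 study; exact enumeration scales factorially with the number of features, so this is only practical for tiny examples.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x`, relative to `baseline`."""
    n = len(x)
    contributions = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)          # start with every feature "switched off"
        previous_output = model(current)
        for i in order:
            current[i] = x[i]             # switch feature i on
            new_output = model(current)
            contributions[i] += new_output - previous_output
            previous_output = new_output
    return [c / factorial(n) for c in contributions]


def toy_model(features):
    # Hypothetical model: two linear terms plus an interaction between 0 and 2.
    return 2 * features[0] + 3 * features[1] + features[0] * features[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The values always sum to the difference between the model output at `x` and at the baseline, which is what makes them comparable across different model types.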

Kirill Eremenko: Yeah. For that purpose, do you think a neural network

could work?

Samuel Hinton: It could, it could and we will have neural networks

especially with the patient evolution. Our time series

data is well suited to a recurrent neural network,

something like a long short-term memory network.

The main issue is, we can't train them anyway at the

moment because we only have a few hundred data

points especially...

Kirill Eremenko: I mean ... Sorry.

Samuel Hinton: ... we're going to like...

Kirill Eremenko: I was going to say ...

Samuel Hinton: I was going to say ...

Kirill Eremenko: You go.

Samuel Hinton: I was just going to say, for an LSTM network or a

deep network, you need a lot of data points and it's

very difficult in our case to create new data. Data

augmentation techniques are very difficult to do on

data that is mostly incomplete. It's difficult for us to do

something like a nearest neighbor imputation because

the dimensionality of our models is in the hundreds

and we only have hundreds of data points. Your

nearest neighbor may be a very great distance in

hyperspace from you. It makes it difficult because you

need to impute. Then, you need to try and augment

your data without biasing your models.
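
The point about nearest neighbours being far away in high dimensions can be checked with a small simulation. The sketch below is illustrative only: it draws 100 uniformly random points (not patient data) and compares the average nearest-neighbour distance in 2 dimensions and in 100 dimensions. The function name and the sample sizes are assumptions chosen for the example.

```python
import math
import random


def mean_nearest_neighbour_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance from each point to its closest other point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n_points


low_dim = mean_nearest_neighbour_distance(100, 2)
high_dim = mean_nearest_neighbour_distance(100, 100)
print(f"2 dimensions:   {low_dim:.3f}")
print(f"100 dimensions: {high_dim:.3f}")
```

With the same number of points, the closest neighbour in 100 dimensions is much further away than in 2, which is why a nearest-neighbour imputation borrows values from records that are not really similar.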

Kirill Eremenko: Okay.

Samuel Hinton: How to do that with only a few hundred samples for a

novel disease? That's tough.

Kirill Eremenko: Tough. For neural networks, in terms of

interpretability, like, even if you got a neural network

that predicts everything well, assuming you solve

somehow the problem of a low, the small data set, it's

really hard to interpret why exactly, why these

neurons behave in a certain way? Wouldn't that be a

roadblock to using neural networks for this problem?

Samuel Hinton: Yes. No, for sure. That's why we're trying to get as

much expertise as possible to come in, people that

have done the similar things before. I know you've

seen things like with convolutional neural networks.

There are ways of breaking down the features so that

you can try and visualize them. Essentially, we need

techniques like that that apply in general. It's very

hard to deal with the neural network, especially as the

depth starts to increase. Even if you try and visualize

what neurons are lighting up, like how do you put that

into something that a human can understand?

Samuel Hinton: It's just a massively complicated linear algebra

function, which we have essentially no intuition over

and it's difficult. While some potential partial solutions

exist for some specific variants of neural networks like

CNNs, I'm not sure, I don't know of a generalized

solution. If someone out there listening to this knows a

generalized solution to neural network interpretability

and explainability, please let me know.

Kirill Eremenko: That's all. Okay, got you, for sure. If anybody listening

has any ideas, I think at this stage that will be very

useful. We'll share Sam's contact details in the show

notes as if he's not getting enough meetings already.

Samuel Hinton: You should have seen, I did an interview with ABC

Radio National last week and it went live two days ago.

I have been flooded by well-meaning people offering

support. I had one lady that came in and said, "Look,

I'm retired. I'm just isolating in my home in the Blue

Mountains. I have nothing to do. I'm an ex-researcher

in agricultural science, have a statistics background

as well. Do you need a personal assistant to help you

manage all this?"

Kirill Eremenko: Wow.

Samuel Hinton: I was completely floored by her response and all the

other positive responses we have received. I said no to

her, of course, because the university actually listened

as well when we said we're drowning and we now have

a new project manager, her assistant, a new

administrative assistant on the UQ side of things.

Luckily, we are getting the support that we need. Still,

the response is large.

Kirill Eremenko: That's awesome, man. That's awesome. You are doing

a fantastic job like this work can potentially help stop

this or slow it down. Hats off to you, it's really cool,

really cool. The university might be listening to this

one as well, is there anything else you need? Let's do a

wish list.

Samuel Hinton: A wish list. I wish I could get into America and start

the job that I accepted many, many months ago. I got

offered a very nice fellowship at Lawrence Berkeley

National Lab. I'm supposed to have had my visa

interview and everything planned to fly over there with

the wife and all canceled indefinitely. Who knows, I'll

be at UQ for the foreseeable future and we'll see how

long COVID takes to be consigned to the pages of

history.

Kirill Eremenko: Man, you got married. When did you get married? I

completely, sorry, I missed that.

Samuel Hinton: April 1st.

Kirill Eremenko: Wow. Congrats.

Samuel Hinton: That is our anniversary and we decided it's the best

date to get married because half my friends on

Facebook, especially because of COVID-19, the

ceremony is limited to the celebrant, two witnesses

and me and my wife.

Kirill Eremenko: Yeah.

Samuel Hinton: Five people. It's not like a big affair. When I posted

pictures saying, "Hey, by the way," a lot of people

thought it was a very elaborate joke, which I

encouraged for a solid week until I was like, "Yeah, no,

it's actually real." That was the best I think. I saved so

much money on the ceremony. So much money on the

reception. The honeymoon was a bit lackluster. We sat

down, open Google Maps and just went through Street

View in a few countries. We're like, "Yeah, those look

nice. We'll visit them one day."

Kirill Eremenko: Wow. Okay. How long have you been together?

Samuel Hinton: A while. It's a very short marriage. I think we met 2018

at the end of it, maybe. I'm not quite sure. My memory

is horrible. If you ask her, she will know like the exact

date, the exact time and exactly. Me, I'm just like a

couple years ago is fine.

Kirill Eremenko: Yeah. Awesome, man. Congrats. That's really fun. Very

cool. Awesome. Okay. Hopefully, once this all settles

down, you'll do your work there. You've got your PhD,

right? This is your postdoc?

Samuel Hinton: Yes. I've got the PhD, but in amongst the

COVID-19 stuff it still hasn't been awarded. It was

like submitted and sent off for review so long ago, but

everyone has far better things to do. I'm currently

sitting here, not as a doctor [crosstalk 00:35:44].

Kirill Eremenko: PhD-less.

Samuel Hinton: ... yeah, PhD-less, working like two different postdoc

jobs, pulling my hair out, making courses on the side,

just waiting for my reviewers to eventually come back

and say, "Yeah, it's all good."

Kirill Eremenko: Got you, man. Wow.

Samuel Hinton: They just need to get off their asses. Get me back my

thesis. I [crosstalk 00:36:03] that.

Kirill Eremenko: Got you. Okay. Thank you for the run down on

COVID. Hopefully, things go well there and we all

support you. I'm sure our listeners, please show Sam

some support, send him some nice emails if you are

supporting him. Even if you can't do anything to help,

it's good to know you're listening.

Kirill Eremenko: Speaking of courses, congratulations on launching

your second course, man, like number two. First one

was Python for Statistical Analysis about six or so

months ago. Second one now is, Python for Data

Manipulation. The irony is that's exactly what you're

doing for COVID.

Samuel Hinton: Yes. I'm lucky that everything just fits in so well

together. Yeah, I thought after seeing all the comments

on the stats course, that the biggest skill that the

people taking my course were lacking was the ability to

use libraries like pandas to streamline all of the pre-

processing stuff in their analysis because no one

wants to spend 13 hours crunching numbers to do

half an hour of machine learning or statistical

analysis. I was like, "You know what, okay, pandas it

is. I'll make a crash course for that. Show everyone the

easiest ways and the most efficient ways of doing all

the common tasks."

Samuel Hinton: I hope it's been useful for those that have signed up.

I've got some good reviews so far. Some people have

reached out and said that they really liked it, which is

always pleasing to hear because you don't want to

make it. You don't need to get people to come back

and say, "That was terrible."

Kirill Eremenko: Yeah.

Samuel Hinton: I'm happy. They're happy. I think we're all happy in

isolation.

Kirill Eremenko: That's good, man. Yeah. Just speaking of reviews, you

have some of the highest reviews we've seen across all

the instructors we work with. Both of your courses have 4.6

stars out of 5, stable, which is really hard to maintain on a

massive platform like Udemy. What's your method?

How do you do it? Maybe, there's people out there

looking to create a course these days, like maybe you

can share some insights. How do you get such great

feedback all the time?

Samuel Hinton: Honestly, I'm not too sure. I think one of the things is,

I keep in mind what I wanted as a student, which was a

few years ago now. I always remember listening to my

lectures, the online recordings, and just wanting to

un-enroll. Some of them would go on and on, if I stop,

I really didn't care about pages upon pages of just

talking at me before getting down to anything useful.

Samuel Hinton: I decided if I ever made a course, it wouldn't just be

talk. I mean, I would talk about the code that I'm

writing in front of you and try and keep it practical so

there's something for you to do whether it's run the

code in parallel with me or just read over what I'm

reading or listening. Just not droning on. I try and get

to the good stuff quickly, but lectures always end up

taking far longer than I thought.

Samuel Hinton: I remember I recorded a lecture just about histograms

and the first record, it took about three minutes.

Histograms are pretty simple. There's not much to talk

about. Then, I got a ton of questions from people. They

were saying, "Hey, what about this use case? What

about this here? My code isn't working here." I

realized, even with such a simple concept, there are a

whole bunch of little caveats or things that people

don't quite understand intuitively.

Samuel Hinton: I went back and re-recorded it. It became like a 15-

minute video, but people seem to like it. Those that

already knew were able to watch it at double speed

and sort of skip to the parts that they needed. Those

that had never seen it before managed to get all the

relevant information such that they didn't try out the

code to get an error and then have to hit up Stack

Overflow for half an hour afterwards, trying to figure

out what on earth this keyword that they needed

meant. I'm not sure. I just try and keep that in mind

but beats me.

Kirill Eremenko: Yeah, man. That's a good approach to overdeliver

because I actually once, at one of our live events,

met a student and she told me like, "Oh, it's so weird

to hear your voice in real life because I've been

listening to you online all the time and you sound so

different." I'm like, "How do I sound so different?" She

said, "I listen to you on double speed. I've never." Yeah,

so a lot of people do that. I encourage people to do

that. I would rather put more into a lecture than less

because you can just listen on double. If there's less,

then people who are not as familiar with the topic,

they will fall behind and we don't want that.

Samuel Hinton: Yeah. There's a personal side of it too, which I'm not

sure if this is of too much interest for those listening. If

you're making a course, so we release the stat course,

the Python for statistical analysis course for free, for a

little while during COVID until we had to stop making

it free because in that one week, we did free, I got

42,000 new students. That's more than had been in the

course the entire time it's been up. A huge amount.

The issue was okay, well, I have two jobs at the

moment, plus the newly released course and now I

have 42,000 students who ask questions.

Samuel Hinton: Even though they got the course for free, I'm not going

to ignore their questions. I'm going to go in there and

answer them to the best of my ability, but it takes

time. If you have these short lectures that you think

are like, "This is really efficient," 30 seconds and this

topic is done. You haven't been comprehensive, well,

people will just ask you about the things you haven't

covered. Suddenly, instead of spending 10 minutes

recording an additional one minute in your lecture,

you're spending 10 hours responding to the same

question 400 times and it's just not efficient.

Kirill Eremenko: Yeah. I agree. Tell me this, so once you recorded in

this recent course, Python for Data Manipulation, you

said pandas. Are you using pandas for the COVID

analysis or are you using some other tool?

Samuel Hinton: Yes, no, we're exclusively using pandas essentially for

the data pre-processing step, to the point where I can't

think of a single function in amongst the pipeline that

I threw together that doesn't make use of pandas in

some way. Date times are best handled in pandas.

Pandas has categorical features, which is amazing.

Everything is pandas. That's not going to change. It's

just such a convenient tool.
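
As a small illustration of the two pandas features mentioned here, the sketch below parses a date column and converts a text column to a categorical. The column names and values are invented for the example and do not come from the COVID-19 dataset; it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical raw export: dates arrive as text and one entry is unusable.
raw = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "admitted": ["2020-03-01", "2020-03-04", "not recorded"],
        "outcome": ["discharged", "icu", "discharged"],
    }
)

clean = raw.assign(
    # Unparseable dates become NaT instead of raising an error.
    admitted=pd.to_datetime(raw["admitted"], errors="coerce"),
    # A categorical stores each distinct label once and keeps the set of levels.
    outcome=raw["outcome"].astype("category"),
)

# Datetime arithmetic works column-wise once the dtype is right.
days_since_first = (clean["admitted"] - clean["admitted"].min()).dt.days

print(clean.dtypes)
print(days_since_first)
```

Whether to coerce bad dates to missing values or to fail loudly is a design choice; coercion suits an automated daily refresh, provided the number of coerced values is reported somewhere.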

Kirill Eremenko: That's so cool. It's a very vivid example of practicing

what you preach. I love that it happened in this order

that you first recorded the course about pandas and

data operation. Now, you're actually using those same

tools. It's a great testament to that these are applicable

tools in industry, in medicine, in whatever, like

emergency situations like this, you know these tools,

you go and use them right away. Really cool. Really

cool.

Kirill Eremenko: What else that I wanted? What I wanted to ask you

about is, I've had this question, so since our last chat

I've been killing myself over like breaking my head

because I didn't ask you. I was so like tempted

afterwards like I should have asked this question.

We're talking about Bayesian inference. We're

comparing Bayesian inference to frequentist statistics,

Fisher and his thing. By the way, I don't know if the

listeners listening to this, I actually read that Bayes

was the 18th century and Fisher, as far as I

understand, Fisher was like 1920s, like early 20th

century.

Kirill Eremenko: Fisher didn't like Bayes, right? He created his own

approach to statistics and so on, which we all use now

and which is taught at school the P values and things

like that. Fisher actually, interestingly enough, he, as

from what I read, he tried to prove not that smoking

causes cancer, but that cancer causes smoking. Speaking

of the correlation causation, like because it's a P value,

right? The chart is there. You can run all the tests.

You don't know; you have to have some additional

knowledge to know which way it works. That's like a

side story.

Kirill Eremenko: The question I wanted to ask you, you were talking

about Bayesian inference and you were talking about

prior probabilities, posterior probabilities, if I'm getting

the names right, and how things are interpreted. You

gave this lovely example, by the way, I highly

recommend to listeners to check out the previous

podcast, I'll dig up the episode number and we'll

mention in the show notes. Fantastic episode and you

gave us a great example of like the sun exploding.

Kirill Eremenko: You said, "Okay, so if we have this device on earth that

is looking at the sun and predicts all of a sudden that

the sun's going to explode in the next hour, we can

take all of the prior probabilities. We've seen that the

sun hasn't exploded like prior knowledge that we had.

Sun hadn't exploded in billions of years. Most likely, if

we account for that, then the probability actually goes

down quite a lot." Right? Do you mind repeating or

something like that, the example?

Samuel Hinton: Yeah. The example was there's a box on earth, where if

you push it, like you push a red button on the top, it

will tell you whether or not the sun is going to explode

or not in the next 10 seconds. When you press the

button, what it does is it tosses two dice. If you get one

on both the dice, it just tells you that the sun is going

to explode. Otherwise, it returns the truth. The

frequentist walks up, presses the button, gets

unlucky, it rolls two ones.

Samuel Hinton: They know about the dice, but they say, "Two ones,

that's a 1 in 36 chance. That's less than a P value of

0.05. We have a significant result that the sun is

about to explode," and they run off to publish. In the

background, the Bayesian statistician is just sitting

there shaking his head trying to bet that it won't.
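
The Bayesian's side of this story can be written as a few lines of arithmetic. In the sketch below, the box says "yes" with certainty if the sun really is about to explode and with probability 1/36 otherwise, following the description above. The prior of one in a billion is a made-up placeholder, not a physical estimate, and the function name is hypothetical.

```python
def posterior_sun_exploding(prior, false_alarm_probability=1 / 36):
    """P(sun is exploding | box says yes), via Bayes' theorem."""
    p_yes_if_exploding = 1.0                      # truth or forced "yes": always yes
    p_yes_if_fine = false_alarm_probability       # only the double-one roll says yes
    evidence = p_yes_if_exploding * prior + p_yes_if_fine * (1 - prior)
    return p_yes_if_exploding * prior / evidence


prior = 1e-9  # hypothetical prior belief that the sun explodes right now
posterior = posterior_sun_exploding(prior)
print(f"prior:     {prior:.2e}")
print(f"posterior: {posterior:.2e}")
```

The "yes" does raise the probability, by a factor of about 36, but from a starting point so small that the posterior is still negligible, which is why the Bayesian takes the bet.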

Kirill Eremenko: Yeah, because he is using or she is using the Bayesian

inference, right, which takes into ...

Samuel Hinton: Yes.

Kirill Eremenko: Can you tell us a bit about this prior probability?

Samuel Hinton: Right. Bayes' theorem is that the posterior, which is

the likelihood, the whole likelihood.

Kirill Eremenko: The end result.

Samuel Hinton: Yeah, so the end result is a combination of likelihood.

Yeah, you're right. I shouldn't say likelihood when I'm

talking about the posterior. It's a combination of the

likelihood and the prior. The likelihood is, what is the

chance of getting the data, given our model. Then, the

prior is what's just the flat chance with our current

information of that model? If you combine all of those,

it's proportional to the probability of your model given

the data, which is different to the likelihood.

Samuel Hinton: If I speak the math very quickly, it's saying that the

probability of theta given D is proportional to the

probability of D given theta [inaudible 00:46:49] times

by the probability of theta, theta being the model. The

idea is that in frequentist statistics, we work with the

likelihood and that's fine. That's good. That's what you

need to do. Then, when you look at Bayes factor, you

also add in the prior, which is your prior, past,

existing information.

Samuel Hinton: There is another part, the whole thing is a fraction. On

the bottom is what we call the evidence. Let's not get

into that, because that's a much more conceptually

difficult thing to talk about without actually having

diagrams or being able to write any math. Posterior is

proportional to the likelihood multiplied by the prior.
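
Written out, with theta standing for the model and D for the data as in the spoken version above, the relationship is the standard form of Bayes' theorem. The denominator P(D) is the evidence that Sam sets aside; because it does not depend on theta, dropping it leaves a proportionality.

```latex
\[
P(\theta \mid D)
  \;=\; \frac{P(D \mid \theta)\, P(\theta)}{P(D)}
  \;\propto\; P(D \mid \theta)\, P(\theta)
\]
```

Here P(theta | D) is the posterior, P(D | theta) is the likelihood and P(theta) is the prior.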

Kirill Eremenko: Got you. What I was wondering since then, like

literally, I think we hung up and this question popped

to my mind or it was like towards the end of the

podcast, we're running out of time. Anyway, so do you

know the turkey paradox?

Samuel Hinton: No, no, you're going to have to tell me.

Kirill Eremenko: It's simple. It's super simple, but it's not mathematical.

It's got nothing to do with Tukey, the mathematician

with the, what's it called, T test, I think. It's just about

a turkey like an animal. The turkey is born. It's a bit

scared of everything at the start. Then, the farmer

comes along or whatever. The butcher comes along

and feeds it some corn and it says, "Wow, I got some corn

from this butcher. That's amazing. Okay, well, maybe

we can be friends. What's the likelihood of somebody

giving me corn for free?"

Kirill Eremenko: All right? Then, day passes, two days, three and every

day he's getting corn. Then, maybe a month passes,

six months, a year, I don't know how long turkeys are

raised for. Every single day, he's getting this corn. The

prior that's part of the probability, yeah, the prior,

right, it's growing. It's like, "There's evidence that he's

my friend, he's my friend." If you apply Bayesian

inference, the probability of the butcher butchering the

turkey in the turkey's mind is going down all the time

because all the evidence it's seeing is like with the sun,

it's not blown up. I haven't been hurt by this butcher.

Kirill Eremenko: I apologize to the vegans out there, but at some point,

the butcher comes and slaughters the turkey for

Thanksgiving or for some other thing. This really messed

with my head, just like you have the sun example on

one hand, but with the turkey example, the whole

Bayesian inference goes down the drain, because the

result is inevitable, like it's going to happen. I wanted

to get your thoughts on that. How do you apply

Bayesian inference? Or what does that say about

Bayesian inference?

Samuel Hinton: Nothing. I mean, in this case, the Bayesian inference is

perfectly fine. On any given day, the turkey is very

likely not going to die until the day that the butcher

decides he's had enough with that thing gobbling up

all his bread. That means that Bayesian inference has

served the turkey correctly for every day but one, which

is a lot of being served correctly.

Samuel Hinton: It's only an issue in our heads because we know that

the butcher is coming. We have access to hidden

information. Our information is different to the turkey's.

We see this and we go, "Oh man, the probability of the

turkey being butchered is so low from the turkey's

point of view." It's like, "It is," from our point of view.

We know it's coming. That's just because our priors

are different. It serves the turkey well, like as

everything does, until it suddenly stops working. If the

turkey lives for a few years, it served it very well for a

very long time.
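
One way to put numbers on the turkey's reasoning, which is an assumption for illustration and not something stated in the conversation, is Laplace's rule of succession: starting from a uniform prior on the daily chance of being fed, after being fed on every one of n days the predicted chance of being fed tomorrow is (n + 1) / (n + 2). The function name below is hypothetical.

```python
from fractions import Fraction


def probability_fed_tomorrow(days_fed, days_observed):
    """Laplace's rule of succession under a uniform prior on the daily feeding rate."""
    return Fraction(days_fed + 1, days_observed + 2)


for days in (0, 1, 30, 364):
    belief = probability_fed_tomorrow(days, days)
    print(f"after {days:>3} days of corn: {belief} = {float(belief):.4f}")
```

The turkey's belief climbs towards 1 and is right on every day except the last. The human observer reaches a different answer only because they condition on extra information, the fate of previous turkeys, that this model never sees.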

Kirill Eremenko: Got you. Our priors are different. I think that's a key

that we've seen hundreds of millions of turkeys prior

to that and we know that their result 99.999% is this.

Samuel Hinton: Yeah, precisely. Our conditional probability is

conditioned on the knowledge that we know the

butcher is going to butcher.

Kirill Eremenko: Yeah. Interesting. Yeah. That's a good feature of

Bayesian inference. Not a feature, the more knowledge

you have, the more accurate your prediction will be.

On that note, do any industries or businesses or I

don't know, applications actually use Bayesian

inference these days? I've heard of a few, but what's

your knowledge in this space? Because it looks like

everybody is using frequentist statistics, whereas

Bayesian has a place as well.

Samuel Hinton: I think it's difficult to say because it's easy to mix

things up. One of the giveaways of frequentist statistics

is when someone starts talking about P values. We

generally don't do that in Bayesian statistics. If I use

Bayesian statistics and I calculate some variable, I

would say that, X has been detected at 3.8 sigma

confidence or similarly. I'll say, you would use a P

value in frequentist statistics, but that doesn't

mean that someone using frequentist statistics can't

incorporate prior knowledge.

Samuel Hinton: They just do it under a different formalism. If someone

says, "Hey, are we doing Bayesian technique?" I need

to sit down and say, "Okay, well, how have you

formulated your model? What prior information do you

have? How is that being incorporated?" It's very easy to

try and sneak that information into the likelihood.

That's fine to do in some context, but it does give you

certain different mathematical properties of your

outputs. It's hard to say.

Samuel Hinton: A lot of the cases, you do use Bayesian-like techniques

or using prior information almost everywhere you go.

Every time that you've done something with deep

learning, you may not have run with a Bayesian

neural network, which are things and they're

wonderful and you should check them out. The fact

that you've trained on 10 million images means that

you're incorporating prior information already. It's just

not under the formalized Bayesian statistics headline.

It's sort of a blurred line that's difficult to actually draw

in the sand.

Kirill Eremenko: Okay, okay. Interesting. Thanks. Thanks for the

rundown. You should create a course on Bayesian

inference. I'll happily take it to learn ...

Samuel Hinton: Yeah. I thought about it, modeling, how to fit models,

whether you're using different MCMC, so Markov

chain Monte Carlo processes or similar. Because

everyone has a model and it's a lot easier to write a

model than it is to correctly fit it to the data and draw

the right inferences from it. That's what I do a lot.

Maybe one day when there's enough interest, I'll write

up and record a course on that. It all depends on what

people want. Everyone get keen for model fitting and let

me know.

Kirill Eremenko: Speaking of what people want, we have a very cool

surprise announcement. Sam is joining us as a

speaker on the Advanced Day at DataScienceGO

Virtual. DataScienceGO Virtual is happening at the

end of June or start of July. We're still deciding on the

date. By the time this goes live, you can find out the

date for sure it's available at datasciencego.com. Get

your tickets there.

Kirill Eremenko: Sam will be joining us as a speaker. What will you be

talking about? Actually, a workshop. What workshop

is it going to be?

Samuel Hinton: Probably something on data science pipelines. Given

that I've spent the past few months writing a few of

these. Years before that, I've been doing them for an

astrophysics context. It seems smart to finally

formalize that and write up a workshop. I've given

workshops in the past on different topics so it'll be

good to add another notch to the belt and finally write

up everything that I've been doing on this.

Samuel Hinton: The idea is quite simple, I think, which is, every data

scientist doesn't want to be doing data cleaning. No

one particularly likes cleaning and standardizing data.

How can you write a pipeline in the easiest, most

flexible and most extensible way possible to streamline

all of that to get you either your data products as

quickly as possible or to go through generate not just

the data products, but then do the machine learning

and validation on them to get you not data products,

but now machine learning products or business

intelligence products at the end.

Samuel Hinton: The less time people spend sort of screwing around

playing with the data, the more time people can spend

actually getting insights from the data.
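
A pipeline in this sense can be as simple as an ordered list of small functions, each taking the data and returning a cleaned copy. The sketch below is a minimal, hypothetical example in plain Python; the record fields, the step names and the fever threshold are invented for illustration and are not the steps used in the COVID-19 project.

```python
from functools import reduce


def drop_incomplete(records):
    """Remove records that are missing an age."""
    return [r for r in records if r.get("age") is not None]


def add_celsius(records):
    """Standardise temperature to Celsius without mutating the input."""
    return [{**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in records]


def flag_fever(records, threshold_c=38.0):
    """Add a boolean fever flag (threshold is an illustrative assumption)."""
    return [{**r, "fever": r["temp_c"] >= threshold_c} for r in records]


def run_pipeline(records, steps):
    """Apply each step in order, feeding the output of one into the next."""
    return reduce(lambda data, step: step(data), steps, records)


raw_records = [
    {"patient_id": 1, "age": 64, "temp_f": 100.4},
    {"patient_id": 2, "age": None, "temp_f": 98.6},
    {"patient_id": 3, "age": 41, "temp_f": 102.2},
]

clean_records = run_pipeline(raw_records, [drop_incomplete, add_celsius, flag_fever])
print(clean_records)
```

Because every step has the same shape, adding, removing or reordering a step is a one-line change to the list, and the whole run can be triggered by a scheduler instead of by hand.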

Kirill Eremenko: Absolutely. The way this topic came about is that we

ran a survey and over 1,700 people interested in

attending DataScienceGO Virtual completed the

survey. Among other advanced practitioners,

specifically, the most popular topic was data science

pipelines by far. The next topics were still popular, but

there was a huge, huge difference in the number one

and number two topics.

Kirill Eremenko: Why do you think data science pipelines is so in

demand right now among specifically advanced

practitioners?

Samuel Hinton: Exactly like what I said, no one wants to spend their

time doing it. Data scientists spend most of their time

not doing data science.

Kirill Eremenko: Cleaning data, right?

Samuel Hinton: Yeah. It's an awful waste of time. It's not a fun job. It's

not a rewarding job. You get a data product at the end

and now you can start your real job, just getting

insight and crunching the models down to actually

extract useful information and being able to do

something. Like we've done with the COVID study,

where every day we get a data refresh and that

happens automatically at 6:00 a.m. It's kicked off, I

don't do anything.

Samuel Hinton: Then, at the end of that pipeline and it takes about five

minutes to run, we have data products that have been

uploaded to secure sites. We have reports available for

people. We have an interactive dashboard. This isn't

currently in but I have the data, we use to take it out.

We will have machine learning products that have

been refreshed each day because you don't want to go

through and say, "Hey, we've got a new data set. We've

got a few extra records.

Samuel Hinton: "I'll just manually run these 30 models that are thrown

together and re-compare them." It's like, no, you want

to press a button, go off, have a little nap, come back

and have your results there.

Kirill Eremenko: Yeah, yeah, absolutely. Maybe handle some exceptions

at the most that [crosstalk 00:57:19].

Samuel Hinton: There's always exceptions, isn't there?

Kirill Eremenko: Yeah. Okay. How do you teach data science pipelines?

Give us a teaser. What's a workshop on data science

pipelines look like?

Samuel Hinton: Probably a lot of code. There's no way around it.

Maybe one or two slides to try and illustrate the topics.

I guess, the way that I have it in my head at the

moment is it's essentially collaborative coding.

Everyone's coding their own thing. I will have my pre-

done version and then people can deviate from that as

they will. Probably something like Google Colab or

JupyterLab in some instance, just to give everyone

the basics.

Samuel Hinton: How you can throw these things together, how you

can chain all these methods in a robust way. Then,

how you can tie that into your machine learning

product. Hopefully you start with, "Here's a bunch of

raw data files." Then, at the end, you press a button

and out come all your products that you want.

Obviously, the way that this has to be done in the

workshop may be different to how people do it in

industry.
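
As a rough illustration of the chaining idea described above (this is not Sam's workshop code), a pipeline can be sketched in plain Python as an ordered list of small functions, where each step takes the output of the previous one. All function names and the `patient_id` and `age` fields are hypothetical, chosen only for the example.

```python
from functools import reduce

def drop_incomplete(records):
    """Remove records that are missing an ID or an age."""
    return [
        r for r in records
        if r.get("patient_id") and r.get("age") not in (None, "")
    ]

def standardize_age(records):
    """Convert ages stored as strings into integers."""
    return [{**r, "age": int(r["age"])} for r in records]

def summarize(records):
    """Produce a small 'data product': record count and mean age."""
    ages = [r["age"] for r in records]
    mean_age = sum(ages) / len(ages) if ages else None
    return {"n_records": len(records), "mean_age": mean_age}

def run_pipeline(data, steps):
    """Feed the output of each step into the next, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)

# Hypothetical raw input with the kind of gaps real data tends to have.
raw = [
    {"patient_id": "a1", "age": "34"},
    {"patient_id": "a2", "age": None},
    {"patient_id": "", "age": "50"},
    {"patient_id": "a4", "age": "46"},
]

steps = [drop_incomplete, standardize_age, summarize]
product = run_pipeline(raw, steps)
print(product)
```

Keeping each step as a separate function means a step can be tested on its own, swapped out, or extended with a modelling stage at the end without touching the others, which is one way to get the "press a button" behaviour on a laptop-sized data set.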

Samuel Hinton: If you have a very large data set, if your data set is

either billions of records or thousands of features, you

may not be able to run this on a laptop. You may need

to ship it out to high performance computers, submit

it to some sort of batching job on a cluster like SLURM

or SGE, a whole bunch of them. That's very difficult to

do in a workshop. No one wants to spend two hours

setting up and trying to apply for accounts. We'll have

to cover a representative but basic example. Then, give

people the skills or the pointers as to how they can

scale that up.

Samuel Hinton: Whether they're using things like Dask to try and scale

out to clusters or whether they just need to know this

is how you submit jobs to a supercomputer.

Obviously, you can't do all of that in a single

workshop. We'll have to cover the basics as much as

possible. Then say, for your use cases, this is where

you want to go. For your use cases, you're going to go

look over at this. That's actually something that I think

we're going to run a survey with the people that

responded to the first survey saying, what are your use

cases? What, in your mind, is a good data science

pipeline? What do you want out at the end? What are

the products that you're talking about? What are the

inputs that you're dealing with? Are we talking about

megabytes, gigabytes or terabytes of data?

Samuel Hinton: The pipeline changes depending on all of these

questions. That's something that we really need from

the people going to the conference is what are their

use cases? Because only with that knowledge, can we

create an effective workshop that actually benefits

them at the end of the day.

Kirill Eremenko: Absolutely. Yeah. That's one of the reasons why we ran

the first workshop to know exactly what people want.

Sorry, so we ran the first survey. Now, we're going to

run the second one. If you're listening to this, and for

some reason you weren't part of the second survey,

maybe we missed your response, in terms of

identifying you as a key participant for the second

survey or these more in depth interviews that we're

conducting, please send either our team or Sam

directly, preferably Sam directly, an email.

Samuel Hinton: No, no, no preferably the team.

Kirill Eremenko: You can send us an email. We'll include it in the show

notes for this episode, so you can find it there.

Basically, send us an email and explain exactly what

you would like to see in this data science pipelines

workshop. When this goes live, we will still have time

to incorporate your feedback.

Kirill Eremenko: Yeah, that'll be cool. I'm looking forward to it. It'll be

virtual so there'll be people from all over the world.

Yeah. You have so much experience with this now

especially with this COVID stuff.

Samuel Hinton: Yeah, sounds like fun. No pressure though. Yeah?

Kirill Eremenko: Hope you find time to not go crazy with all this stuff

around ...

Samuel Hinton: Fingers crossed.

Kirill Eremenko: Another thing that I had in mind, once you were talking

cosmology with my girlfriend. What's your website

again? What's that wonderful website?

Samuel Hinton: cosmiccoding.com.au.

Kirill Eremenko: Yeah. Everybody cosmiccoding.com.au. Don't forget

the .au, we're very special in Australia. Yeah. Amazing.

I love the talk. If anybody's looking, it's called the Dark

Side Of The Universe by Sam Hinton in the

BrisScience lecture. I've watched about three quarters

of it so far, or maybe two thirds. Amazing. I loved it. I

knew about dark matter. I didn't know there was even

more dark energy in the universe. That's crazy, man.

Samuel Hinton: Yeah. I wish I knew what they were.

Kirill Eremenko: Yeah. I'll due to that. Some cool things. I like that you

provide the evidence of how the charts fall in place and

that this is not just like voodoo stuff. It's actual ...

Samuel Hinton: Yeah. It's probably the most common question I get is

like, well, what if dark energy and dark matter is just a

mistake? What if Einstein was wrong? I was like,

"Okay, he could be wrong but there are dozens of

independent, different avenues of investigation that all

come to the same conclusion." If it's a mistake,

someone needs to come in and say how, because we

have very, very substantial evidence that it's a real

thing. We just don't know exactly how it's supposed to

function or where it came from.

Kirill Eremenko: Yeah.

Samuel Hinton: One day, we'll have the answer.

Kirill Eremenko: One day, maybe. A million years from now.

Samuel Hinton: Pretty much.

Kirill Eremenko: Is that what your job in America is going to be about?

What's this postdoc?

Samuel Hinton: Yeah, so the postdoc is to investigate dark energy and

dark matter primarily using two different probes of the

universe. The first being Type Ia supernovae and the

second being the large scale structure of the universe.

To give a very, very tortured and brief intro to both of

them. Type Ia supernovae are a sort of exploding star

that explodes at around the same brightness every time.

What you can do is you can use them to map out the

history of the universe, because remember, light takes

time to travel. A galaxy that has a supernova that we

see that's a billion light years away, well, that

supernova exploded a billion years ago.

Samuel Hinton: Because they're all the same brightness, it means we

can figure out how far away that galaxy is by how dim

the supernova is. If you have a light and you start

walking away, if you walk twice as far away, the

brightness of the light is now a quarter, because the light

is spreading out to cover the area of a sphere,

4πR². The idea is with this standard

candle, we can map out the evolution of the universe.
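
The inverse-square reasoning above can be sketched in a few lines of Python. This is a simplified illustration only: it assumes light spreading through static, flat space with no redshift or dust corrections, whereas real supernova cosmology works with luminosity distance and many calibration steps. The function names and the example luminosity are hypothetical.

```python
import math

def observed_flux(luminosity, distance):
    """Flux through a sphere of radius `distance`: L / (4 * pi * d^2)."""
    return luminosity / (4 * math.pi * distance ** 2)

def distance_from_flux(luminosity, flux):
    """Invert the inverse-square law to recover distance from brightness."""
    return math.sqrt(luminosity / (4 * math.pi * flux))

# Hypothetical standard candle with a known intrinsic luminosity (arbitrary units).
L_CANDLE = 1.0

near = observed_flux(L_CANDLE, distance=1.0)
far = observed_flux(L_CANDLE, distance=2.0)

# Walking twice as far away should leave a quarter of the brightness.
print(far / near)

# Knowing the intrinsic luminosity, the measured flux gives the distance back.
print(distance_from_flux(L_CANDLE, far))
```

The second function is the "standard candle" step: because the intrinsic brightness is assumed known, the only unknown left in the measured flux is the distance.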

Samuel Hinton: The evolution of the universe changes depending on

the properties of dark energy and dark matter. The

better we can constrain the evolution, the better we

can determine the properties of those mysterious

components. Then hopefully, a theoretician will come

along and say, "I propose dark energy is this with

these properties." We say, "That works or that doesn't

work." At the moment, the leading one is Einstein, who

just said dark energy is probably just what you get if space itself,

empty space, had energy. Turns out that fits with

everything.

Samuel Hinton: We just don't know why it should have energy. You

take quantum mechanics and you calculate how much

energy the empty vacuum should have, it's not zero,

right? Quantum mechanics says there should be

energy, but it says there should be so much more

energy, 100 orders of magnitude more energy than we observe,

which is catastrophically wrong.

Samuel Hinton: The second thing, the second probe is large scale

structure. Let's see, what's the easy way to explain

that? The universe is big now. In space, no one can

hear you scream. That's true. Remember, the universe

is expanding. If we go back in time, the universe gets

smaller and smaller but the amount of stuff doesn't

decrease. The amount of stuff stays the same, but it's

now in a much smaller volume. Space goes from being

empty to being filled with stuff like Earth's

atmosphere. It goes to be thick and dense and it

becomes a fluid. Because [crosstalk 01:05:23].

Samuel Hinton: Yeah. If you go all the way back to right after the Big

Bang, space looks like a fluid. It's got so much stuff in

it and space is smaller, it acts like a fluid. What that

means is, well, quantum mechanics says that right

after the Big Bang, some parts of the universe have

just a little bit more energy than other parts. Energy,

mass, light, they're all the same thing at this point in

time, so it has a bit more stuff.

Samuel Hinton: Imagine you blow up a balloon in the atmosphere.

That balloon, the area inside the balloon, has more air

than outside. You pop it, you get rid of that elastic

shell and what happens? You hear the pop, the air

spreads out. It's a little shock wave. It's generally not a

shock wave because it's just air pressure moving. It's a

sound wave. It's an acoustic wave. You have these in

the early universe. You have these acoustic waves from

these overdense regions spreading out.

Samuel Hinton: Imagine, it's like you've got a still lake and it starts to

rain, you can see all the ripples from the raindrops

spreading out. That's what the early universe looks

like. I'm taking a little bit of time here, but space was a

fluid back then and it's not a fluid now, which means

at some point, it went from a fluid to not a fluid. This

actually happens incredibly quickly in astronomical

terms. We're not talking billions of years or millions of

years. We're talking thousands of years, very quickly.

Samuel Hinton: Imagine you've got this lake that's been rained on with

all the ripples spreading out. Then, suddenly,

instantly, the lake snap freezes for some reason.

All the ripples will now be imprinted in the

ice because it's frozen straight away. That's what we

see in the universe except we can't hear it. We can't

hear the ripples. If we take a telescope and we

measure 100 million galaxies, we can reconstruct the

ripples because the ripples are patches of overdensity.

Overdensity means more mass, more mass, more

gravity, more stars, more galaxies, more stuff.

Samuel Hinton: If we simply figure out where there's more stuff in the

universe than other places, that's a ripple. We go out,

we find all these ripples, and we use them in a very

similar way to how we use the supernovae. Instead of it

being a standard candle, it's a standard ruler. We

know how big the ripples are. We know that they had

300,000 years to expand. We know at what speed they

expanded because it's tied to the speed of light.

Samuel Hinton: If we find a ripple that's X big in some area of the

universe, we know that it started off as Y big. It's

thereby increased in size X over Y. We can figure out

how much any patch of the universe has expanded.

Again, we use that to map the expansion history. We

try and infer the properties of dark energy and dark

matter. That was about a five minute explanation. I'm

going to call it there. Anyone wants any more detail,

there's a whole bunch of videos on dark energy and

large scale structure. That thing is called the baryon

acoustic oscillation if you're curious.

Kirill Eremenko: Wonderful, wonderful. What do you predict as some side

effects of your research? Will we have new MRIs or

anything like that?

Samuel Hinton: It's so hard to tell. The main benefit of current astro

research is in breakthroughs that we have in deep

learning and machine learning. We obviously have

images of the night sky. We've tried to identify things

in those images. That's obviously very closely related

to things like identifying tumors or medical

abnormalities in MRI images or similar.

Samuel Hinton: At the moment, it's looking less like a tech

breakthrough. Astrophysics gave the world digital

cameras a couple of decades ago. I think we're still

coasting on that and we'll coast on that for as long as we

can. For now it's just about sharing techniques.

Kirill Eremenko: Okay. Got you. All right. Looking forward to that, it

sounds very exciting. Hopefully, this job goes ahead

very soon. It's going to be fun.

Samuel Hinton: Fingers crossed.

Kirill Eremenko: Okay. One last thing I wanted to chat to you about

before we finish up is what books are you reading? A

person of your mind, of your breadth of applications

and knowledge in data science and other fields, surely

there are some interesting things you're looking into

where you get all this information from?

Samuel Hinton: Yeah. There are a few. There's a few books that I was

recommending recently on causal modeling that I have

on my to-read list. However, in the past few months, I

will admit that I have not touched a single textbook.

That's not that abnormal for me. With so much work

with having both of these jobs and working nonstop,

when I get a bit of downtime, I pick up my novels.

Samuel Hinton: I just need a break from all of this data science, all of

these data pipelines, I just need to turn off. I'm a huge

reader of fantasy. Brandon Sanderson, all of his books

and similar authors. I've read around, I think, 45

books this year.

Kirill Eremenko: Forty-five books this year. It is April.

Samuel Hinton: Yeah, it's ...

Kirill Eremenko: Ten per month.

Samuel Hinton: I normally average like one every day or two. If I have a

weekend off, I can read an entire book in a day. I

generally feel bad at the end of it. I feel like oh, wow,

you really should have done something else. You could

have been at least a bit productive.

Kirill Eremenko: Yeah. Okay.

Samuel Hinton: I try not to be, you know.

Kirill Eremenko: It's crazy, man. It takes me a month to read a book

sometimes. What's the most memorable book even if

it's fiction that you read this year?

Samuel Hinton: Geez. I don't even know the name of them. That's the

issue, I don't keep track. I was like, that's a good book.

I will download it, read it and go on to the next. I don't

remember the authors. I don't remember the names.

Let's see. Hang on, give me one second. I have the

online repository in front of me. Let me just open

books. Yes. Okay. I think one of the ones that I liked

the most by Andrew Rowe was Sufficiently Advanced

Magic, which is just a light hearted fantasy

progression style thing. That was nice.

Samuel Hinton: Then, there was the Cradle series by Will Wight, it

was seven books or so. I read those in about three

days.

Kirill Eremenko: How do you spell that?

Samuel Hinton: Small books.

Kirill Eremenko: Yeah. Small books. Do you do like some speed reading

or something like that?

Samuel Hinton: I used to. I try not to anymore. I did speed reading

years ago and I got up to like several thousand words

a minute. I just realized that that's no fun. If you read

a book in an hour, A, my retention is horrible. I read to

relax. What am I supposed to do if I read everything I

have on my Kindle or my phone in a single day? No, I

probably still read exceptionally quickly but I no longer

try and speed read.

Kirill Eremenko: Okay.

Samuel Hinton: It's probably still abnormally fast but I just have to live

with that.

Kirill Eremenko: It is. It is. Very impressive, that's 45 books in this year.

I'm put to shame. I've read like three, two or three.

Yeah, okay. All right, cool. Thanks for the

recommendation, Andrew Rowe and Will Wight, some

good fantasy books if anybody's looking for any.

Kirill Eremenko: Yeah, I think we've covered off everything. Is there

anything you wanted to touch on before we wrap up?

Samuel Hinton: Not particularly. I'm just hoping that I get a weekend

off in the next couple of months and can sit down and

just chill out and perhaps go and actually read a

textbook for once. That would be nice, a change of

pace when things slow down enough that I can just

breathe and learn. Because learning is, I think, one of

the things that I like most. It's so hard to find the time.

Samuel Hinton: If anyone out there is currently stuck in iso and is

dedicating themselves to going through courses and

upskilling themselves, I think, that's an absolutely

fantastic use of time. I wish everyone doing that the

absolute best of luck. I know that there's a bunch of

people that have lost their jobs recently, and at least

trying to make a positive use of the time that we all

have to spend at home is as much as we can ask.

Kirill Eremenko: Fantastic. Thanks, Sam. A huge thank you on behalf

of all our listeners for doing what you're doing. You're

helping the world. Even though you don't have any

free time and it's hectic, somebody's got to do the job

and you're the best fit for this from the people I know

for sure. Thanks so much for what you're doing.

Awesome.

Samuel Hinton: My pleasure.

Kirill Eremenko: Before we wrap up, where can our listeners find you?

What are the best places to contact you? Follow your

work? Get in touch?

Samuel Hinton: Let's see. I mean, LinkedIn is an easy one. You can

send me a message there. I don't check it often but I

do check it eventually. That's probably the best way

because most emails or whatnot that I get in the data

science route are just as easily done on LinkedIn. No

one really wants Instagram. I try not to do work on

that. Apart from that, yeah, hit me up on LinkedIn,

probably.

Samuel Hinton: If there are any urgent queries, feel free to send me an

email. Just know that I am incredibly swamped with

emails at the moment. I don't know if I'll have time to

respond in the next couple of weeks.

Kirill Eremenko: Got you. Sam's website if anybody is interested in

watching his lecture is cosmiccoding.com.au. Very,

very cool. Thanks again, Sam. Great, great pleasure

chatting on the show. Awesome as always.

Samuel Hinton: Thanks for having me, mate.

Kirill Eremenko: There we have it. Thank you so much everybody for

spending your time, investing your time into this

episode and learning alongside with us. I hope you got

a lot of valuable takeaways. Yeah, so much cool stuff,

so many cool things. Without a doubt, my favorite part

of this episode was all the things that Sam is

describing about the COVID Critical Care Consortium

where he is the lead data analyst and all the

takeaways he's getting.

Kirill Eremenko: Also, the insights into what it's like to work with real

world data, how messy it is, what challenges come

up. I think that was a great refresher. Some projects,

especially if they are course projects or projects

prepared for you by somebody else, can be too clean,

like they might not have any messiness in

the data and anybody can be led to believe that data

science is like that. It's not. It's actually very, very

complex. There's a lot of missing data. There's a lot of

normalization that needs to happen. A lot of pre-work

of the data, building the data pipeline, all of that is

super valuable.

Kirill Eremenko: Speaking of data pipelines, make sure to check out

Sam's workshop at DataScienceGO Virtual.

DataScienceGO Virtual is happening at the end of

June, this year, 2020. You can get your ticket

absolutely free if you go to datasciencego.com/virtual.

Just be careful, number of seats is limited. This is our

first time doing an online virtual event. We've done this

event many times in real life in California for many

years. This is our first virtual event. The number of

seats is limited. Make sure if you want to get in, apply

for your seat today at datasciencego.com/virtual.

Kirill Eremenko: You'll see Sam running a workshop on data science

pipelines. You'll actually be able to code along with

him and create your very own data science pipeline.

Make sure not to miss that. As usual, you can get the

show notes for this episode at

superdatascience.com/367. That's

superdatascience.com/367. There, you'll find the

transcript for this episode, any materials we

mentioned, including books and the URLs to Sam's

LinkedIn, website, his presentations online, and any

other fun things that might help your learning growth

in data science.

Kirill Eremenko: There we go. Make sure to check that out as well. On

that note, thanks so much for being here. Sam and I

are looking forward to seeing you at DataScienceGO

Virtual in a couple of weeks. Apply for a ticket today if

you haven't yet. Until next time, happy analyzing.