SDS PODCAST EPISODE 367:
BUILDING DATA PIPELINES FOR COVID-19 MODELING
Kirill Eremenko: This is episode number 367 with Astrophysicist and
Online Data Science Instructor, Sam Hinton.
Kirill Eremenko: Welcome to the SuperDataScience podcast. My name
is Kirill Eremenko, Data Science Coach and Lifestyle
Entrepreneur. Each week, we bring inspiring people
and ideas to help you build your successful career in
data science. Thanks for being here today and now,
let's make the complex simple.
Kirill Eremenko: Welcome back to the SuperDataScience podcast
everybody. Super pumped to have you back here on
the show. Today, we're hosting Sam Hinton, who's
returning for the second time round. The first time he
was on this podcast was in episode number 303 in
October 2019 where we talked about hypothesis
testing and what it means for the world of data science
along with other topics. That episode was hilarious. I
highly recommend checking it out. That's SDS 303.
This one is going to be super fun as well.
Kirill Eremenko: Sam Hinton is always fun to talk to. He's got a great
personality and very outgoing and loves to share
things. I had a lot of laughs and I'm sure you're going
to have a lot of laughs along the way with us. What is
happening in Sam's life? What did we talk about?
Number one, very important, I think you're going to be
very interested in this is that Sam is the lead data
analyst for the COVID Critical Care Consortium, which
is one of the largest studies in the world right now
looking into COVID-19 and what is happening to
people who end up in critical care, things like
ventilation and other factors.
Kirill Eremenko: You will get a lot of interesting thoughts from a data
scientist who's actually working, like spearheading
this direction working with other scientists in over 100
or approximately 100 different countries around the
world. You'll find out what they're looking into. In
addition, Sam will talk about some of the challenges of
data like what are the real world challenges that data
scientists face?
Kirill Eremenko: Right now, he's facing all of this data that's coming in
that is inaccurate or maybe incomplete. In many
cases, it has to be cleaned up, has to be
normalized. Lots of pre-work on the data has to
happen and you'll find out how he's building this data
pipeline and what it means. We'll be talking quite a bit
about data pipelines. Very, very interesting. I'm sure
everybody can get a lot of value out of this.
Kirill Eremenko: We'll also talk about data modeling, Bayesian statistics
and DataScienceGO Virtual and how Sam will be
joining us to run a workshop there. Make sure to
listen in on that, that's going to be very cool and
maybe that workshop will be right for you. At the end,
we'll talk about astrophysics. You'll find out some cool
things about dark energy and dark matter. Super
exciting, super fun podcast. I can't wait for you to
check it out.
Kirill Eremenko: Without further ado, let's dive straight into it. I bring
to you, Sam Hinton, astrophysicist and online data
science instructor.
Kirill Eremenko: Welcome back to the SuperDataScience podcast
everybody. It's super fun to have you back here on the
show. Today, we got a very special guest, Mr. Sam
Hinton. Sam, welcome back. How are you doing?
Samuel Hinton: Thanks for having me again. It's always a pleasure.
Kirill Eremenko: Fantastic, man. Second time, how long was it the last
time? The last time was like, what was it, eight months
ago or so we chatted?
Samuel Hinton: I have no idea. More than a week which means I barely
remember it.
Kirill Eremenko: How have you been since then? Things going well?
Samuel Hinton: Things have been hectic. I'm sure that there are a lot of
people who've lost their jobs and things aren't
hectic, and a lot of other people in recent days who now
have 10 times the workload and don't know when to
sleep. I guess I'm lucky to be one of the second group.
Kirill Eremenko: Yeah. No, that's good. Why did you not know when to
sleep? What's happening in your world?
Samuel Hinton: I've got my normal job. I'm a postdoc at the University
of Queensland. I'm trying to lead the Dark Energy
Survey Supernova Cosmology Analysis. Lots of fun
astrophysics science that I'm supposed to be doing. I
am now also the lead data analyst for the COVID
Critical Care Consortium, which is, as of right
now, I think, the largest international study on
COVID-19 in the world, specifically looking at things
like ventilation and all the stuff that we know is
quite difficult with COVID.
Kirill Eremenko: Wow, that's pretty cool. First of all, what countries are
in that consortium?
Samuel Hinton: We've got almost 50 countries now. I know the US
signed on weeks ago. Now that all the legal agreements
are in place, their hospitals are coming online. We've
got data from Estonia, from Kuwait, from almost, well,
a whole ton of European countries apart from France.
France has their own study and they're not joining
ours. We don't have any from Russia either. Almost
every other country is signing up. We've hit almost 50
countries, hundreds upon hundreds of different
hospitals sites. Soon, the data should start pouring in,
we hope.
Kirill Eremenko: Okay, but tell me like how did you get this job? Out of
all data scientists in Australia or in the world, I don't
know even, how did you get this?
Samuel Hinton: Mostly, luck. A lot of things in life are luck or just
being in the right place at the right time. It turns out
that in this giant collaboration, the data is hosted at
Oxford. Oxford and the parent company overseeing
this study are sort of like the big players. The
University of Queensland has an agreement with
Oxford. People were looking around at UQ for someone
who could do it. They had all these issues with the
data and they needed essentially someone that could
help out on the machine learning side of things, on the
visualization side of things, on the data pipeline side of
things.
Samuel Hinton: People were just going around saying, "Who has
experience?" One of the people on the project, the head machine
learning investigator, talked to my supervisor and my
supervisor, my postdoc supervisor, said, "Oh, well,
why don't you talk to Sam. He's got previous
experience in all these areas." She talked to me and
then an hour later, she said, "Okay, I want you to lead
this." I said, "Oh boy, this is a lot. Are you sure?" She
said, "Yes." That's been my life for the past couple of
months.
Kirill Eremenko: Yeah, that's really cool. How does this all work? How
big is this whole team? Are you the only lead data
scientist? Are there other lead data scientists, like
other data scientists in different countries, and you're
responsible for Australia? How does this all work?
Samuel Hinton: Yeah. It's a bit of a complicated, I'm not going to say
mess, but it's a very spaghetti-like situation here,
simply because we're dealing with medical data. As
soon as you deal with medical data, there's a whole
bunch of confidentiality and privacy agreements, things
that you have to take into account. Because I was
essentially the first person that got access to the data,
they've now come down and said, "Look, we don't want
anyone else accessing the data."
Samuel Hinton: I'm one of the very few people in the world now that
can get the raw data. One of my jobs is to take this
raw data, run it through a data cleaning
standardization and de-identification pipeline that I
built. Then, distribute those data products to
specifically UQ researchers. We are getting other
universities in. We have, for example, in Brisbane
researchers from QUT. They get added to, sort of, the UQ
system and then they can access the data.
Samuel Hinton: On top of that, we have companies that have reached
out and said, "Hey, we want to help with the machine
learning. We want to help with this or that." We've had
Amazon and IBM and we're working with both of them
right now and then fast.ai, a whole bunch of
companies. The big issue is that it's sensitive data. It's
not something that you can just upload onto Kaggle
and have an open source Kaggle competition. You
can't do that.
Samuel Hinton: There's a lot of people that have offered to help and
they're simply unable to because we haven't got data
sharing agreements with them. We would be in very
hot water if we gave them the data. There are data
science ...
Kirill Eremenko: You de-identify the data?
Samuel Hinton: Yes. The issue is, even if it's been de-identified what
you normally have is then a security team comes in,
takes your de-identified data and they want to see if
they can break it, if they can re-identify patients. The
issue is, that step takes a little while to do. We've had
some preliminary groups at UQ say, like, "These are the
variables that are quasi-identifiers and if you combine
them with social media data, we may be able to re-identify
a person doing this or that."
Samuel Hinton: Until everything's like proven 100% good enough, it's
hard to even share de-identified data. We're moving
towards that section. Obviously, as soon as these
things come in and as soon as there are legal
agreements and everything in the way, it's no longer
just like a one or two-day task. It's back and forth
between legal teams and things slow down.
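Editor's sketch: one common way to reason about the quasi-identifier risk described above is a k-anonymity check, which counts how many records share each combination of quasi-identifier values. The conversation does not say which method the UQ groups used, so this is a swapped-in illustration, and the field names and records below are made up.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same combination of quasi-identifier values."""
    groups = Counter(
        tuple(record.get(name) for name in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical, made-up records; the field names are illustrative only.
records = [
    {"age_band": "60-69", "sex": "M", "country": "AU"},
    {"age_band": "60-69", "sex": "M", "country": "AU"},
    {"age_band": "30-39", "sex": "F", "country": "EE"},
]

# A result of 1 means at least one person is unique on these fields and
# is therefore the easiest to re-identify by joining with outside data.
k = smallest_group_size(records, ["age_band", "sex", "country"])
```

A small k flags combinations that deserve coarser bins or suppression before any sharing is considered.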
Kirill Eremenko: Have you heard of the Netflix Prize on Kaggle?
Samuel Hinton: There's a Netflix Prize? I haven't.
Kirill Eremenko: There was one, like, years ago, when Netflix,
which was, I think, like 2000 ... Oh, my God, I don't
even remember. Years ago,
basically. Netflix went on Kaggle. They
posted like their de-identified data for people to have
this competition. The prize was a million dollars to
build a recommender system. Or the prize pool was a
million dollars to build a recommender system that
predicts the best way possible what movie you want to
watch next, what show. It was successful. They're
going to launch it.
Kirill Eremenko: I think in 2015, they were going to launch the Netflix
Prize number two, but then somebody wrote a
research paper in the US showing that he could
identify the people from Netflix Prize data by
combining parameters in certain ways. A lot of people,
I think, either wanted to or did launch a class action
lawsuit against Netflix for that. How crazy is that?
Yeah.
Samuel Hinton: Yeah. It's definitely a place you don't want to take the
risk.
Kirill Eremenko: Yeah. Hope you're enjoying this amazing episode. I've
got a cool announcement for you and we'll get straight
back to it. Virtual Data Science Conference. Curious?
You've probably heard of DataScienceGO, the
conference that we've been running for the past three
years in Southern California. Maybe you've attended, if
so, it was super cool to have you there. Maybe you
weren't able to attend for the reason of being in a
completely different country, or the flights were too
long, or the timing wasn't perfect. There could be
plenty of reasons why you weren't able to attend. Now,
we're bringing DataScienceGO to you.
Kirill Eremenko: This June, we're hosting DataScienceGO virtually and
you can attend and get an amazing experience there.
Guess what, the best part is that it's absolutely free.
Just head on over to datasciencego.com and get your
tickets today. This will be our very first time running a
virtual event. Nevertheless, we're still going to combine
the three key pillars of fun, amazing talks and
networking into this event. You'll hear from speakers
like John Krohn, Sam Hinton, Hadelin de Ponteves,
Stephen Welch, and many others.
Kirill Eremenko: Plus, you'll be able to network with your peers. This
event is going to be epic on all fronts and we'd love to
see you there. Head on over to
datasciencego.com/virtual and get your ticket today.
The number of seats is limited. We'd love to have
everybody there. For our very first event, we're limiting
the number of seats to make it more manageable.
Make sure to get your tickets today, if you want to be
part of this. On that note, I look forward to seeing you
there. Now, let's get back to this amazing episode.
Kirill Eremenko: Yeah, okay. All right. You get all this data. What is this
data science pipeline? Tell us about it. Of course, by
the way, for everybody listening, none of this is
medical advice. We're going to as much as possible
avoid ... we are going to avoid sharing sensitive
information that Sam cannot share on this podcast.
Most importantly, none of this is medical advice. If you
hear anything about the coronavirus, it's opinion only.
With that caveat, what's this data science pipeline?
What goes into this process of building one? Why do
you need one in this specific case?
Samuel Hinton: Right. There are a few things to take into account when
we're talking about this specific study. The first is that
the database, the system where the data is gathered,
was written by a very smart and very talented UQ
researcher. I won't give you the name because I'm sure
... Respect. I want to respect his privacy and people
will end up emailing everyone over everything. He ...
Kirill Eremenko: I'm sorry, just to add. UQ is University of Queensland
in Australia ...
Samuel Hinton: Yes.
Kirill Eremenko: One of the top universities in Australia.
Samuel Hinton: Yes, good clarification. I tend not to define my
acronyms straight, that comes from my astrophysics
roots. He's made this great database and it's used for a
whole bunch of different medical studies, including the
one that I'm working with which is the COVID Critical
Care Consortium. It was originally named
ECMOCARD, if that rings a bell to anyone listening.
Samuel Hinton: What it means is, it was written in a very generic way. The
doctors go and they get CRFs [case report forms]. Essentially, they print
out a whole bunch of sheets of paper where they write
down the details, and then someone goes through it
and uploads it into this database. The issue is that
there are very few checks done on the database. It was
written to be general, by one person by himself essentially,
and a long time ago too.
Samuel Hinton: It wasn't written this year with all the modern
frameworks. It's a fairly old system. That means that
when the data comes back, we have very little
guarantee on what the data should look like. Dates
don't have to be dates. Numeric values can come
through filled with strings or letters. That's the easy
part to identify because at least we know things
should be numbers. Sometimes it comes through as, for
example, 107, but the 0 in the 107 is actually the degree symbol
from degrees centigrade.
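Editor's sketch: a minimal illustration of the kind of numeric clean-up described above, where a degree symbol has been typed in place of a zero. The set of look-alike characters and the decimal-comma rule are assumptions made for this example, not the study's actual rules.

```python
import re

# Characters assumed (for illustration) to be typed in place of a zero.
ZERO_LOOKALIKES = re.compile(r"(?<=\d)[°ºOo](?=\d)")


def parse_numeric(raw):
    """Return a float, or None when the entry needs a human to look at it."""
    text = raw.strip()
    # Only swap a look-alike when it sits between two digits, so a
    # trailing unit marker such as "37°" is flagged rather than changed.
    text = ZERO_LOOKALIKES.sub("0", text)
    # Assumption: a lone comma is a European decimal separator.
    if text.count(",") == 1 and "." not in text:
        text = text.replace(",", ".")
    try:
        return float(text)
    except ValueError:
        return None
```

Returning None instead of guessing keeps doubtful entries available for the issue list that goes back to the sites.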
Samuel Hinton: There's a lot of weird issues like that simply because
we're mixing European keyboards and non-European
keyboards. Then, even when you get that down, there
are things that haven't been validated. Like, we want
to take patient records every day so that we can track
the evolution. Sometimes, we have two or three
records for the same day. It's like, I've entered the date
wrong, but there's no validation on that.
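Editor's sketch: the missing validation described above, more than one daily record for the same patient on the same date, can be checked after the fact. The field names `patient_id` and `date` are hypothetical, as are the records.

```python
from collections import defaultdict


def find_duplicate_days(records):
    """Map (patient_id, date) to the row indices involved whenever a
    patient has more than one record for the same day."""
    rows_by_key = defaultdict(list)
    for index, record in enumerate(records):
        rows_by_key[(record["patient_id"], record["date"])].append(index)
    return {key: rows for key, rows in rows_by_key.items() if len(rows) > 1}


# Made-up records for illustration.
records = [
    {"patient_id": "P1", "date": "2020-04-01"},
    {"patient_id": "P1", "date": "2020-04-02"},
    {"patient_id": "P1", "date": "2020-04-02"},
    {"patient_id": "P2", "date": "2020-04-02"},
]
duplicates = find_duplicate_days(records)
```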
Samuel Hinton: Then, even if you do all of that, you now have
hundreds of hospitals from dozens of countries and
they use different units for everything. You get a
whole range of numbers coming in. For a lot of the
cases, you know what the units are and you just do
basic unit conversion. Some of the fields don't have
the unit as an input. You have to try and infer from the
ranges what the unit should be. That's tricky to do
because in medical data, things like lymphocyte
counts can span four orders of magnitude in a living
patient.
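Editor's sketch: inferring a unit from the value's range, as described above, only works when the plausible ranges do not overlap. The ranges below are rough illustrative assumptions for body temperature, not clinical reference values from the study.

```python
# Illustrative plausibility ranges for body temperature; assumptions for
# this sketch, not clinical reference values from the study.
PLAUSIBLE_RANGES = {"celsius": (30.0, 45.0), "fahrenheit": (86.0, 113.0)}


def infer_unit(value, ranges=PLAUSIBLE_RANGES):
    """Return the single unit whose range contains value, else None."""
    matches = [
        unit for unit, (low, high) in ranges.items() if low <= value <= high
    ]
    # Zero matches means an implausible value; several matches means the
    # unit is ambiguous. Both cases are left for a human to resolve.
    return matches[0] if len(matches) == 1 else None
```

For a quantity that legitimately spans several orders of magnitude, the candidate ranges overlap and this approach returns None for most values, which is the difficulty raised in the conversation.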
Samuel Hinton: How are you supposed to deal with that? Then, on top
of all of that, because the data is filled in from
someone writing down off a piece of paper, it's highly
incomplete, around 80%. If you just turn this into a 2D
data frame, around 80% is missing. That's a huge
problem for imputation especially because in this
current, like right now, when I'm talking, we don't
have that many records.
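Editor's sketch: the "around 80% missing" figure is the fraction of empty cells once the records are laid out as a two-dimensional table. The rows and column names below are made up so that the arithmetic is easy to follow.

```python
def missing_fraction(rows, columns):
    """Fraction of cells that are None across the given columns."""
    total = len(rows) * len(columns)
    if total == 0:
        return 0.0
    missing = sum(
        1 for row in rows for column in columns if row.get(column) is None
    )
    return missing / total


# Made-up rows: 8 of the 10 cells are empty.
columns = ["ph", "temp", "weight", "height", "lactate"]
rows = [
    {"ph": 7.4, "temp": None, "weight": None, "height": None, "lactate": None},
    {"ph": None, "temp": None, "weight": None, "height": None, "lactate": 1.2},
]
fraction = missing_fraction(rows, columns)
```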
Samuel Hinton: We have hundreds of patients and fewer than 100 if we
count just those where we know whether they
survived and were discharged alive, or whether they didn't
survive and succumbed to coronavirus. With so little
data, how are you supposed to do effective imputation?
We have to have multiple strategies that we then need
to try and vet. All of that needs to happen every day.
We download the data, clean the data, standardize it,
and try out a bunch of things every single day so that
we can go back to the ICUs, back to the hospitals
when we need and say, "Hey, I think you've put this in
wrong here or maybe this is a really cool, really
interesting novel result."
Samuel Hinton: All of this needs to happen in a very quick, a very
automated fashion to make sure that we can get the
results back as quickly as possible.
Kirill Eremenko: Wow. I can just imagine the doctors, it's like a
battlefield for them. They're running around trying to
save people's lives. The last thing they care about or
last thing that's on their mind is to sit down and
properly, carefully input all the data for Mr. Sam
Hinton in Australia.
Samuel Hinton: Yeah. It's a difficult sell, isn't it? Because especially
writing things down on a piece of paper that someone
then has to copy in. It's just not an efficient way of
doing it. It's one of the things where Amazon reached out
and we said, "Hey, you're well-suited to this, gathering
data and using it." Obviously, there's a whole bunch of
privacy concerns when you decide to bring in a large
corporation.
Samuel Hinton: There are all the legal issues there where we have to be
very careful about whether or not they can actually
have the data at the end. The answer is, no, right?
They will help us gather the data and then the idea is
they don't get access to it. Then, it's okay, do we
develop an app? Do we try and set up Alexa so that the
doctors can simply read out the values into their
phone and it will populate the form for them using
natural language processing?
Samuel Hinton: There's a whole bunch of concerns there but even
then, the doctors don't have time to go through and
even just read out what the values are. Countries like
Germany, those that haven't been massively afflicted
yet as in those that haven't broken through their
capacity in the hospital system, are doing things like
getting student doctors and med students to go out to
the hospitals. They just drive from hospital to hospital.
They take the paper and they enter it into the
databases.
Samuel Hinton: They go around and pick out the values and record
them and enter it. There's simply, in many countries,
absolutely no chance of getting the people that are on
the frontlines trying to keep people alive to take a little
break from that to do some data entry. It just doesn't
make sense.
Kirill Eremenko: Yeah. Yeah. Okay. All right. Once you have all this
data processed and cleaned, what you're saying is, you
have to do this every day and there's no way for you to
automate it completely so that all these checks happen
automatically?
Samuel Hinton: Yeah. At the moment, every day we regenerate our
data products. Every day, we regenerate a new list of
issues that we then go back to the clinicians with and
say, "Hey, by the way, this value from this day looks a
bit funky, can you please double check that or is that
a legit value?" Obviously, that gets sent back, not
every day. We don't want to overwhelm the sites but
the more egregious errors, the things that we can't fix
ourselves, get sent back. Essentially, the only time we
do send them back is when we're losing data.
Samuel Hinton: There are some fields that we simply need. For
example, when was the patient admitted to ICU? We
want to track their evolution over their stay in the ICU,
which means, if we don't have when they were
admitted, we don't know at what point they fall in the
timeline, and we can't use that data. It seems a
shame, like all we need is this one variable for you to
fill out and then, we can use the 400 other variables
that you put in for this patient. Then, we go back to
them. Then, the other thing that they want is every
day, they don't just want issues. No one wants just
bad news every day.
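Editor's sketch: why the ICU admission date is a required field. Without it, a daily record cannot be placed on the patient's timeline, so the patient goes onto the issue list instead. The field names and records are hypothetical.

```python
from datetime import date


def day_of_stay(record):
    """Days since ICU admission, or None if the admission date is missing."""
    admitted = record.get("icu_admission_date")
    if admitted is None:
        return None
    return (record["record_date"] - admitted).days


def build_issue_list(records):
    """Patient IDs whose records cannot be placed on a timeline."""
    return sorted(
        {r["patient_id"] for r in records if day_of_stay(r) is None}
    )


# Made-up records for illustration.
records = [
    {"patient_id": "P1", "icu_admission_date": date(2020, 4, 1),
     "record_date": date(2020, 4, 4)},
    {"patient_id": "P2", "icu_admission_date": None,
     "record_date": date(2020, 4, 4)},
]
issues = build_issue_list(records)
```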
Samuel Hinton: We generate daily reports for them. This simply comes
down to some Jupyter Notebooks essentially that are
automatically generated and converted to HTML
documents and they will have interactive plots, where
we can show them basic statistics and demographics
of their patients. We know that COVID-19 affects more
men than women and it affects older people worse
than it affects younger people. We want to keep up
to date the statistics that they have, saying, okay,
what are the risk factors? Is arterial hypertension, like
high blood pressure, a large risk factor or not? How
about smoking? How about obesity, diabetes, all of
these different things?
Samuel Hinton: Then obviously, treatment. One of the big questions
is, what is the difference in outcome when you
look at different treatments? Who are on antibiotics?
Who are on antivirals? Which antivirals are they on?
Which antivirals have different ratios of success versus
failure? All of that is data that we try and generate
every day into an HTML document that we can then
send back to the clinicians.
Kirill Eremenko: Interesting. You said you don't have that many; like,
you already have hundreds of patient records, but few that are
complete where you know the full story. Wouldn't those
insights be statistically insignificant? Like, if you're
inferring from that?
Samuel Hinton: Yeah. It's a big problem, which is that in some cases,
we have the numbers. If you're not looking at the
outcome, if you just want to say, "Hey, what are the
demographics of people that are being admitted?" You
don't know the outcome. We can say some things there
but then, we can't say a lot. This is one of the reasons
why we have to be very careful about what we give to
clinicians and medical doctors because we don't want
to mislead anyone. We don't want to cause any
unnecessary harm.
Samuel Hinton: We have, for example, some clear trends in, let's
say, pH: your blood pH levels, broken down into
those that survived and those that didn't. At the
moment, the trends look very different but we can't
give that to the clinicians because if you look under
the hood, at the number of patients that are being
used to generate those trends, it's a tiny, tiny number.
If we give that away, we aren't confident that we've
accounted for the difference in country and ethnicity
and all these factors that differ across the patients
because we don't have a representative sample.
Samuel Hinton: It's something that we always have to keep in mind.
There are currently a whole bunch of data products
and things that we are simply hiding so that we can
see them when we're developing products like the
dashboard, like the daily report. We can't make them
public because the chance that they would mislead
people is simply too high because without the
knowledge of statistics that would help inform the
validity and the confidence of those trends, it's very
easy to make a mistake.
Kirill Eremenko: Got you, man. I'm so glad that you're doing this. Out
of all the people in the world because I remember in
our first podcast, you stressed very strongly that given
the 95% rule for frequentist statistics, the P value of 0.05
is not sufficient. That means, like ... this was your
quote and I've used it many times, 1 out of 20 research
papers out there is wrong. Every 20th research paper
is incorrect simply because we agree that 95%
confidence is sufficient.
Kirill Eremenko: I can just imagine how rigorous you are about not
misleading people, and misleading doctors here could
cost people's lives. You have to be very careful.
Samuel Hinton: Yeah. It's so easy to happen because I think we have
around 450 variables. Imagine if 1 in 20 of them are
wrong, and we've drawn conclusions, that's almost 20
different hypotheses that we could incorrectly give if
we just decided, "Hey, P value 0.05, good enough, ship
it out." You'll notice that if you look through all the
papers that are currently being published on COVID-19,
and especially some of the early ones, they're done
on cohort studies of three, four or five people.
Kirill Eremenko: Five people.
Samuel Hinton: Yeah. There was a study in The Lancet with a patient
count of five. It's like, okay, well, it's good. You got to
get these things out. There's no time to sort of dilly-dally
on it, but at the same point, like, can we trust it?
You don't know.
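Editor's sketch: the arithmetic behind the 450-variable point made above. With every variable tested at a threshold of 0.05 and no real effects present, the expected number of false positives is 450 × 0.05 = 22.5, in line with the "almost 20" quoted. The Bonferroni correction shown is one standard remedy; the conversation does not say which correction, if any, the study uses.

```python
def expected_false_positives(num_tests, alpha):
    """Expected false positives when every null hypothesis is true."""
    return num_tests * alpha


def chance_of_any_false_positive(num_tests, alpha):
    """Assumes independent tests and that every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / num_tests


num_variables = 450  # figure quoted in the conversation
alpha = 0.05

expected = expected_false_positives(num_variables, alpha)
any_false = chance_of_any_false_positive(num_variables, alpha)
per_test = bonferroni_threshold(num_variables, alpha)
```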
Kirill Eremenko: Yeah. Yeah. Interesting. Okay. What happens next?
You do these reports back and forth, how's the
workflow? Do you guys like have meetings? Do you ... I
don't know. Do you have like a vision? Is there some
leadership?
Samuel Hinton: I have had around six hours of meetings today. Yes,
there are meetings. There are meetings between the
different companies that are trying to help out. There are
weekly meetings with the PIs and myself to try and
determine directions.
Kirill Eremenko: What's a PI?
Samuel Hinton: The project investigator.
Kirill Eremenko: Okay.
Samuel Hinton: One of the people leading the project. There are
meetings every Thursday with the clinicians. There are
meetings every Friday with the UQ researchers, who
are trying to apply models onto it. In terms of where to
go in the future, hopefully less meetings. I can't see
that actually happening. There will always be too
many meetings. What we want to do is, once we get
more data, once we can actually be a little bit more
confident in the results that we're getting, hopefully,
we can do some interesting things with it.
Samuel Hinton: At the moment, we've been doing things like
generalized linear models and Cox regression and a
bunch of other little statistical tests to try and answer
some of the queries that the clinicians have. The
other thing we want to do is use unsupervised learning
to see if we can cluster the patients because one of the
current questions with COVID-19 is, are there
separate phenotypes? Are there multiple variants of the
virus going around? Do they present differently?
Kirill Eremenko: Like mutations?
Samuel Hinton: Yeah, essentially. Yeah. There's been some marginal
evidence so far published in papers that yeah, it looks
like there might be multiple phenotypes. We want to
see, can we cluster our results? Do our results
indicate that there might be subgroups? Again, hard to
do with only a few hundred records. Then, we also
want to figure out things like causal modeling. This is
obviously a big issue especially in the medical field
which is, let's say, you notice a trend in some sort of
variable. We don't know is that trend because of
COVID? Is that trend because of the medication? Is
that trend because of any one of 400 billion different
things? You're not quite sure.
Samuel Hinton: You want to see if you can construct causal nets to
determine exactly how the conditional probabilities in
your model lie, to see what is actually driving these
trends. Of course, it's extremely complicated especially
in medicine where each patient gets treated
individually. They get treated based on how they're
presenting. It's not like you have a control group that
just gets run through with the same treatments in the
same way. If someone presents differently, they get
treated differently and it's so hard to standardize the
results.
Kirill Eremenko: How are you going to do that? That's a very important
question not just in this application, but in other areas
of life whether it's business or marketing or product
supply chains, like weather even; there are always going
to be these external factors. As we know, correlation
doesn't imply causation. Do you have any tricks you
can share that you think might work?
Samuel Hinton: The trick that we're trying to rely on right now is one
that isn't applicable to anyone else, which is we're
going back to the clinicians. There are obviously
hundreds of years of medical advice and medical
studies out there that we can try and make use of to
say in other different ... if you don't take COVID, if you
take flu or SARS, like viruses, how do they normally
present? What are the non-causal features more so in
those different pathogens or viruses?
Samuel Hinton: On top of that, I can't think of a good way to explain it.
It mostly just comes down to being a very thorny issue
that we haven't fully solved. There is no ideal solution.
Obviously, you can use clinicians to help inform that
but you can also just use a bunch of different models.
One of the things is, we want consistency in our model
outputs across models. You don't just want to run
some stupid random forest to get a result and just
ship it out. You want a whole bunch of different
models to agree so that you have confidence in it.
Samuel Hinton: Then, you want to use explainability and
interpretability techniques so that for every model that
you've done, you can actually identify why that model
is saying the things that it does. This is something like
Shapley values or just looking at the weights of each
decision in a decision tree. What is contributing
to the final answers? So that you can try and hopefully
get a consistent idea of the causal effects in all of your
models. You hope that they agree.
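Editor's sketch: one crude way to check whether several models "agree" is to compare which features each one ranks highest. This is an illustration only; the importance scores below are made up, the model names are placeholders, and comparing raw importances across different model families is itself a simplifying assumption.

```python
def consistent_top_features(importances_by_model, k):
    """Features that rank in the top k (by absolute score) for every model."""
    top_sets = []
    for importances in importances_by_model.values():
        ranked = sorted(
            importances, key=lambda name: abs(importances[name]), reverse=True
        )
        top_sets.append(set(ranked[:k]))
    return set.intersection(*top_sets) if top_sets else set()


# Made-up importance scores for illustration.
importances_by_model = {
    "random_forest": {"age": 0.5, "ph": 0.3, "bmi": 0.2},
    "logistic_regression": {"age": 1.2, "ph": -0.1, "bmi": 0.9},
}
agreed = consistent_top_features(importances_by_model, k=2)
```

Features that appear in the intersection are candidates worth discussing with clinicians; features that only one model favours are treated with more suspicion.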
Kirill Eremenko: Yeah. For that purpose, do you think a neural network
could work?
Samuel Hinton: It could, it could and we will have neural networks
especially with the patient evolution. Our time series
data is well suited to a recurrent neural network,
something like a long short-term memory network.
The main issue is, we can't train any of these at the
moment because we only have a few hundred data
points especially...
Kirill Eremenko: I mean ... Sorry.
Samuel Hinton: ... we're going to like...
Kirill Eremenko: I was going to say ...
Samuel Hinton: I was going to say ...
Kirill Eremenko: You go.
Samuel Hinton: I was just going to say, for an LSTM network or a
deep network, you need a lot of data points and it's
very difficult in our case to create new data. Data
augmentation techniques are very difficult to do on
data that is mostly incomplete. It's difficult for us to do
something like a nearest neighbor imputation because
the dimensionality of our models is in the hundreds
and we only have hundreds of data points. Your
nearest neighbor may be a very great distance in
hyperspace from you. It makes it difficult because you
need to impute. Then, you need to try and augment
your data without biasing your models.
Kirill Eremenko: Okay.
Samuel Hinton: How to do that with only a few hundred samples for a
novel disease? That's tough.
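Editor's sketch: why nearest-neighbour imputation struggles when the number of dimensions is comparable to the number of data points. With uniformly random points (an assumption made purely for illustration, not the study's data), the nearest neighbour in a few hundred dimensions is almost as far away as an average point, so it carries little extra information.

```python
import math
import random


def nearest_to_mean_distance_ratio(num_points, num_dims, seed=0):
    """Ratio of the nearest-neighbour distance to the mean distance,
    measured from the first of num_points uniform random points."""
    rng = random.Random(seed)
    points = [
        [rng.random() for _ in range(num_dims)] for _ in range(num_points)
    ]
    origin = points[0]
    distances = [math.dist(origin, other) for other in points[1:]]
    return min(distances) / (sum(distances) / len(distances))


# A few hundred points in 2 dimensions versus a few hundred dimensions.
low_dim_ratio = nearest_to_mean_distance_ratio(300, 2)
high_dim_ratio = nearest_to_mean_distance_ratio(300, 300)
```

A ratio near 0 means the nearest neighbour is genuinely close; a ratio near 1 means "nearest" is barely closer than average, which is the high-dimensional case.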
Kirill Eremenko: Tough. For neural networks, in terms of
interpretability, like, even if you got a neural network
that predicts everything well, assuming you somehow solve
the problem of the small data set, it's
really hard to interpret why exactly these
neurons behave in a certain way. Wouldn't that be a
roadblock to using neural networks for this problem?
Samuel Hinton: Yes. No, for sure. That's why we're trying to get as
much expertise as possible to come in, people that
have done similar things before. I know you've
seen things like with convolutional neural networks.
There are ways of breaking down the features so that
you can try and visualize them. Essentially, we need
techniques like that that apply in general. It's very
hard to deal with the neural network, especially as the
depth starts to increase. Even if you try and visualize
what neurons are lighting up, like how do you put that
into something that a human can understand?
Samuel Hinton: It's just a massively complicated linear algebra
function, which we have essentially no intuition over
and it's difficult. While some potential partial solutions
exist for some specific variants of neural networks like
CNNs, I'm not sure, I don't know of a generalized
solution. If someone out there listening to this knows a
generalized solution to neural network interpretability
and explainability, please let me know.
Kirill Eremenko: That's all. Okay, got you, for sure. If anybody listening
has any ideas, I think at this stage that will be very
useful. We'll share Sam's contact details in the show
notes as if he's not getting enough meetings already.
Samuel Hinton: You should have seen, I did an interview with ABC
Radio National last week and it went live two days ago.
I have been flooded by well-meaning people offering
support. I had one lady come in and say, "Look,
I'm retired. I'm just isolating in my home in the Blue
Mountains. I have nothing to do. I'm an ex-researcher
in agricultural science, have a statistics background
as well. Do you need a personal assistant to help you
manage all this?"
Kirill Eremenko: Wow.
Samuel Hinton: I was completely floored by her response and all the
other positive responses we have received. I said no to
her, of course, because the university actually listened
as well when we said we're drowning and we now have
a new project manager, her assistant, a new
administrative assistant on the UQ side of things.
Luckily, we are getting the support that we need. Still,
the response is large.
Kirill Eremenko: That's awesome, man. That's awesome. You are doing
a fantastic job like this work can potentially help stop
this or slow it down. Hats off to you, it's really cool,
really cool. The university might be listening to this
one as well, is there anything else you need? Let's do a
wish list.
Samuel Hinton: A wish list. I wish I could get into America and start
the job that I accepted many, many months ago. I got
offered a very nice fellowship at Lawrence Berkeley
National Lab. I'm supposed to have had my visa
interview and everything planned to fly over there with
the wife and all canceled indefinitely. Who knows, I'll
be at UQ for the foreseeable future and we'll see how
long COVID takes to be consigned to the pages of
history.
Kirill Eremenko: Man, you got married. When did you get married? I
completely, sorry, I missed that.
Samuel Hinton: April 1st.
Kirill Eremenko: Wow. Congrats.
Samuel Hinton: That is our anniversary and we decided it's the best
date to get married because half my friends on
Facebook, especially because of COVID-19, the
ceremony is limited to the celebrant, two witnesses
and me and my wife.
Kirill Eremenko: Yeah.
Samuel Hinton: Five people. It's not like a big affair. When I posted
pictures saying, "Hey, by the way," a lot of people
thought it was a very elaborate joke, which I
encouraged for a solid week until I was like, "Yeah, no,
it's actually real." That was the best I think. I saved so
much money on the ceremony. So much money on the
reception. The honeymoon was a bit lackluster. We sat
down, opened Google Maps and just went through Street
View in a few countries. We're like, "Yeah, those look
nice. We'll visit them one day."
Kirill Eremenko: Wow. Okay. How long have you been together?
Samuel Hinton: A while. It's a very short marriage. I think we met 2018
at the end of it, maybe. I'm not quite sure. My memory
is horrible. If you ask her, she will know like the exact
date, the exact time and exactly. Me, I'm just like a
couple years ago is fine.
Kirill Eremenko: Yeah. Awesome, man. Congrats. That's really fun. Very
cool. Awesome. Okay. Hopefully, once this all settles
down, you'll do your work there. You've got your PhD,
right? This is your postdoc?
Samuel Hinton: Yes. I've got the PhD, but in amongst the
COVID-19 stuff it still hasn't been awarded. It was
like submitted and sent off for review so long ago, but
everyone has far better things to do. I'm currently
sitting here, not as a doctor [crosstalk 00:35:44].
Kirill Eremenko: PhD-less.
Samuel Hinton: ... yeah, PhD-less, working like two different postdoc
jobs, pulling my hair out, making courses on the side,
just waiting for my reviewers to eventually come back
and say, "Yeah, it's all good."
Kirill Eremenko: Got you, man. Wow.
Samuel Hinton: They just need to get off their asses. Get me back my
thesis. I [crosstalk 00:36:03] that.
Kirill Eremenko: Got you. Okay. Thank you for the run down on
COVID. Hopefully, things go well there and we all
support you. I'm sure our listeners, please show Sam
some support, send him some nice emails if you are
supporting him. Even if you can't do anything to help,
it's good to know you're listening.
Kirill Eremenko: Speaking of courses, congratulations on launching
your second course, man, like number two. First one
was Python for Statistical Analysis about six or so
months ago. Second one now is, Python for Data
Manipulation. The irony is that's exactly what you're
doing for COVID.
Samuel Hinton: Yes. I'm lucky that everything just fits in so well
together. Yeah, I thought after seeing all the comments
on the stats course, that the biggest skill that the
people taking my course were lacking was the ability to
use libraries like pandas to streamline all of the pre-
processing stuff in their analysis because no one
wants to spend 13 hours crunching numbers to do
half an hour of machine learning or statistical
analysis. I was like, "You know what, okay, pandas it
is. I'll make a crash course for that. Show everyone the
easiest ways and the most efficient ways of doing all
the common tasks."
Samuel Hinton: I hope it's been useful for those that have signed up.
I've got some good reviews so far. Some people have
reached out and said that they really liked it, which is
always pleasing to hear because you don't want to
make it and then have people come back
and say, "That was terrible."
Kirill Eremenko: Yeah.
Samuel Hinton: I'm happy. They're happy. I think we're all happy in
isolation.
Kirill Eremenko: That's good, man. Yeah. Just speaking of reviews, you
have some of the highest reviews we've seen across all
the instructors we work with. Both courses have 4.6
stars out of 5, stable, which is really hard to maintain on a
massive platform like Udemy. What's your method?
How do you do it? Maybe, there's people out there
looking to create a course these days, like maybe you
can share some insights. How do you get such great
feedback all the time?
Samuel Hinton: Honestly, I'm not too sure. I think one of the things is,
I keep in mind what I want as a student, which was a
few years ago now. I always remember listening to my
lectures, the online recordings, and just wanting to
un-enroll. Some of them would go on and on about stuff
I really didn't care about, pages upon pages of just
talking at me before getting down to anything useful.
Samuel Hinton: I decided if I ever made a course, it wouldn't just be
talk. I mean, I would talk about the code that I'm
writing in front of you and try and keep it practical so
there's something for you to do whether it's run the
code in parallel with me or just read over what I'm
writing, or listen. Just not droning on. I try and get
to the good stuff quickly, but lectures always end up
taking far longer than I thought.
Samuel Hinton: I remember I recorded a lecture just about histograms
and the first record, it took about three minutes.
Histograms are pretty simple. There's not much to talk
about. Then, I got a ton of questions from people. They
were saying, "Hey, what about this use case? What
about this here? My code isn't working here." I
realized, even with such a simple concept, there are a
whole bunch of little caveats or things that people
don't quite understand intuitively.
Samuel Hinton: I went back and re-recorded it. It became like a 15-
minute video, but people seem to like it. Those that
already knew were able to watch it at double speed
and sort of skip to the parts that they needed. Those
that had never seen it before managed to get all the
relevant information such that they didn't try out the
code to get an error and then have to hit up Stack
Overflow for half an hour afterwards, trying to figure
out what on earth this keyword that they needed
meant. I'm not sure. I just try and keep that in mind
but beats me.
Kirill Eremenko: Yeah, man. That's a good approach, to overdeliver,
because once, at one of our live events, I
met a student and she told me like, "Oh, it's so weird
to hear your voice in real life because I've been
listening to you online all the time and you sound so
different." I'm like, "How do I sound so different?" She
said, "I listen to you on double speed. I've never." Yeah,
so a lot of people do that. I encourage people to do
that. I would rather put more into a lecture than less
because you can just listen on double. If there's less,
then people who are not as familiar with the topic,
they will fall behind and we don't want that.
Samuel Hinton: Yeah. There's a personal side of it too, which I'm not
sure if this is of too much interest for those listening. If
you're making a course, so we release the stat course,
the Python for statistical analysis course for free, for a
little while during COVID until we had to stop making
it free because in that one week we did free, I got
42,000 new students. That's more than had been in the
course the entire time it's been up. A huge amount.
The issue was okay, well, I have two jobs at the
moment, plus the newly released course and now I
have 42,000 students who ask questions.
Samuel Hinton: Even though they got the course for free, I'm not going
to ignore their questions. I'm going to go in there and
answer them to the best of my ability, but it takes
time. If you have these short lectures that you think
are like, "This is really efficient," 30 seconds and this
topic is done. You haven't been comprehensive, well,
people will just ask you about the things you haven't
covered. Suddenly, instead of spending 10 minutes
recording an additional one minute in your lecture,
you're spending 10 hours responding to the same
question 400 times and it's just not efficient.
Kirill Eremenko: Yeah. I agree. Tell me this, so in this recent
course you recorded, Python for Data Manipulation, you
said pandas. Are you using pandas for the COVID
analysis or are you using some other tool?
Samuel Hinton: Yes, no, we're exclusively using pandas essentially for
the data pre-processing step, to the point where I can't
think of a single function in amongst the pipeline that
I threw together that doesn't make use of pandas in
some way. Date times are best handled in pandas.
Pandas has categorical features, which is amazing.
Everything is pandas. That's not going to change. It's
just such a convenient tool.
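The two pandas features mentioned here, date-time handling and categorical columns, can be sketched briefly. This is an illustration only: the column names and values are hypothetical and are not taken from the study's pipeline.

```python
import pandas as pd

# Hypothetical records; column names and values are illustrative only
raw = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "admission": ["2020-03-01", "2020-03-05", "not recorded"],
    "discharge": ["2020-03-11", "2020-03-09", "2020-03-20"],
    "ward": ["ICU", "general", "ICU"],
})

clean = raw.assign(
    # Unparseable dates become NaT instead of raising an error
    admission=pd.to_datetime(raw["admission"], errors="coerce"),
    discharge=pd.to_datetime(raw["discharge"], errors="coerce"),
    # Categorical dtype stores each distinct label once
    ward=raw["ward"].astype("category"),
)

# Subtracting datetime columns gives timedeltas; .dt.days extracts whole days
clean["stay_days"] = (clean["discharge"] - clean["admission"]).dt.days
print(clean.dtypes)
```

Using `errors="coerce"` is a design choice for messy inputs: bad values are kept as missing so they can be counted and handled later instead of stopping the run.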
Kirill Eremenko: That's so cool. It's a very vivid example of practicing
what you preach. I love that it happened in this order
that you first recorded the course about pandas and
data operation. Now, you're actually using those same
tools. It's a great testament to that these are applicable
tools in industry, in medicine, in whatever, like
emergency situations like this, you know these tools,
you go and use them right away. Really cool. Really
cool.
Kirill Eremenko: What else that I wanted? What I wanted to ask you
about is, I've had this question, so since our last chat
I've been killing myself over like breaking my head
because I didn't ask you. I was so like tempted
afterwards like I should have asked this question.
We're talking about Bayesian inference. We're
comparing Bayesian inference to frequentist statistics,
Fisher and his thing. By the way, I don't know if the
listeners listening to this, I actually read that Bayes
was the 18th century and Fisher, as far as I
understand, Fisher was like 1920s, like early 20th
century.
Kirill Eremenko: Fisher didn't like Bayes, right? He created his own
approach to statistics and so on, which we all use now
and which is taught at school, the P values and things
like that. Fisher actually, interestingly enough,
from what I read, tried to prove not that smoking
causes cancer, but that cancer causes smoking. Speaking
of the correlation causation, like because it's a P value,
right? The chart is there. You can run all the tests.
You don't know, you have to have some additional
knowledge to know which way it works. That's like a
side story.
Kirill Eremenko: The question I wanted to ask you, you were talking
about Bayesian inference and you were talking about
prior probabilities, posterior probabilities, if I'm getting
the names right, and how things are interpreted. You
give this lovely example, by the way, I highly
recommend to listeners to check out the previous
podcast, I'll dig up the episode number and we'll
mention in the show notes. Fantastic episode and you
gave us a great example of like the sun exploding.
Kirill Eremenko: You said, "Okay, so if we have this device on earth that
is looking at the sun and predicts all of a sudden that
the sun's going to explode in the next hour, we can
take all of the prior probabilities. We've seen that the
sun hasn't exploded like prior knowledge that we had.
Sun hadn't exploded in billions of years. Most likely, if
we account for that, then the probability actually goes
down quite a lot." Right? Do you mind repeating or
something like that, the example?
Samuel Hinton: Yeah. The example was there's a box on earth, where if
you push it, like you push a red button on the top, it
will tell you whether or not the sun is going to explode
or not in the next 10 seconds. When you press the
button, what it does is it tosses two dice. If you get one
on both the dice, it just tells you that the sun is going
to explode. Otherwise, it returns the truth. The
frequentist walks up, presses the button, gets
unlucky, it rolls two ones.
Samuel Hinton: They know about the dice, but they say, "Two ones,
that's 1 in 36 chance. That's less than a P value of
0.05. We have a significant result that the sun is
about to explode," and they run off to publish. In the
background, the Bayesian statistician is just sitting
there shaking his head trying to bet that it won't.
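The Bayesian's side of the story can be worked through numerically. The sketch below follows the box as described: it says "explode" whenever the sun really is about to explode, and also lies with probability 1/36 when it is not. The prior of one in a trillion is an assumed, purely illustrative number, and the function name is hypothetical.

```python
from fractions import Fraction

def posterior_sun_explodes(prior):
    """P(sun explodes | box says it will), using Bayes' theorem."""
    p_yes_given_explode = Fraction(1)      # truth or lie, the box says yes
    p_yes_given_fine = Fraction(1, 36)     # only a double one makes it lie
    evidence = prior * p_yes_given_explode + (1 - prior) * p_yes_given_fine
    return prior * p_yes_given_explode / evidence

# Assumed prior: a one-in-a-trillion chance of explosion in the next 10 seconds
posterior = posterior_sun_explodes(Fraction(1, 10**12))
print(float(posterior))
```

Even after the box says "yes", the posterior stays tiny because the prior is so small; the reading raises it by a factor of about 36 and no more.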
Kirill Eremenko: Yeah, because he is using or she is using the Bayesian
inference, right, which takes into ...
Samuel Hinton: Yes.
Kirill Eremenko: Can you tell us a bit about this prior probability?
Samuel Hinton: Right. Bayes' theorem is that the posterior, which is
the likelihood, the whole likelihood.
Kirill Eremenko: The end result.
Samuel Hinton: Yeah, so the end result is a combination of likelihood.
Yeah, you're right. I shouldn't say likelihood when I'm
talking about the posterior. It's a combination of the
likelihood and the prior. The likelihood is, what is the
chance of getting the data, given our model. Then, the
prior is what's just the flat chance with our current
information of that model? If you combine all of those,
it's proportional to the probability of your model given
the data, which is different to the likelihood.
Samuel Hinton: If I speak the math very quickly, it's saying that the
probability of theta given D is proportional to the
probability of D given theta [inaudible 00:46:49] times
by the probability of theta, theta being the model. The
idea is that in frequentist statistics, we work with the
likelihood and that's fine. That's good. That's what you
need to do. Then, when you look at Bayes factor, you
also add in the prior, which is your prior, past,
existing information.
Samuel Hinton: There is another part, the whole thing is a fraction. On
the bottom is what we call the evidence. Let's not get
into that, because that's a much more conceptually
difficult thing to talk about without actually having
diagrams or being able to write any math. Posterior is
proportional to the likelihood multiplied by the prior.
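Written out, the relationship described here is the standard form of Bayes' theorem, with theta as the model and D as the data. The term on the bottom of the fraction is the evidence, which does not depend on theta, which is why the posterior is proportional to likelihood times prior.

```latex
\underbrace{P(\theta \mid D)}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{P(D \mid \theta)}^{\text{likelihood}}\;
        \overbrace{P(\theta)}^{\text{prior}}}
       {\underbrace{P(D)}_{\text{evidence}}}
  \;\propto\; P(D \mid \theta)\, P(\theta)
```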
Kirill Eremenko: Got you. What I was wondering since then, like
literally, I think we hung up and this question popped
to my mind or it was like towards the end of the
podcast, we're running out of time. Anyway, so do you
know the turkey paradox?
Samuel Hinton: No, no, you're going to have to tell me.
Kirill Eremenko: It's simple. It's super simple, but it's not mathematical.
It's got nothing to do with Tukey, the mathematician
with the, what's it called, T test, I think. It's just about
a turkey like an animal. The turkey is born. It's a bit
scared of everything at the start. Then, the farmer
comes along or whatever. The butcher comes along
and feeds it some corn and it says, "Wow, I got some corn
from this butcher. That's amazing. Okay, well, maybe
we can be friends. What's the likelihood of somebody
giving me corn for free?"
Kirill Eremenko: All right? Then, day passes, two days, three and every
day he's getting corn. Then, maybe a month passes,
six months, a year, I don't know how long turkeys are
raised for. Every single day, he's getting this corn. The
prior that's part of the probability, yeah, the prior,
right, it's growing. It's like, "There's evidence that he's
my friend, he's my friend." If you apply Bayesian
inference, the probability of the butcher butchering the
turkey in the turkey's mind is going down all the time
because all the evidence it's seeing is like with the sun,
it's not blown up. I haven't been hurt by this butcher.
Kirill Eremenko: I apologize to the vegans out there, but at some point,
the butcher comes and slaughters the turkey for
Thanksgiving or for some other thing. This really messes
with my head, just like you have the sun example on
one hand, but with the turkey example, the whole
Bayesian inference goes down the drain, because the
result is inevitable, like it's going to happen. I wanted
to get your thoughts on that. How do you apply
Bayesian inference? Or what does that say about
Bayesian inference?
Samuel Hinton: Nothing. I mean, in this case, the Bayesian inference is
perfectly fine. On any given day, the turkey is very
likely not going to die until the day that the butcher
decides he's had enough with that thing gobbling up
all his bread. That means that Bayesian inference has
served the turkey correctly for every day but one, which
is a lot of being served correctly.
Samuel Hinton: It's only an issue in our heads because we know that
the butcher is coming. We have access to hidden
information. Our information is different to the turkey's.
We see this and we go, "Oh man, the probability of the
turkey being butchered is so low from the turkey's
point of view." It's like, "It is," from our point of view.
We know it's coming. That's just because our priors
are different. It serves the turkey well, like as
everything does, until it suddenly stops working. If the
turkey lives for a few years, it served it very well for a
very long time.
Kirill Eremenko: Got you. Our priors are different. I think that's a key
that we've seen hundreds of millions of turkeys prior
to that and we know that their result 99.999% is this.
Samuel Hinton: Yeah, precisely. Our conditional probability is
conditioned on the knowledge that we know the
butcher is going to butcher.
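One way to make "our priors are different" concrete is a small sketch under a stated assumption: the turkey treats each day as an independent trial with an unknown slaughter probability and starts from a Beta prior. Neither speaker specifies this model; it is an illustration, and the function name is hypothetical. With a Beta(a, b) prior and n safe days observed, the predictive probability of being butchered tomorrow is a / (a + b + n).

```python
from fractions import Fraction

def p_butchered_tomorrow(safe_days, prior_a=1, prior_b=1):
    """Predictive probability of slaughter tomorrow after `safe_days`
    uneventful days, given a Beta(prior_a, prior_b) prior."""
    return Fraction(prior_a, prior_a + prior_b + safe_days)

# The turkey: flat Beta(1, 1) prior, belief shrinks with every safe day
for days in (0, 30, 364):
    print(days, p_butchered_tomorrow(days))

# An observer with a much more pessimistic (assumed) prior sees the same data
print(p_butchered_tomorrow(364, prior_a=365, prior_b=1))
```

The data are identical in both calls; only the prior changes, and so does the conclusion.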
Kirill Eremenko: Yeah. Interesting. Yeah. That's a good feature of
Bayesian inference. Not a feature, the more knowledge
you have, the more accurate your prediction will be.
On that note, do any industries or businesses or I
don't know, applications actually use Bayesian
inference these days? I've heard of a few, but what's
your knowledge in this space? Because it looks like
everybody is using frequentist statistics, whereas
Bayesian has a place as well.
Samuel Hinton: I think it's difficult to say because it's easy to mix
things up. One of the giveaways of frequentist statistics
is when someone starts talking about P values. We
generally don't do that in Bayesian statistics. If I use
Bayesian statistics and I calculate some variable, I
would say that, X has been detected at 3.8 sigma
confidence or similarly. I'll say, you would use a P
value in frequentist statistics, but that doesn't
mean that someone using frequentist statistics can't
incorporate prior knowledge.
Samuel Hinton: They just do it under a different formalism. If someone
says, "Hey, are we doing Bayesian technique?" I need
to sit down and say, "Okay, well, how have you
formulated your model? What prior information do you
have? How is that being incorporated?" It's very easy to
try and sneak that information into the likelihood.
That's fine to do in some context, but it does give you
certain different mathematical properties of your
outputs. It's hard to say.
Samuel Hinton: A lot of the cases, you do use Bayesian-like techniques
or using prior information almost everywhere you go.
Every time that you've done something with deep
learning, you may not have run with a Bayesian
neural network, which are things and they're
wonderful and you should check them out. The fact
that you've trained on 10 million images means that
you're incorporating prior information already. It's just
not under the formalized Bayesian statistics headline.
It's sort of a blurred line that's difficult to actually draw
in the sand.
Kirill Eremenko: Okay, okay. Interesting. Thanks. Thanks for the
rundown. You should create a course on Bayesian
inference. I'll happily take it to learn ...
Samuel Hinton: Yeah. I thought about it, modeling, how to fit models,
whether you're using different MCMC, so Markov
chain Monte Carlo processes or similar. Because
everyone has a model and it's a lot easier to write a
model than it is to correctly fit it to the data and draw
the right inferences from it. That's what I do a lot.
Maybe one day when there's enough interest, I'll write
up and record a course on that. It all depends on what
people want. Everyone get keen for model fitting and let
me know.
Kirill Eremenko: Speaking of what people want, we have a very cool
surprise announcement. Sam is joining us as a
speaker on the Advanced Day at DataScienceGO
Virtual. DataScienceGO Virtual is happening at the
end of June or start of July. We're still deciding on the
date. By the time this goes live, you can find out the
date for sure it's available at datasciencego.com. Get
your tickets there.
Kirill Eremenko: Sam will be joining us as a speaker. What will you be
talking about? Actually, a workshop. What workshop
is it going to be?
Samuel Hinton: Probably something on data science pipelines. Given
that I've spent the past few months writing a few of
these. Years before that, I've been doing them for an
astrophysics context. It seems smart to finally
formalize that and write up a workshop. I've given
workshops in the past on different topics so it'll be
good to add another notch to the belt and finally write
up everything that I've been doing on this.
Samuel Hinton: The idea is quite simple, I think, which is, no data
scientist wants to be doing data cleaning. No
one particularly likes cleaning and standardizing data.
How can you write a pipeline in the easiest, most
flexible and most extensible way possible to streamline
all of that, to get you either your data products as
quickly as possible or to go through and generate not just
the data products, but then do the machine learning
and validation on them to get you not data products,
but now machine learning products or business
intelligence products at the end?
Samuel Hinton: The less time people spend sort of screwing around
playing with the data, the more time people can spend
actually getting insights from the data.
Kirill Eremenko: Absolutely. The way this topic came about is that we
ran a survey and over 1,700 people interested in
attending DataScienceGO Virtual completed the
survey. Among other advanced practitioners,
specifically, the most popular topic was data science
pipelines by far. The next topics were still popular, but
there was a huge, huge difference in the number one
and number two topics.
Kirill Eremenko: Why do you think data science pipelines is so in
demand right now among specifically advanced
practitioners?
Samuel Hinton: Exactly like what I said, no one wants to spend their
time doing it. Data scientists spend most of their time
not doing data science.
Kirill Eremenko: Cleaning data, right?
Samuel Hinton: Yeah. It's an awful waste of time. It's not a fun job. It's
not a rewarding job. You get a data product at the end
and now you can start your real job, just getting
insight and crunching the models down to actually
extract useful information and being able to do
something. Like we've done with the COVID study,
where every day we get a data refresh and that
happens automatically at 6:00 a.m. It's kicked off, I
don't do anything.
Samuel Hinton: Then, at the end of that pipeline and it takes about five
minutes to run, we have data products that have been
uploaded to secure sites. We have reports available for
people. We have an interactive dashboard. This isn't
currently in but I have the data, we use to take it out.
We will have machine learning products that have
been refreshed each day because you don't want to go
through and say, "Hey, we've got a new data set. We've
got a few extra records.
Samuel Hinton: "I'll just manually run these 30 models that are thrown
together and re-compare them." It's like, no, you want
to press a button, go off, have a little nap, come back
and have your results there.
Kirill Eremenko: Yeah, yeah, absolutely. Maybe handle some exceptions
at the most that [crosstalk 00:57:19].
Samuel Hinton: There's always exceptions, isn't there?
Kirill Eremenko: Yeah. Okay. How do you teach data science pipelines?
Give us a teaser. What does a workshop on data science
pipelines look like?
Samuel Hinton: Probably a lot of code. There's no way around it.
Maybe one or two slides to try and illustrate the topics.
I guess, the way that I have it in my head at the
moment is it's essentially a collaborative coding.
Everyone's coding their own thing. I will have my pre-
done version and then people can deviate from that as
they will. Probably something like Google Colab or
JupyterLab in some instance, just to give everyone
the basics.
Samuel Hinton: How you can throw these things together? How you
can chain all these methods in a robust way? Then,
how you can tie that into your machine learning
product? Hopefully you start with here's a bunch of
raw data files. Then, at the end, you press a button
and out come all your products that you want.
Obviously, the way that this has to be done, the
workshop, may be different to how people do it in
industry.
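The idea of chaining small steps so that raw files go in and products come out can be sketched with pandas' `DataFrame.pipe`. This is a minimal illustration and not the workshop material: the step functions, column names and sample data are all hypothetical.

```python
import pandas as pd

def standardize_columns(df):
    """Trim, lower-case and snake_case the column names."""
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

def drop_empty_rows(df):
    """Remove rows where every value is missing."""
    return df.dropna(how="all")

def parse_dates(df, columns):
    """Convert the named columns to datetimes; bad values become NaT."""
    out = df.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

def run_pipeline(raw):
    """Each step takes a DataFrame and returns a new one, so steps chain."""
    return (
        raw
        .pipe(standardize_columns)
        .pipe(drop_empty_rows)
        .pipe(parse_dates, columns=["visit_date"])
        .reset_index(drop=True)
    )

# Hypothetical raw input standing in for a messy data file
raw = pd.DataFrame({
    " Visit Date ": ["2020-04-01", None, "2020-04-03"],
    "Heart Rate": [72.0, None, 88.0],
})
product = run_pipeline(raw)
print(product)
```

Keeping every step as a plain function from DataFrame to DataFrame makes it easy to reorder, test or swap steps, and the same `run_pipeline` call can be triggered by a scheduler instead of by hand.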
Samuel Hinton: If you have a very large data set, if your data set is
either billions of records or thousands of features, you
may not be able to run this on a laptop. You may need
to ship it out to high performance computers, submit
it to some sort of batching job on a cluster like SLURM
or SGE, a whole bunch of them. That's very difficult to
do in a workshop. No one wants to spend two hours
setting up and trying to apply for accounts. We'll have
to cover a representative but basic example. Then, give
people the skills or the pointers as to how they can
scale that up.
Samuel Hinton: Whether they're using things like Dask to try and scale
out to clusters or whether they just need to know this
is how you submit jobs to a supercomputer.
Obviously, you can't do all of that in a single
workshop. We'll have to cover the basics as much as
possible. Then say, for your use cases, this is where
you want to go. For your use cases, you're going to go
look over at this. That's actually something that I think
we're going to run a survey with the people that
responded to the first survey saying, what are your use
cases? What, in your mind, is a good data science
pipeline? What do you want out at the end? What are
the products that you're talking about? What are the
inputs that you're dealing with? Are we talking about
megabytes, gigabytes or terabytes of data?
Samuel Hinton: The pipeline changes depending on all of these
questions. That's something that we really need from
the people going to the conference is what are their
use cases? Because only with that knowledge, can we
create an effective workshop that actually benefits
them at the end of the day.
Kirill Eremenko: Absolutely. Yeah. That's one of the reasons why we ran
the first workshop to know exactly what people want.
Sorry, so we ran the first survey. Now, we're going to
run the second one. If you're listening to this, and for
some reason you weren't part of the second survey,
maybe we missed your response, in terms of
identifying you as a key participant for the second
survey or these more in depth interviews that we're
conducting, please send either our team or Sam
directly, preferably Sam directly, an email.
Samuel Hinton: No, no, no preferably the team.
Kirill Eremenko: You can send us an email. We'll include it in the show
notes for this episode, so you can find it there.
Basically, send us an email and explain exactly what
you would like to see in this data science pipelines
workshop. When this goes live, we will still have time
to incorporate your feedback.
Kirill Eremenko: Yeah, that'll be cool. I'm looking forward to it. It'll be
virtual so there'll be people from all over the world.
Yeah. You have so much experience with this now
especially with this COVID stuff.
Samuel Hinton: Yeah, sounds like fun. No pressure though. Yeah?
Kirill Eremenko: Hope you find time to not go crazy with all this stuff
around ...
Samuel Hinton: Fingers crossed.
Kirill Eremenko: Another thing that I had in mind: I watched your talk
on cosmology with my girlfriend. What's your website
again? What's that wonderful website?
Samuel Hinton: cosmiccoding.com.au.
Kirill Eremenko: Yeah. Everybody cosmiccoding.com.au. Don't forget
the .au, we're very special in Australia. Yeah. Amazing.
I love the talk. If anybody's looking, it's called the Dark
Side Of The Universe by Sam Hinton in the
BrisScience lecture. I've watched about three quarters
of it so far, or maybe two thirds. Amazing. I loved it. I
knew about dark matter. I didn't know there was even
more dark energy in the universe. That's crazy, man.
Samuel Hinton: Yeah. I wish I knew what they were.
Kirill Eremenko: Yeah. I'll due to that. Some cool things. I like that you
provide the evidence of how the charts fall in place and
that this is not just like voodoo stuff. It's actual ...
Samuel Hinton: Yeah. It's probably the most common question I get is
like, well, what if dark energy and dark matter is just a
mistake? What if Einstein was wrong? I was like,
"Okay, he could be wrong but there are dozens of
independent, different avenues of investigating that, all
come to the same conclusion." If it's a mistake,
someone needs to come in and say how, because we
have very, very substantial evidence that it's a real
thing. We just don't know exactly how it's supposed to
function or where it came from.
Kirill Eremenko: Yeah.
Samuel Hinton: One day, we'll have the answer.
Kirill Eremenko: One day, maybe. A million years from now.
Samuel Hinton: Pretty much.
Kirill Eremenko: Is that what your job in America is going to be about?
What's this postdoc?
Samuel Hinton: Yeah, so the postdoc is to investigate dark energy and
dark matter primarily using two different probes of the
universe. The first being Type Ia supernovae and the
second being the large scale structure of the universe.
To give a very, very tortured and brief intro to both of
them. Type Ia supernovae are a sort of exploding star
that explodes at around the same brightness every time.
What you can do is you can use them to map out the
history of the universe, because remember, light takes
time to travel. A galaxy that has a supernova that we
see that's a billion light years away, well, that
supernova exploded a billion years ago.
Samuel Hinton: Because they're all the same brightness, it means we
can figure out how far away that galaxy is by how dim
the supernova is. If you have a light and you start
walking away, if you walk twice as far away, the
brightness of the light is now a quarter, because the light
is spreading out to cover the area of a sphere,
4 pi r squared. The idea is with this standard
candle, we can map out the evolution of the universe.
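The standard-candle argument rests on the inverse-square law: light from a source of known luminosity spreads over a sphere of area 4 pi r squared, so a source seen from twice as far away appears a quarter as bright. The sketch below shows only that geometric relation and deliberately ignores the cosmological corrections (redshift, expansion) that a real supernova analysis needs. The function names and numbers are illustrative.

```python
import math

def flux(luminosity, distance):
    """Observed flux from a source radiating equally in all directions."""
    return luminosity / (4.0 * math.pi * distance ** 2)

def distance_from_flux(luminosity, observed_flux):
    """Invert the inverse-square law for a source of known luminosity."""
    return math.sqrt(luminosity / (4.0 * math.pi * observed_flux))

# Doubling the distance cuts the observed flux to a quarter
ratio = flux(1.0, 2.0) / flux(1.0, 1.0)

# A standard candle: known luminosity plus measured flux gives the distance
recovered = distance_from_flux(3.0, flux(3.0, 7.0))
print(ratio, recovered)
```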
Samuel Hinton: The evolution of the universe changes depending on
the properties of dark energy and dark matter. The
better we can constrain the evolution, the better we
can determine the properties of those mysterious
components. Then hopefully, a theoretician will come
along and say, "I propose dark energy is this with
these properties." We say, "That works or that doesn't
work." At the moment, the leading one is Einstein, who
just said, "Dark energy is probably just space itself,
empty space, having energy." It turns out that fits with
everything.
Samuel Hinton: We just don't know why it should have energy. You
take quantum mechanics and you calculate how much
energy the empty vacuum should have, it's not zero,
right? Quantum mechanics says there should be
energy, but it says there should be so much more
energy, 100 orders of magnitude more energy than we observe,
which is catastrophically wrong.
Samuel Hinton: The second thing, the second probe is large scale
structure. Let's see, what's the easy way to explain
that? The universe is big now. In space, no one can
hear you scream. That's true. Remember, the universe
is expanding. If we go back in time, the universe gets
smaller and smaller but the amount of stuff doesn't
decrease. The amount of stuff stays the same, but it's
now in a much smaller volume. Space goes from being
empty to being filled with stuff like Earth's
atmosphere. It goes to be thick and dense and it
becomes a fluid. Because [crosstalk 01:05:23].
Samuel Hinton: Yeah. If you go all the way back to right after the Big
Bang, space looks like a fluid. It's got so much stuff in
it and space is smaller, it acts like a fluid. What that
means is, well, quantum mechanics says that right
after the Big Bang, some parts of the universe have
just a little bit more energy than other parts. Energy,
mass, light, they're all the same thing at this point in
time, so it has a bit more stuff.
Samuel Hinton: Imagine you blow up a balloon in the atmosphere.
That balloon, the area inside the balloon, has more air
than outside. You pop it, you get rid of that elastic
shell and what happens? You hear the pop, the air
spreads out. It's a little shock wave. It's generally not a
shock wave because it's just air pressure moving. It's a
sound wave. It's an acoustic wave. You have these in
the early universe. You have these acoustic waves from
these overdense regions spreading out.
Samuel Hinton: Imagine, it's like you've got a still lake and it starts to
rain, you can see all the ripples from the raindrops
spreading out. That's what the early universe looks
like. I'm taking a little bit of time here, but space was a
fluid back then and it's not a fluid now, which means
at some point, it went from a fluid to not a fluid. This
actually happens incredibly quickly in astronomical
terms. We're not talking billions of years or millions of
years. We're talking thousands of years, very quickly.
Samuel Hinton: Imagine you've got this lake that's been rained on with
all the ripples spreading out. Then, suddenly,
instantly, the lake snap freezes for some reason.
All the ripples will now be imprinted in the
ice because it's frozen straight away. That's what we
see in the universe except we can't hear it. We can't
hear the ripples. If we take a telescope and we
measure 100 million galaxies, we can reconstruct the
ripples because the ripples are patches of overdensity.
Overdensity means more mass; more mass means more
gravity, more stars, more galaxies, more stuff.
Samuel Hinton: If we simply figure out where there's more stuff in the
universe than other places, that's a ripple. We go out,
we find all these ripples, and we use them in a very
similar way to how we use the supernovae. Instead of it
being a standard candle, it's a standard ruler. We
know how big the ripples are. We know that they had
300,000 years to expand. We know at what speed they
expanded because it's tied to the speed of light.
Samuel Hinton: If we find a ripple that's X big in some area of the
universe, we know that it started off as Y big. It's
thereby increased in size X over Y. We can figure out
how much any patch of the universe has expanded.
Again, we use that to map the expansion history. We
try and infer the properties of dark energy and dark
matter. That was about a five minute explanation. I'm
going to call it there. Anyone wants any more detail,
there's a whole bunch of videos on dark energy and
large scale structure. That thing is called the baryon
acoustic oscillation if you're curious.
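
The standard-ruler arithmetic can be sketched the same way. This is a rough illustration, not the actual analysis: the patch labels, the sizes, and the function name are all hypothetical, and every size is in the same arbitrary units.

```python
def expansion_factor(observed_size, original_size):
    """How many times larger a ripple is now than when it froze in: X / Y."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return observed_size / original_size


# Y: the known starting size of the ruler (hypothetical value).
ORIGINAL_SIZE = 1.0

# X: the ripple size measured in each patch of sky (hypothetical values).
patches = [("patch A", 4.0), ("patch B", 2.0)]

# One expansion factor per patch gives points on the expansion history.
history = {label: expansion_factor(x, ORIGINAL_SIZE) for label, x in patches}
print(history)
```

Each patch gives one measurement of how much the universe has expanded, and many such measurements together trace out the expansion history that dark energy and dark matter models have to match.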
Kirill Eremenko: Wonderful, wonderful. What do you predict will be some
side effects of your research? Will we have new MRIs or
anything like that?
Samuel Hinton: It's so hard to tell. The main benefit of current astro
research is in the breakthroughs that we have in deep
learning and machine learning. We obviously have
images of the night sky. We've tried to identify things
in those images. That's obviously very closely related
to things like identifying tumors or medical
abnormalities in MRI images or similar.
Samuel Hinton: At the moment, it's looking less like a tech
breakthrough. Astrophysics gave the world digital
cameras a couple of decades ago. I think we're still
coasting on that and we'll coast on that for as long as we
can. For now, it's just about sharing techniques.
Kirill Eremenko: Okay. Got you. All right. Looking forward to that, it
sounds very exciting. Hopefully, this job goes ahead
very soon. It's going to be fun.
Samuel Hinton: Fingers crossed.
Kirill Eremenko: Okay. One last thing I wanted to chat to you about
before we finish up is what books are you reading? A
person of your mind, of your breadth of applications
and knowledge in data science and other fields, surely
there are some interesting things you're looking into
where you get all this information from?
Samuel Hinton: Yeah. There are a few. There are a few books that I was
recommended recently on causal modeling that I have
on my to-read list. However, in the past few months, I
will admit that I have not touched a single textbook.
That's not that abnormal for me. With so much work
with having both of these jobs and working nonstop,
when I get a bit of downtime, I pick up my novels.
Samuel Hinton: I just need a break from all of this data science, all of
these data pipelines, I just need to turn off. I'm a huge
reader of fantasy. Brandon Sanderson, all of his books
and similar authors. I've read around, I think, 45
books this year.
Kirill Eremenko: Forty-five books this year. It is April.
Samuel Hinton: Yeah, it's ...
Kirill Eremenko: Ten per month.
Samuel Hinton: I normally average like one every day or two. If I have a
weekend off, I can read an entire book in a day. I
generally feel bad at the end of it. I feel like oh, wow,
you really should have done something else. You could
have been at least a bit productive.
Kirill Eremenko: Yeah. Okay.
Samuel Hinton: I try not to be, you know.
Kirill Eremenko: It's crazy, man. It takes me a month to read a book
sometimes. What's the most memorable book even if
it's fiction that you read this year?
Samuel Hinton: Geez. I don't even know the name of them. That's the
issue, I don't keep track. I was like, that's a good book.
I will download it, read it and go on to the next. I don't
remember the authors. I don't remember the names.
Let's see. Hang on, give me one second. I have the
online repository in front of me. Let me just open
books. Yes. Okay. I think one of the ones that I liked
the most by Andrew Rowe was Sufficiently Advanced
Magic, which is just a light hearted fantasy
progression style thing. That was nice.
Samuel Hinton: Then, there was the Cradle series by Will Wight, it
was seven books or so. I read those in about three
days.
Kirill Eremenko: How do you spell that?
Samuel Hinton: Small books.
Kirill Eremenko: Yeah. Small books. Do you do like some speed reading
or something like that?
Samuel Hinton: I used to. I try not to anymore. I did speed reading
years ago and I got up to like several thousand words
a minute. I just realized that there's no fun in it. If you
read a book in an hour, A, my retention is horrible. I read to
relax. What am I supposed to do if I read everything I
have on my Kindle or my phone in a single day? No, I
probably still read exceptionally quickly but I no longer
try and speed read.
Kirill Eremenko: Okay.
Samuel Hinton: It's probably still abnormally fast but I just have to live
with that.
Kirill Eremenko: It is. It is. Very impressive, that's 45 books this year.
I'm put to shame. I've read like three, two or three.
Yeah, okay. All right, cool. Thanks for the
recommendations, Andrew Rowe and Will Wight, some
good fantasy books if anybody's looking for any.
Kirill Eremenko: Yeah, I think we've covered off everything. Is there
anything you wanted to touch on before we wrap up?
Samuel Hinton: Not particularly. I'm just hoping that I get a weekend
off in the next couple of months and can sit down and
just chill out and perhaps go and actually read a
textbook for once. That would be nice, a change of
pace when things slow down enough that I can just
breathe and learn. Because learning is, I think, one of
the things that I like most. It's so hard to find the time.
Samuel Hinton: If anyone out there is currently stuck in iso and is
dedicating themselves to going through courses and
upskilling themselves, I think, that's an absolutely
fantastic use of time. I wish everyone doing that the
absolute best of luck. I know that there's a bunch of
people that have lost their jobs recently, and at least
trying to make a positive use of the time that we all
have to spend at home is as much as we can ask.
Kirill Eremenko: Fantastic. Thanks, Sam. A huge thank you on behalf
of all our listeners for doing what you're doing. You're
helping the world. Even though you don't have any
free time and it's hectic, somebody's got to do the job
and you're the best fit for this from the people I know
for sure. Thanks so much for what you're doing.
Awesome.
Samuel Hinton: My pleasure.
Kirill Eremenko: Before we wrap up, where can our listeners find you?
What are the best places to contact you? Follow your
work? Get in touch?
Samuel Hinton: Let's see. I mean, LinkedIn is an easy one. You can
send me a message there. I don't check it often but I
do check it eventually. That's probably the best way
because most emails or whatnot that I get in the data
science route are just as easily done on LinkedIn. No
one really wants Instagram. I try not to do work on
that. Apart from that, yeah, hit me up on LinkedIn,
probably.
Samuel Hinton: If there are any urgent queries, feel free to send me an
email. Just know that I am incredibly swamped with
emails at the moment. I don't know if I'll have time to
respond in the next couple of weeks.
Kirill Eremenko: Got you. Sam's website if anybody is interested in
watching his lecture is cosmiccoding.com.au. Very,
very cool. Thanks again, Sam. Great, great pleasure
chatting on the show. Awesome as always.
Samuel Hinton: Thanks for having me, mate.
Kirill Eremenko: There we have it. Thank you so much everybody for
spending your time, investing your time into this
episode and learning alongside us. I hope you got
a lot of valuable takeaways. Yeah, so much cool stuff,
so many cool things. Without a doubt, my favorite part
of this episode was all the things that Sam is
describing about the COVID Critical Care Consortium
where he is the lead data analyst and all the
takeaways he's getting.
Kirill Eremenko: Also, the insights into what it's like to work with real
world data: how messy it is, what challenges come
up. I think that was a great refresher. Some projects,
especially if they are course projects or projects
prepared for you by somebody else, can be too clean.
They might not have any messiness in
the data, and anybody can be led to believe that data
science is like that. It's not. It's actually very, very
complex. There's a lot of missing data. There's a lot of
normalization that needs to happen. A lot of pre-work
on the data, building the data pipeline, all of that is
super valuable.
Kirill Eremenko: Speaking of data pipelines, make sure to check out
Sam's workshop at DataScienceGO Virtual.
DataScienceGO Virtual is happening at the end of
June, this year, 2020. You can get your ticket
absolutely free if you go to datasciencego.com/virtual.
Just be careful, the number of seats is limited. This is our
first time doing an online virtual event. We've done this
event many times in real life in California for many
years. This is our first virtual event. The number of
seats is limited. Make sure if you want to get in, apply
for your seat today at datasciencego.com/virtual.
Kirill Eremenko: You'll see Sam running a workshop on data science
pipelines. You'll actually be able to code along with
him and create your very own data science pipeline.
Make sure not to miss that. As usual, you can get the
show notes for this episode at
superdatascience.com/367. That's
superdatascience.com/367. There, you'll find the
transcript for this episode, any materials we
mentioned, including books and the URLs to Sam's
LinkedIn, website, his presentations online, and any
other fun things that might help your learning growth
in data science.
Kirill Eremenko: There we go. Make sure to check that out as well. On
that note, thanks so much for being here. Sam and I
are looking forward to seeing you at DataScienceGO
Virtual in a couple of weeks. Apply for a ticket today if
you haven't yet. Until next time, happy analyzing.