SDS PODCAST
EPISODE 385:
ADVANCED DATA
TOPICS AND
PEOPLE-CENTERED
DATA SCIENCE
Kirill Eremenko: 00:00:00 This is episode number 385 with Lead Data Scientist,
Scott Clendaniel.
Kirill Eremenko: 00:00:12 Welcome to the SuperDataScience podcast. My name is
Kirill Eremenko, Data Science Coach and Lifestyle
Entrepreneur. Each week, we bring you inspiring people
and ideas to help you build your successful career in data
science. Thanks for being here today, and now let's make
the complex simple.
Kirill Eremenko: 00:00:44 Hello everybody, and welcome back to the
SuperDataScience podcast. Super excited to have you
back here on the show. Today, I've got a super amazing
treat for all of us. I just got off the call with Scott
Clendaniel. Scott is a lead data scientist at Franklin
Templeton. He has a
huge amount of experience in this space of data science
and machine learning, and he's always happy to give back
to the community. This podcast is going to be an
advanced podcast. It's specifically going to be useful to
you if you are an intermediate practitioner in data science
or an advanced practitioner in data science.
Kirill Eremenko: 00:01:18 You're interested in things like models, cross-validation,
oversampling and things like that. With that said, here
are some of the topics that you'll hear about today. You'll
hear about Scott's story and how he got into the space of
data science. We'll talk about fraud detection because it's
a big part of the financial services industry. We'll talk
about some specific examples of ways to detect fraud,
including Benford's law. We'll talk about oversampling
the minority class, the multiplicity of good models, and
the tools that Scott uses on a daily basis.
Kirill Eremenko: 00:01:51 We'll talk about data preparation techniques. Specifically,
we'll talk about target mean encoding and one-hot
encoding, what they are, which one is better, and why.
Then we'll talk about model drift, and we'll discuss why
models decay over time. Scott will give his
recommendations on how often he checks up on models.
We'll talk about population stability reports.
We'll talk about some real-world examples and cross-
validation. Then we'll cover off Scott's advice on some of
the softer skills like data science leadership, what it
means to manage data science teams, how to best
structure a data science team, the hub and spoke model
for that.
Kirill Eremenko: 00:02:28 We'll get Scott's ideas and visions for what is coming for
data science in the future. A very exciting podcast coming
up ahead, I can't wait for you to check it out. Without
further ado, I bring to you Scott Clendaniel, lead data
scientist at Franklin Templeton.
Kirill Eremenko: 00:02:51 Welcome back to the SuperDataScience podcast,
everybody. Super pumped to have you back here on the
show. Today, we've got a super special guest joining us,
Scott Clendaniel. Scott, welcome.
Scott Clendaniel: 00:03:02 Thank you so much. I'm really happy to be on.
Kirill Eremenko: 00:03:05 That's awesome. I forgot to ask you, where are you
located right now?
Scott Clendaniel: 00:03:09 I am actually in Havre de Grace, Maryland off the
Chesapeake Bay, about 45 minutes north of Washington,
DC.
Kirill Eremenko: 00:03:16 Havre de Grace, Maryland.
Scott Clendaniel: 00:03:20 Absolutely. [crosstalk 00:03:23].
Kirill Eremenko: 00:03:23 Havre de Grace is the name of the city.
Scott Clendaniel: 00:03:26 Yes. Port of mercy, I believe, is the loose translation.
Kirill Eremenko: 00:03:32 Very interesting. How did you end up there? Have you
been there for a long time?
Scott Clendaniel: 00:03:38 Actually, we just moved here. We had an opportunity
because my job allows me to work remote to be able to
change location. My wife as a children's librarian just got
a position up here near Cecil County. We just moved
here, and we're pleased as punch. It's really pretty.
Kirill Eremenko: 00:03:57 All right, tell us... Before that, where were you located
before that?
Scott Clendaniel: 00:04:03 I was actually born in Baltimore, Maryland, lived there for
a long period of time. Then I was in Delaware,
Pennsylvania, and a few years in Honolulu, Hawaii.
Kirill Eremenko: 00:04:15 Amazing. I've always wanted to go to Hawaii. What is
Honolulu like?
Scott Clendaniel: 00:04:20 Honolulu is really a fantastic city. I really loved living
there. Obviously, a lot of tourism, but the people there are
just so warm and inviting, really had a good time there, a
big fan of Hawaii.
Kirill Eremenko: 00:04:38 What island is Honolulu on?
Scott Clendaniel: 00:04:39 Oahu, which is the main island. About three quarters of
the population lives on that island. There's the total
[crosstalk 00:04:48].
Kirill Eremenko: 00:04:47 Wow. I heard there's a great restaurant on Maui, and it's
called Mama's Fish House, I think. Do you know that?
Scott Clendaniel: 00:04:56 No, I don't know that one. I haven't been there, but I
visited four of the eight islands while I lived there. Maui is
very pretty. Each island has its own personality, which is
fun.
Kirill Eremenko: 00:05:10 What are the differences in personality?
Scott Clendaniel: 00:05:10 Well, Kauai is the garden island, so it tends to be much
more laid back. It is probably the greenest of the islands.
That's great. The big island has all kinds of different
topography. It's probably the only place in the world
where you can go snow skiing and water skiing on the
same day.
Kirill Eremenko: 00:05:28 Wow.
Scott Clendaniel: 00:05:29 Each island has its own unique factors to it. Maui's a lot
of fun too. Oahu is where most people go. That's where
Waikiki is. That's probably the most popular of the group.
Kirill Eremenko: 00:05:41 Oh, fantastic. Very cool. The mountains are tall enough
for skiing?
Scott Clendaniel: 00:05:46 Only on the big island. When I say snow skiing, I'm not
talking about Aspen or Vail, Colorado. I mean, you can
forget about that, but you can at least tell people, "Hey, this
is fantastic. On the same day, I went water skiing and
snow skiing."
Kirill Eremenko: 00:06:02 Amazing. Okay. Got you. Well, Scott, it's a pleasure to
have you here. We've got a lot to go through. For those
who are listening and maybe don't know, I posted on
LinkedIn just, "Hey, Scott Clendaniel is coming to the
show in 24 hours, and post your specific asks for
advanced data science questions." In that 24 hours,
there's now 56 messages in there. Thank you very much
for taking the time to answer some of those people.
There's been a lot of cool discussions and [crosstalk
00:06:33].
Scott Clendaniel: 00:06:32 Absolutely.
Kirill Eremenko: 00:06:34 I really want to go through [crosstalk 00:06:36].
Scott Clendaniel: 00:06:35 Very few of them were from my bill collectors, which I
really appreciated.
Kirill Eremenko: 00:06:41 Gotcha. What I want to start with though, so we'll
definitely get to those, and there are some really cool
advanced machine learning questions, models and cross-
validation and things like that. Before we get there, give
us a bit of a background around yourself. Who is Scott
Clendaniel?
Scott Clendaniel: 00:07:02 Sure. Actually, I have a bit of an unusual background. My
undergraduate degree and my MBA are both in strategic
planning. They're not in either statistics or computer
science, which makes me relatively different from a lot of
other folks in the field. I was in financial services for a
long period of time,
and all the way up through vice president of consumer
lending at Bank of Hawaii. Unfortunately, I had a family
emergency where my ex-wife at that time decided to take
my son, and so I had to leave.
Scott Clendaniel: 00:07:33 She had indicated to my son who was only three years old
at that time that I had actually gone, because I was
packing his toys in Hawaii. I was like, "Why did you do
that?" Anyhow, so I got a phone call from my three-year-
old son one day. He said, "Daddy, are you done packing
my toys yet?" It was the worst thing ever. I was like,
"Gosh, I've got to figure out how to give up my career in
financial services and do something else."
Scott Clendaniel: 00:08:00 I thought to myself, "I wonder if anyone's interested in
this machine learning artificial intelligence stuff. I wonder
if I could do that." For the next 16 years, I became a
consultant. To all those folks out there who are trying to
break into data science, if I can get past that, you can get
past whatever you're facing currently. I encourage
everybody to give it a shot.
Kirill Eremenko: 00:08:23 Wow. Wow. What a story. You just packed your stuff,
gave up a vice president position at Bank of Hawaii, and
moved back to the mainland U.S. How did that go for you?
Scott Clendaniel: 00:08:36 That was rough [inaudible 00:08:38], but it opened up a
world of opportunity. It also gave me a whole new
perspective on what's important in life and what isn't, and
also gave me a lot of focus on persevering and problem
solving and how I could add value to others. It was a
tough thing to go through, but it provided a lot of
opportunities for me later on in life.
Kirill Eremenko: 00:09:03 Wow. Wow. Amazing. 16 years past. You were consulting
in this space. What happened next, and where are you
now?
Scott Clendaniel: 00:09:16 Absolutely. Morgan Stanley actually recruited me to be
their first full time data scientist in their Baltimore office.
That was great. Unfortunately, they had a situation with
some internal fraud, where... I can share this because it
was on the front page of the New York Times. Someone
walked out the front door with $11 million. They suddenly
changed my role to focus exclusively on internal fraud,
which wasn't really what I wanted to be doing at that
point in time.
Scott Clendaniel: 00:09:46 I wanted to stick with machine learning. I was recruited
away by Legg Mason, and had been there for two and a
half years. We just got purchased by Franklin Templeton,
so I've been trying to help build functions for those
organizations.
Kirill Eremenko: 00:10:02 Just to give a bit of background on Franklin Templeton,
this is one of the world's largest global investment firms.
Scott Clendaniel: 00:10:11 Yeah. Once the transaction is completed, which will
probably be about the time this podcast is released, it'll
be approximately $1.4 trillion in assets making us the
sixth largest in the world.
Kirill Eremenko: 00:10:23 Wow. That is crazy. As a lead data scientist at this
company, what is your role like? Do you actually look at
how to invest this money, or are you looking for fraud, or
is it like a broad scope of locations? What exactly do you
do?
Scott Clendaniel: 00:10:41 Most asset management firms have separate groups who
do the actual portfolios of investments, so I'm not
involved in that so much. I help work on other types of
business problems like optimizing sales, trying to help in
profiling customers and opportunities, occasionally some
small pieces of fraud detection, and actually trying to
educate the organization as a whole on best practices in
analytics, trying to make sure that we can bridge the
academic component of what's going on out in the world
versus the real world, and trying to bring people up to
speed. I do a lot of training for folks, helping them form
their own business plans and helping them build their
models. It's outreach, really.
Kirill Eremenko: 00:11:28 Gotcha, so quite a broad scope of applications, but not
specifically the investment management. Interesting. Very
interesting.
Kirill Eremenko: 00:11:40 I hope you're enjoying the podcast, and we'll get straight
back to it after this quick announcement. This
announcement is going to be a bit tough for me because it
is about my own book, so please excuse the shameless
plug. However, I do believe in it so much that I want to
get the word out there. This book is designed in a way to
get anybody and everybody up to speed with data science,
pretty much everything that is important that is needed
to get going.
Kirill Eremenko: 00:12:05 The unique proposition of this book is that it doesn't
require coding. There's a lot of books out there on data
science where you need to sit in front of the computer
and code in Python or R. This book, you simply take and
you can read it on your lap, in a car, in a plane, in your
backyard, on a couch. You can read anywhere. There is
no coding in the book. It focuses on intuition. If you've
taken our courses and you like those intuition tutorials
about how an algorithm works and why, rather than what
the code behind it is, then you're going to love this book.
Kirill Eremenko: 00:12:41 It's going to be a great way to solidify that knowledge. It
covers pretty much everything in a data science project
lifecycle from asking the right questions to data
preparation, to machine learning, to visualization and
finally presentation. Pretty much everything you need in
your career is covered. If that sounds exciting, check it
out. It's called Confident Data Skills available on Amazon.
It's the data science book with the purple cover, and
please enjoy.
Kirill Eremenko: 00:13:11 Well, on that note, I think let's dive into these questions
because there's quite a lot to go through, and I think
that'll take us-
Scott Clendaniel: 00:13:23 Let's go.
Kirill Eremenko: 00:13:23 All right, awesome. I've gone through the questions. This
was on LinkedIn. I've sorted them in order of... We're
going to start with the most advanced machine learning,
AI, deep learning stuff first.
Scott Clendaniel: 00:13:34 Fine. No pressure on me. That'll be great. Okay, go ahead.
Kirill Eremenko: 00:13:38 Here we go. Vighnesh, if I'm pronouncing that correctly,
Vighnesh asks, "Can you give us a brief about the real
world applications of data science in the investment
industry? How do you approach a particular problem in
this space?"
Scott Clendaniel: 00:13:58 Sure. I think one of the big components in any field is
actually making sure what the business problem is and
defining that first before you actually define your
modeling approach. In investment management, there are
a bunch of different problems where data science can be
applied, and also, financial services have been involved in
advanced analytics since at least the 1960s, so it's a great
field to be in. A couple of examples would be fraud
detection. How do you tell whether a particular
transaction is someone spoofing, whether they're the real
person or not?
Scott Clendaniel: 00:14:31 There's developing portfolios and deciding which assets
should be in there. You can look at time series forecasting
of how an individual investment is going to perform. One
of the problems I've been working on a lot recently is
sales optimization, so how does a financial
advisor look at the broad palette of customers and
potential customers, and figure out who should be
prioritized in terms of what their needs are and coming
up with a product recommendation on what would fit
their needs?
Kirill Eremenko: 00:15:05 Gotcha. It ties in well with what we just spoke about, that
there is a broad scope of applications and business
problems. Gotcha. What specific AI technologies have...
This is Matthew. Matthew asks, "What specific AI
technologies have changed the investment industry, and
which do you predict will shape the industry in the next
five years?"
Scott Clendaniel: 00:15:28 Sure. I think the development of additional algorithms to
be available to us has changed AI quite a bit. Deep
learning, perhaps not as much as others, but things like
XGBoost and algorithms that allow for ensembling have
really helped the industry quite a lot. Also, approaches in
terms of anomaly detection for fraud detection, they've
been huge contributors as well. Those are probably the
changes in AI that have impacted the most.
Scott Clendaniel: 00:16:00 Also, the growth of open source has made it very difficult
for organizations to say no. There was a time many, many
decades ago where people would say, "Oh, no, I'm sorry.
We can't do anything without a software package that
costs $90,000." Now, they can't say that anymore. I think
that actually had probably the biggest impact on the
growth of AI overall.
Kirill Eremenko: 00:16:23 Gotcha. Gotcha. In the next five years?
Scott Clendaniel: 00:16:27 Next five years, I think there are going to be huge
opportunities in terms of predicting credit performance
and also fraud detection. Those are extremely difficult
problems to solve, and the more advanced AI
technologies, I think, are going to continue to help in that
arena, especially fraud detection, because it keeps
changing. What appears to be fraud in one given quarter
may look very different the next quarter, because the
fraudsters are always adapting and changing their
approaches.
Scott Clendaniel: 00:16:58 So you need to have a technology that allows the models
to continue to grow over time. You can't just pick a point
in time and say, "Okay, we know what fraud is," because
it won't be the same next quarter.
Kirill Eremenko: 00:17:09 Gotcha. Before I forget, I wanted to say that Scott is
sharing his comments today on behalf of himself and not
on behalf of the organization that he works at. These are
just opinions at the end of the day, our opinions.
Scott Clendaniel: 00:17:27 It's all my fault. I want to make myself really clear.
Everything I say is my fault. No one else's.
Kirill Eremenko: 00:17:32 Thank you. Thank you, Scott. Speaking of fraud, do you
know what the size of this problem, globally or in the U.S.
is per annum?
Scott Clendaniel: 00:17:47 I don't have recent statistics on that, but it runs into the
several billions of dollars. The challenge is the fact that
you not only lose the profit on the given transaction that
would come in if it turns out to be fraudulent, but you
lose the entire dollar amount. In financial services, your
inventory is actually the dollars that you manage. If you
have a fraudulent transaction, you lose every bit of that
inventory along with any type of profit you would have
made.
Scott Clendaniel: 00:18:18 It runs into many, many billions of dollars, so it is a huge
issue. It's also really complicated because the fraud rate,
the percentage of transactions that are fraudulent, tends
to be very low, but its financial impact is ridiculously large.
It's a real class imbalance problem.
Kirill Eremenko: 00:18:36 I did a quick search. The global fraud detection and
prevention market size is valued at $17.3 billion. As you
said, lots and lots [crosstalk 00:18:51].
Scott Clendaniel: 00:18:51 That's just the market to stop the fraud. That's not the
product sales. You've got a real clear picture of how big of
an issue it is.
Kirill Eremenko: 00:18:58 That's a good point. Intuitively, I know... I've read about
one specific fraud algorithm that I could confidently
explain, and that's Benford's law. Are there any
algorithms that you can share with us?
Scott Clendaniel: 00:19:23 Sure. Actually, I'll make a recommendation of something
to be careful with. There has been a huge amount of
press about using anomaly detection for fraud detection.
That is very helpful. It does have some pretty severe
limitations though, in that a given fraudster is going to
try very hard to not look like an anomaly. In other words,
to some extent, the data is actually fighting against you.
The fraudster is trying to look as similar as they can to
the mean of any given transaction.
Scott Clendaniel: 00:19:57 They're actively fighting you to not look like an anomaly.
The problem is the false positive rate on anomaly
detection is enormous, and it's very difficult to fight
against. Just using anomaly detection, regardless of how
sophisticated the version that you're using is, tends to
have some severe limitations that you're going to come up
against, so just be aware it should be one tool in your
arsenal. It shouldn't be the be-all and end-all.
Kirill Eremenko: 00:20:28 Gotcha. Anomaly detection is one. That's fantastic. I'll
share Benford's law. Benford's law is more of an aggregate
tool. It doesn't look at individual transactions. It looks at
them as a whole. The way we were taught it at Deloitte is that if
you take a... I might be paraphrasing what they told us
back then. Again, I'm speaking from my opinion as well.
You take all the transactions on all of the dollar values on
a balance sheet of a company.
Kirill Eremenko: 00:20:59 Just take all of them. Mix them all up. Put them into a
bag, and then look at the distribution of the first... Is it
the first? No, of the first digit in all of these transactions.
What is the first digit of all of these amounts? What's the
first digit? How many number ones are there? How many
number twos are there? How many number threes are
there? The leading digit, and Benford's law says that the
distribution should be 30% ones, 17% twos, 12% threes,
9% fours.
Kirill Eremenko: 00:21:40 It drops off as you go further. It's an intuitive thing, and
that is something that is really hard to fake, right? If
you're faking a balance sheet and you're making up
numbers, you don't have that distribution in mind. You
might make your numbers look really believable, but
overall, when you take the distribution of the first digit
across all the numbers, it's not going to follow Benford's
law. That's how a qualified expert can tell that, "Hey,
there's something going on here."
Scott Clendaniel: 00:22:10 In forensic accounting, that becomes really important.
That is definitely one of the warning signs. There are a
couple of interesting things about Benford's law in
addition to what you said. I think you gave a great
explanation of it. Benford's law seems to apply even if you
change the base unit. If it's not a decimal system, if you
used a base eight system, you will tend to see very similar
patterns. The percentage of digits will change, but I just
find it amazing that even if you change the base number
that you're using, it tends to show up.
Scott Clendaniel: 00:22:45 It's great for things like reviewing the balance sheet, just
as you mentioned. What becomes tricky is the fact that
when you're dealing with consumer transactions, for
example, it doesn't apply as much for the trailing digits.
Everybody wants to charge
$1.99 or $1.95. They don't tend to charge $1.03. You
actually have the opposite problem there, so you have to
be careful. Benford's law is extremely powerful when
you're looking at a whole collection of numbers given in
one instance such as the balance sheet. It becomes much
harder when you're trying to look at individual
transactions.
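
For listeners who want to try this, here is a minimal Python sketch of the first-digit check described above (illustrative only; the sample amounts are made up):

```python
import math
from collections import Counter

def benford_expected():
    # Benford's law: P(d) = log10(1 + 1/d), giving roughly 30% ones,
    # 17.6% twos, 12.5% threes, down to about 4.6% nines.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_frequencies(amounts):
    # Scientific notation exposes the leading digit: 1203.5 -> "1.2035e+03".
    digits = [int(f"{abs(a):e}"[0]) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

expected = benford_expected()
observed = first_digit_frequencies([1203.5, 45.10, 1.99, 310.00, 87.2, 1500.0])
for d in range(1, 10):
    print(f"digit {d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

Large, systematic gaps between the observed and expected frequencies across a whole collection of figures, such as a balance sheet, are the kind of red flag a forensic accountant looks for.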
Kirill Eremenko: 00:23:24 Awesome. Awesome. Thank you. Well, there we go. That's
two techniques in fraud detection. Speaking of fraud
detection, we have a question from, again, Vighnesh who
says, "While working on fraud detection problems, most of
the times, we come across imbalanced datasets. Can you
please put a light on how to overcome such problems or
how to resolve it?" Maybe to start off, what does it mean
that the data set is imbalanced?
Scott Clendaniel: 00:23:52 Sure. This is typically referred to as class imbalance. In
other words, if I'm trying to do a classification problem,
and let's say that I'm trying to do fraud versus non-fraud.
If you look at the distribution of how many transactions
are fraud and how many transactions are non-fraud, your
fraud rate tends to be relatively small, down into the tenths
or hundredths of a percent. The problem is if you try and
compare the fraud transactions versus the non-fraud
transactions, a lot of algorithms are going to choke on
that unless you adjust the balance of the dataset so that
you can have more of a 50/50 ratio between fraud and
non-fraud.
Scott Clendaniel: 00:24:33 Otherwise, what the model's going to do is go, "Let's see.
I've got 1,000 transactions. 990 are not fraud, and 10 are
fraud. I've got it. They're all not fraud." It's going to be
right 99% of the time. It's just going to mess everything
up, so you need to make [crosstalk 00:24:52].
Kirill Eremenko: 00:24:51 It's going to have fantastic metrics.
Scott Clendaniel: 00:24:54 Yes. That's amazing. I'm done. I just said there is no
fraud, and I'm going to go home and have lunch. That
doesn't work out too well in the real world. What you do is
you tend to over-sample what's called the minority class,
so in this case, the fraud transactions. I might take every
fraud transaction I can get into my training set. I'm going
to compare it against an approximately
equal number of non-fraud. That means that the model
can't just arbitrarily say, "Okay, everything is not fraud."
Scott Clendaniel: 00:25:27 That's the technique that I use the most. There are other
techniques that you can use, including doing all sorts of
complicated things with synthetic data or SMOTE
techniques or those types of things, but oversampling
the minority class, lots and lots of the fraud and much
fewer of the non-fraud, has been the technique that's
worked the best for me. It's also very simple.
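
As a concrete illustration of the approach Scott describes, here is a minimal pandas sketch (the DataFrame and its `is_fraud` column are assumed for the example):

```python
import pandas as pd

def balance_classes(df, label="is_fraud", seed=42):
    # Keep every fraud record (the minority class) and pair it with an
    # equally sized random sample of non-fraud records, approximating
    # the 50/50 ratio discussed above.
    fraud = df[df[label] == 1]
    non_fraud = df[df[label] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so the model doesn't see all of one class first.
    return pd.concat([fraud, non_fraud]).sample(frac=1.0, random_state=seed)
```

Drawing the non-fraud records at random matters: it keeps the reduced sample representative of the majority class even though not every record is used.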
Kirill Eremenko: 00:25:50 What's the drawback? What's the potential danger of
using this technique?
Scott Clendaniel: 00:25:57 Part of the problem is you may not have enough fraud
cases to use in the first place. You may have such a small
number of records that you may not be able to use that in
its purest form, but you're definitely going to want to
move your sampling as close to a 50/50 ratio as you can.
Kirill Eremenko: 00:26:17 Gotcha. As long as you select the other ones, the non-
fraud ones, at random, you shouldn't have any bias in
your model, even though you didn't use all of the available
records.
Scott Clendaniel: 00:26:29 Correct. Randomization becomes really important. Some
of the experts in the group who have a classic statistical
background can talk a lot more about random versus
non-random and how there is no true random, but there
are all sorts of techniques you can use to make sure. For
most of the work that I've done, I just use random
functions in Python or SQL. That has worked pretty well for
me, and I haven't run up against [inaudible 00:26:55] situations.
Kirill Eremenko: 00:26:57 Awesome. Speaking of Python and SQL, what kind of
tools do you use in your day to day?
Scott Clendaniel: 00:27:06 Python, Spark are the most common. Because I am older
than dirt, I started doing this all the way back in 1986,
God forbid. I actually grew up using GUI tools like
SPSS, which is now owned by IBM, and Salford Systems,
which is now owned by Minitab, lots of other GUI tools.
There's actually a free one, especially if you're just starting
in the field and you don't have a developer background,
called Orange Data Mining, which is included in the
Anaconda distribution. Those are a couple of places that
you can start, but eventually, it tends to turn into a lot more
Python as a starting point, and probably Spark if I'm
using a distributed system to build.
Kirill Eremenko: 00:27:55 Gotcha. You said Orange Data Mining.
Scott Clendaniel: 00:27:58 Orange Data Mining, yeah. It is not the prettiest program
you will ever encounter. The interface looks at least about
20 years out of date. Don't be fooled by how the interface
looks, because there is a lot of power underneath it. A lot
of people get turned off. They're like, "Oh, this doesn't
look cool. This doesn't look like something that was built
for Apple. I'm not going to use it." I would say that's a
mistake. They've actually done a really good job of
creating a GUI interface to sit on top, and underneath,
it's primarily Python. [crosstalk
00:28:29].
Kirill Eremenko: 00:28:29 Is this primarily for building models or fraud detection?
Scott Clendaniel: 00:28:36 Any type of model.
Kirill Eremenko: 00:28:38 Gotcha. Awesome. Awesome.
Scott Clendaniel: 00:28:40 I've been really happy with it. People laugh at me when I
show the screenshots, but it actually works pretty well.
Kirill Eremenko: 00:28:46 Gotcha. What kind of models have you noticed work the
best with fraud detection? We've got a huge range. K-
means clustering. We've got Naïve Bayes. We've got
logistic regression. We've got XGBoost, and so on. What
would you say are your go-to models? When you have a
fraud-detection problem, what's your first, second, third
choice?
Scott Clendaniel: 00:29:12 First of all, I'm going to throw out an old theory that I
hope people look into, which is essentially called the
multiplicity of good models, which means if you have the
right data and you've prepared it the right way, all sorts
of algorithms are likely to give you a very similar, positive
result. If you haven't done the data prep correctly, you
will start to see wide variances. That being said,
ensembling techniques of any type would be my favorite,
probably the most common being XGBoost.
Scott Clendaniel: 00:29:44 Also, as a data prep technique, I recommend target
mean encoding. That can be extremely helpful. In terms
of the final technique, I always recommend ensembling
because each algorithm has its own strengths and
weaknesses. If you'll ensemble a group of different models
together, you're likely to end up with a better result. The
final model is usually logistic regression based on the
inputs of the XGBoost and other types of techniques in
the family.
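
A minimal sketch of that ensembling pattern, using scikit-learn's stacking classifier with an XGBoost base learner (xgboost is a separate install, and the choice of base models here is illustrative, not Scott's exact setup):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Several base learners feed their predictions into a final logistic
# regression, echoing the setup described above.
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=200)),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # the stacker is trained on out-of-fold predictions to limit leakage
)
# Usage: stack.fit(X_train, y_train); stack.predict_proba(X_test)
```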
Kirill Eremenko: 00:30:15 Wow. Thank you, very, very detailed and advanced. This
data prep technique, I think you mentioned, target mean
encoding.
Scott Clendaniel: 00:30:21 I got really passionate about this stuff, so I may bury you
in detail, and I apologize.
Kirill Eremenko: 00:30:24 That's okay. That's okay. That's okay. I want this to be an
advanced discussion, well, advanced learning for me.
Scott Clendaniel: 00:30:33 That's fine. That's fine.
Kirill Eremenko: 00:30:34 I wanted to ask you about this data prep technique,
target mean encoding. I don't know it. I haven't heard of it
before. Could you tell us a bit about it, if you can just
explain what does it do and how [crosstalk 00:30:44]?
Scott Clendaniel: 00:30:44 Sure. Let me give you a really simple example. Let's say
that we are an auto insurance company, and we want to
test the old myth that red cars always cause
more problems than others. I've got a red car, a blue car,
a yellow car, a white car and a black car. I've got five
different car colors. I want to encode into my model this
categorical variable, so rather than use the original values
of the colors in the variable, you use a transformed
variable, so a new variable. Instead of recording the car
color there, you actually put the claim rate for each color.
Scott Clendaniel: 00:31:29 If a blue car has a 0.1% claim rate, you put 0.1 any
time it's blue. If it's red, you'll use 0.2% any time
red shows up. In other words, you convert the original
categorical input into its actual target mean. What's the
mean rate that this issue is going to come up with? You
use the transformed variable as opposed to the original
variable. It tends to work better than one-hot encoding,
which you would usually use. Also, you can use it in
pretty much any algorithm, including algorithms that
only take numerical inputs like the original XGBoost.
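
To make that concrete, here is a tiny pandas sketch of target mean encoding for the car-color example (the numbers are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "white", "blue", "red", "white"],
    "claim": [1, 0, 0, 0, 0, 1, 0],
})

# Replace each category with the mean of the target for that category.
color_means = df.groupby("color")["claim"].mean()
df["color_encoded"] = df["color"].map(color_means)
# red -> 0.667, blue -> 0.0, white -> 0.0

# Caution: in practice, compute the means on training data only (ideally
# out-of-fold), or the encoding leaks the target into your features.
```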
Kirill Eremenko: 00:32:10 Wow. Wow. That's awesome. I hadn't heard of that one.
I've heard of one-hot encoding, but still, do you mind
refreshing my memory on that, please?
Scott Clendaniel: 00:32:18 Sure. One-hot encoding, so for each of our five car colors,
we're going to take the original variable, and we're going
to move it out of the dataset. We're going to create five
new variables. One variable is going to be, "Was the car
color white, zero, one?" If it's white, you can put one. If it
wasn't white, put zero. Then you have a second column
that says, "Was the car color blue?" The third variable
says, "Was the car color red?"
Scott Clendaniel: 00:32:42 You can run into issues with that, in that you can end
up with a sea of categories, and you can overload your
model with way too many inputs.
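
Here is the equivalent one-hot sketch in pandas (illustrative; `drop_first=True` keeps one fewer column than the number of categories, which also addresses the dummy variable trap raised next):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "white", "red", "black"]})

# Each remaining category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(one_hot.head())
```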
Kirill Eremenko: 00:32:55 Gotcha. Gotcha. You also got to be careful of the dummy
variable trap there, right?
Scott Clendaniel: 00:33:00 Absolutely.
Kirill Eremenko: 00:33:00 You have to have one less than the original number of
categories. Gotcha. Very interesting. Thank you. That's
very exciting. Let's move on. Now, let's talk a bit about
models. Sonam, Sonam, I hope I'm pronouncing it right,
asks, "What are the parameters to look out for and or test
to perform that indicate model drift while monitoring a
production AI or machine learning model?" To start off,
what is model drift?
Scott Clendaniel: 00:33:33 Model drift is a silly name, but basically, it means that
the performance is going to drift over time. All models
tend to decay over time, and different models decay at
different rates. There are a couple of things that I'd
suggest with this. Number one, assume that your model
is going to decay over time. Don't guess it's going to
decay. Just assume it's going to decay, and set up a
schedule of when you're going to retrain any model you
have.
Scott Clendaniel: 00:33:59 That's the first component. Plan for obsolescence before
you put the first model in. That's very important. The
second thing is to use what's called a population stability
report, which sounds like some bizarre sociology
experiment, but essentially, what it's doing is to say, "My
data that I started off with in one period, how similar does
it look to data that I'm looking at right now? Are those
two populations similar or not?"
Scott Clendaniel: 00:34:27 When the population stability report comes in and says,
"Hey, wait a minute, your data starts to look different,"
you definitely need to retrain your model, and that the
world that the model is trying to represent has changed.
Therefore, the accuracy of its predictions has changed.
Just assume it's going to happen. I get irritated with folks
who just want to go out there and say, "I have created the
world's greatest model, but it's based on data from seven
years ago. And I don't know why it's performing poorly."
Scott Clendaniel: 00:34:54 Come on, seriously? When you put in your model, test it
constantly, and use that population stability report to
keep your eye open to see whether the world's changed.
Kirill Eremenko: 00:35:04 Gotcha. A couple of questions here. First one will be,
"Why..." This might be a naive question, but I'm just
tempted to ask, "Why do models decay over time? Why do
they never get better? Why is it that way?"
Scott Clendaniel: 00:35:21 Trick question. They decay over time because they
represent the world as they knew it, and the world
changes. The model does its absolute best to represent
the world as it saw it in your training set. When the world
changes, the type of data that shows up in your training
set may drift. Let's pick an easy example. Inflation, prices
tend to rise over time for lots and lots of things. If your
original model was based on prices from three years ago,
the prices now look different, so the model needs to adapt
to reflect that change to come up with representation of
what happens today.
Scott Clendaniel: 00:36:02 That's why all models tend to drift. However, that isn't
necessarily a bad thing in that you may have learned
more information over time. You may have a larger data
set to look at. You may have found new variables that you
didn't see before. You may have new algorithms you want
to test out. Actually in the end, you can end up with a
model that is more accurate than the last model was at
its peak. Over time you actually can get better. That's one
of the things that's exciting to me.
Kirill Eremenko: 00:36:32 Once you update it, of course, right? If you leave it alone,
it's not going to get better.
Scott Clendaniel: 00:36:37 No. I wish.
Kirill Eremenko: 00:36:39 Gotcha. In your answers on LinkedIn, it was really
inspiring to see that you went through and answered
everybody. You do a huge service for data science.
Scott Clendaniel: 00:36:52 Tried. If I missed anybody, send me a message.
Kirill Eremenko: 00:36:55 Awesome. You mentioned that six months is your magic
number to look at. Why is that?
Scott Clendaniel: 00:37:05 It is completely arbitrary. I'll tell you why. If you try and
make it annual, it tends not to happen. In the real world,
organizations are like, "We don't have time. We did it last
year," or whatever. If you make it six months, you keep it
top of mind with everybody. Six months is the outer limit,
and then if the population stability report says, "Wait a
minute, the world looks different," or the performance of
the model tends to fall off... You've got a fraud model, and
your fraud rate keeps inching up.
Scott Clendaniel: 00:37:36 With either of those two events, you look at it in a shorter
period of time. But if you actually build that into the calendar on
the front end, and also explain to your stakeholders,
"Model drift is a thing, and you need..." Modeling is a
process. It's a process of learning, and it's adapting to
change as information changes. Just bake that into the
schedule from the beginning, and you're going to be in a
lot better shape.
Kirill Eremenko: 00:38:00 Been there. I was in an organization once where they
didn't look at the model. Some consultants delivered a
segmentation model. They did... oh, no, a prediction model,
prediction in terms of who will churn, who won't churn.
Nobody looked at it for 18 months. When we had to look at
it...
Scott Clendaniel: 00:38:21 Ouch.
Kirill Eremenko: 00:38:22 Its accuracy dropped from 78% or something like that
down to 49, so it was better to flip a coin.
Scott Clendaniel: 00:38:32 Well, let me throw out a made-up word for everyone.
That's nonstop optimization. Instead of optimization, if
you're constantly nonstop trying to optimize, you call it
nonstop optimization. Senior management loves phrases.
I'm kidding. But if you take on that theory that I am
always going to be improving my model, that it's an
organic process, that it's something that you put in as a
regular business process as opposed to a snapshot,
one-time event that's going to fix all our problems, I think
you'll be in better shape.
Kirill Eremenko: 00:39:06 Great. The population stability report, what do you look
at there? Do you look at means, distributions? What are
the [crosstalk 00:39:13] or maybe-
Scott Clendaniel: 00:39:14 It'll actually give you an indicator on a scale of zero to one
of how much things have changed.
Kirill Eremenko: 00:39:21 Is it like a library in Python?
Scott Clendaniel: 00:39:24 Yeah. For example, if incomes on the original data set
were $62,000, and now it's $140,000, something's wrong.
It also helps you to figure out if your data stream has
been corrupted. In other words, let's say the model works
fine when it has accurate data, but somehow, something
has happened, and now the data isn't as good as it was
before, you can then go in and say, "Okay, wait a minute.
Something's wrong. The population looks different from
what it was before."
Scott Clendaniel: 00:39:54 Maybe we've got a problem with the database. Maybe
we've got a problem in how the data was collected. Maybe
we've got a metric versus English units issue that
suddenly popped up.
Kirill Eremenko: 00:40:06 Gotcha. Can anybody go and download this population
stability report, or does everybody need to build their own
version of it?
Scott Clendaniel: 00:40:14 Yes. The formulas are out there. I can't recite them off the
top of my head, but if you just type "population stability
report" into Google, it can give you a walkthrough of that.
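
One common way to compute such a report is the population stability index (PSI). A minimal sketch, assuming two numeric arrays holding the baseline and current values of a feature:

```python
import numpy as np

def psi(baseline, current, n_bins=10):
    # Bin edges come from the baseline (training-time) distribution.
    edges = np.unique(np.percentile(baseline, np.linspace(0, 100, n_bins + 1)))
    # Clip current values so outliers still land in the end bins.
    current = np.clip(current, edges[0], edges[-1])
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(current, bins=edges)[0] / len(current)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0) on empty bins
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

A frequently quoted rule of thumb: below 0.1 is stable, 0.1 to 0.25 is worth watching, and above 0.25 usually means the population has shifted enough to retrain.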
Kirill Eremenko: 00:40:22 Awesome. Fantastic. Speaking of data collection, there
was a question from Santosh: "How do we check stability
and consistency in the process before using the data it
generated for model building?" I understood there was a
bit of a mix-up: he meant one thing, and you first answered
another question, and then answered the second one. Let's
start with the first one. The first one was like, "How do we
check for stability and consistency in terms of the data
collection process?"
Scott Clendaniel: 00:40:52 Sure. Usually, this is done further upstream, before
we get the data. This is usually done by the folks who
are doing your data ingestion or your ETL process on the
data upfront. The point he's making is extremely
valuable. It's as simple as garbage in garbage out. If the
data you're putting into your model has flaws in it, your
model isn't going to work. One of the things is actually sit
down with the people who are the stewards of that data.
Scott Clendaniel: 00:41:20 How is this collected? How often? How long have we been
keeping track of this? This is also one of the reasons why
data visualization is so important. See if the data makes
any sense before you start loading it into your model. His
point, I think, was data scientists are so excited. They
want to have a model. They want to have results. They
want to use their area of expertise. They want to pull the
algorithm out of the quiver and start shooting.
Scott Clendaniel: 00:41:47 That is a problem if you haven't checked the data
consistency upfront. I will give you a real-world example
from a client who shall not be mentioned, where in the
original data stream, there was some type of corruption
when data was migrated from one database to the
next. If it was a dollar amount, they literally
physically typed the dollar sign in the value. If there was
a comma, they type the comma. If there was a decimal,
they type the decimal.
Scott Clendaniel: 00:42:16 At other points in time, this wasn't true, so it was just the
raw number. This is why you really need to understand
your data. To his point, you need to make sure that data
seems to make some type of sense. Sit down with the
steward of that data to make sure that you understand
what you're dealing with before you get too far down the
pipe.
Kirill Eremenko: 00:42:34 Awesome. Gotcha. Then you also talked about, in your
answers, something that intrigued me. You said that if for
some reason the data is corrupt, then cross-validation of
the model should also fail. Basically, as I understood, we
could probably use cross-validation as an indication of
whether there are problems [crosstalk 00:42:58].
Scott Clendaniel: 00:42:57 Absolutely.
Kirill Eremenko: 00:42:59 Tell us a bit more about that.
Scott Clendaniel: 00:43:01 The great thing about models and cross-validation is it's
virtually impossible to come up with a great cross-
validated model based on bad data, because it just won't
work.
Kirill Eremenko: 00:43:14 What is cross-validation?
Scott Clendaniel: 00:43:15 When you're doing cross-validation, you're taking the
original data set, and you're dividing it into what they call
folds. Let's say we're going to take your dataset. We're
going to break it into five folds. You build the first model
on four of the folds, and you leave out the fifth to test on.
On the second one, you may use folds two through five,
and you test it on the first. You want to make sure that those
results look very similar.
Scott Clendaniel: 00:43:38 If they don't, you've got a problem in the model, so you
either average the results of the folds, or what I tend to do
is definitely do that first part, but also go back and say,
"What is weird about that one fold that doesn't seem to be
working very well? Why is the performance off for this
version versus that version?"
Kirill Eremenko: 00:43:56 Gotcha. Is the data in the folds randomized? Before you
select the folds, you randomize them.
Scott Clendaniel: 00:44:03 Absolutely.
Kirill Eremenko: 00:44:04 Gotcha. Just to recap, let's say we have, I don't know,
500,000 records. Then you break it down into five groups
of 100,000 each. In the first version, you train
the model on the first four groups of the data, and then
the fifth one, you test it. In the second version of the
model, you train it on, say, the first, second, third, and
fifth groups. Then you test it on the fourth. Then you
train it on the first, second, fourth, and fifth, and you test
it on the third.
Kirill Eremenko: 00:44:38 You're always shifting this window. Ideally, you should be
getting the same results throughout, similar results.
Scott Clendaniel: 00:44:48 They should be really similar. Also, your final model
should take the average of each prediction.
Kirill Eremenko: 00:44:54 Gotcha.
Scott Clendaniel: 00:44:56 There are some folks who just use cross-validation for
testing hyperparameters and things like that. If
production allows me to be able to use all five models and
take the average of the scores, that's what I like to do.
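
A minimal scikit-learn sketch of the five-fold scheme just described, assuming a feature matrix X and binary labels y as numpy arrays (the model choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def five_fold_aucs(X, y, seed=42):
    # shuffle=True randomizes the records before the folds are cut,
    # and stratification keeps the fraud rate similar in every fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return scores  # inspect the average and how much the folds differ

# Usage: aucs = five_fold_aucs(X, y); print(np.mean(aucs), np.std(aucs))
```

If one fold's score stands apart from the others, that is the cue Scott mentions to go back and ask what is weird about that slice of the data.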
Kirill Eremenko: 00:45:15 If you have corrupted data, or errors in the data, and you
randomize the data before you make the five folds,
wouldn't the corrupted records distribute equally across
the five folds, and therefore the model would still perform
identically, but still there would be-
Scott Clendaniel: 00:45:37 Well, it would be identically terrible performance. You're
not just looking to see if it differs among the five. If my
area under the curve is 51%, I've got a problem. I need to
be really aware of that. To your point, yes, they will look
similar, and they will look similarly awful.
Kirill Eremenko: 00:45:59 Gotcha. Gotcha. Could you do the cross-validation
without randomizing first? Then you would more likely
have one of those folds that is definitely
underperforming.
Scott Clendaniel: 00:46:12 No, I would not recommend that, because you lose...
The real advantage of the randomization is the fact that it
eliminates or greatly reduces the chance that you're just
looking at a few records that are off, or that you've got
outliers or that type of thing.
Kirill Eremenko: 00:46:29 Gotcha. Gotcha. Awesome. Thank you. Next one was...
This is an interesting question. This one is more about
complexity of machine learning models. What is your
experience... This is a question from Desmond. "What is
your experience with the complexity of machine learning
models and alpha, or outperformance? Do the most
complex models, XGBoost, neural nets, yield the highest
alpha, or are there other factors that yield higher alpha?
For example, type of data, feature engineering, et cetera."
First of all, what is alpha?
Scott Clendaniel: 00:47:14 Well, I'll tell you what. I am going to do what all good
interviewees do. When they don't have the pure
understanding of the answer, they change the question.
I'm going to treat this as a question on overfitting as
opposed to the pure definition of alpha. I'm going to
regard overperformance as a function of overfitting the
model, which means that it's basically memorizing the
data. As I said, I am older than dirt. When I went to
school and studied math, they used to do this weird thing
where they would give you the answers to the odd-
numbered questions in the back of the book, but they
wouldn't tell you the answers for the even-numbered
questions.
Scott Clendaniel: 00:47:54 What would happen is if you just tried to memorize the
answers, you'd only be 50% right. Models tend to be very
greedy. If they get the chance to memorize the answers,
they will do it. What you want to do is to make sure that
it's very difficult for the model to actually do that.
Otherwise, what it's learned is the answers to the
specific records that you gave it, as opposed to
identifying the true patterns that can be applied to other
folks. If I went and I got a suit, and it was completely
custom fit, but for somebody else, it's not going to fit me
very well.
Scott Clendaniel: 00:48:32 It's overfitted to the wrong person. I want to make sure
that my model generalizes well. I want to make sure that
my model applies to all kinds of different people.
Overfitting is a big problem. The more complex the model,
such as deep learning, if I've got a 600-layer deep learning
neural network, and I've got 10,000 records, I've got a
problem, because in many cases, it's going to try and
memorize the data itself as opposed to learn the patterns.
The algorithm can be a component.
Scott Clendaniel: 00:49:03 You can also have data leakage issues, where it's
memorizing the answer because it's actually included in
the original data set. So in answer to the question:
algorithms can be a problem, the data can be a problem.
There are all kinds of things that can lead you down that
path. The way to get around that is to use very robust
validation methods, including cross-validation, to try and
eliminate the possibility of models actually memorizing
the answers in the back of the book.
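
One simple way to catch that memorization, sketched here with scikit-learn on synthetic data (the deliberately unconstrained tree is just to make the effect visible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# An unlimited-depth tree is free to memorize its training data.
model = DecisionTreeClassifier(max_depth=None, random_state=42)
model.fit(X, y)

train_acc = model.score(X, y)  # often near 1.0: the "answers in the back of the book"
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # the honest, held-out estimate

# A large gap between train_acc and cv_acc signals overfitting.
print(f"train accuracy: {train_acc:.3f}, cross-validated accuracy: {cv_acc:.3f}")
```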
Kirill Eremenko: 00:49:35 Wow. Wow. That's fantastic. I just got transfixed on your
answer. [crosstalk 00:49:43].
Scott Clendaniel: 00:49:43 Oh, sorry.
Kirill Eremenko: 00:49:44 That's awesome. Thank you. Definitely an important point
to look at. At what stage would you say people should
keep that in mind when they're building a model?
Scott Clendaniel: 00:50:01 From the beginning. As data scientists, we tend to be
really tempted to jump in and build the model. I'm a
model builder, so I want to build a model. That
might not be the first thing that you do. You really have
to have a solid design in terms of what your model
building process is going to be. Coming up with the
validation strategy should be one of the first things you
do because you've got to segment your data out into
what's going to be test, what's going to be training, what's
going to be validation or cross validation before you start
the modeling process, or you've already contaminated the
experiment, so to speak.
Scott Clendaniel: 00:50:39 In your process design, before you sit down, that should
definitely be a component that you look at.
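
A minimal sketch of carving the data up before any modeling begins (split sizes and the synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Set the test set aside first, so nothing you do later can contaminate it.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Then split what remains into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```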
Kirill Eremenko: 00:50:47 Gotcha. How much time should be spent on designing the
process versus implementing the model?
Scott Clendaniel: 00:50:54 Well, if you come up with a standardized process in terms
of how you select variables and your randomization, you
can actually bake it into the process. Once you do it once,
you shouldn't have to repeat it a whole bunch of times,
but you should definitely be very robust on your first
model and see what components you can reuse.
Randomization should be something you should be able
to do in every model in a specific script. You shouldn't
have to reinvent the wheel every time.
Kirill Eremenko: 00:51:21 Gotcha. The follow-up question from Santosh was, he
said, "Give an example." Let's say I want to build a
forecasting model for daily pizza sales, and say I have
data for the past year.
Scott Clendaniel: 00:51:36 After my own heart, pizza sales.
Kirill Eremenko: 00:51:40 My question is, unless the processes that drive pizza
sales are consistent, we can't rely on the data. For
example, there was a change in employees every month,
a change in tools being used and so on. On the other
hand, if the pizza store started just a few weeks ago, it
might not have reached this stage because... The main
thing is, how do we know? You have data for a whole
year, but then there are changes in the process,
employees or how we do sales and so on. Can we still
use that for modeling?
Scott Clendaniel: 00:52:15 Sure. What I'll say to that is you're absolutely right, very
important point, but it is very rare that a modeler is ever
going to have a perfect data set to start with. The trick is,
"What is my decision-making process now, and can I
improve it with the model, and if so, by how much?" You
never get to perfect. You can try to get to a perfect
representation of the system for your pizza prediction, but
every organization has employee turnover. Every
organization has some of those elements in play.
Scott Clendaniel: 00:52:50 That's why you also have to be very careful with your
validation strategy to make sure that your model holds up
on data it's never seen before. That's why I keep banging
that drum. The question is, "Is there something that I can
learn from the data that I have right now using standard
validation procedures?" If I can, and I increase the
performance of my decisions, if I make 10% better
decisions, does that help my business? If it does, remember
the Teddy Roosevelt quote: "Do what you can, with what
you have, where you are."
Scott Clendaniel: 00:53:26 Make sure that your client, whoever is going to be using
your model, understands. This is what I think I know.
This is what I don't think I know. This is how much I
think I can improve performance with the model over
where you are right now. Then you work with a client to
say, "Is it worth the effort? Is the juice worth the squeeze,
so to speak?" That's why you need to work with your
client as opposed to just turning this into some academic
exercise off in an ivory tower.
Kirill Eremenko: 00:53:52 Nice. Fantastic. Scott, thank you. Those were all the
advanced questions. We're moving on to the world-
Scott Clendaniel: 00:54:02 Okay.
Kirill Eremenko: 00:54:03 You did well. You did really well. We're moving on to more
soft skills and predictions and forecasts for the future. A
question from Muhammad, "What is the difference
between a data scientist and leading a data science
division?" Basically, what is the difference in skills
required to be a data scientist and to lead a data science
division?
Scott Clendaniel: 00:54:33 To lead a data science division, I think you need the skills
of a data scientist plus a couple of other things. One is
strategy: making sure that you're setting the appropriate
goals across all modeling projects, not just your individual
modeling projects. The other is management: you need to
be able to work with people to coach them to get the
absolute best performance that they can achieve, not just
what you can do best and what you're working on right
now.
Kirill Eremenko: 00:55:01 What about the people skills required for a data scientist,
such as communication and presentation? Are they
different, and how are they different from the
management people skills required for a lead data
scientist?
Scott Clendaniel: 00:55:15 I don't think that they are different. I think the problem is
people skip over them altogether. The quality of
management in general in most organizations can be
somewhat appalling, not just for people who manage data
scientists, but for people who manage any type of group. I
think it's a real chink in the armor for all types of
organizations. I think the difference would be that you
may communicate to data scientists in their own
language.
Scott Clendaniel: 00:55:43 You must be able to speak their language and be able to
establish their trust and be able to work with them to get
them to their highest performance. If you were dealing
with an accounting team, you need to be able to speak
the accounting language to be able to help them reach
their highest performance. It's not so much different
skills. It's applying the right skills to the type of team you
have, I think.
Kirill Eremenko: 00:56:07 Gotcha. I know you're passionate about leading data
science teams. Why are you passionate about that?
Scott Clendaniel: 00:56:15 Because I think so much can be done outside of just the
algorithm, and I think there has been such a push,
especially in the past 10 years, on the type of algorithm
you use. The algorithm isn't necessarily what's going to lead
you to the best performance. I'm going to steal a story
from Stephen Covey. He said to pretend you had a
bunch of folks, and they're trying to cut a trail through
the jungle. They're like, "Okay, we're going to have Fred in
the front of the group because he's really good at dealing
with a machete, and we're going to have Michelle. She is
the absolute expert at machete sharpening, and she's
really good at that part, and such and so forth and
everything else."
Scott Clendaniel: 00:56:53 The leader is the one who climbs the ladder up and
shouts down, "Wrong jungle." You got to be able to
change. You need to be able to figure out if you're in the
right jungle or not. A lot of managers are not terribly good
at that, and so you need to have that holistic view in
addition to the expertise of data scientists.
Kirill Eremenko: 00:57:17 Fantastic. What advice do you have for data scientists or
advanced data scientists who want to become leaders or
data science managers?
Scott Clendaniel: 00:57:32 It's understanding a fact that all data scientists need to
come to grips with, and that is that data science is not
about data. It's about people. Let me explain what I mean
by that. You're trying to solve people's problems. You're
trying to help people. You're trying to communicate with
people. Whether you're in accounting or data science or
medicine or nuclear physics or social work or anything
else, it's about people. A lot of people come into our field
unfortunately, because they don't really like dealing with
people. They like numbers more.
Scott Clendaniel: 00:58:04 It needs to be a blend. At the end of the day, you're
always trying to help people meet their needs. Data
scientists use data and algorithms and techniques to be
able to achieve that goal, but the goal isn't different. The more you understand people, communication, storytelling, visualization, the so-called soft skills, the better you'll be able to grease the wheels, get people to where they need to be, and solve their problems in the best possible way.
Scott Clendaniel: 00:58:35 You can't skimp on that and lead a data science team effectively.
Kirill Eremenko: 00:58:42 Love it, so not just data science leadership, but data
science itself is about people. If you want to become a data science leader one day, then start now. As a data scientist, start honing your people skills.
Scott Clendaniel: 00:58:59 Absolutely.
Kirill Eremenko: 00:59:00 What's the recommendation? How does somebody go about it? There aren't many online courses on people skills, though there are a lot on technical skills. Where do you learn the people skills?
Scott Clendaniel: 00:59:10 It depends on where you look. I'm actually going to push back on that one a little bit. We in the data science community like to read our data science blogs, and if you focus only there, that's all you're going to find. There are all kinds of resources. I'll tell you, my particular favorite is the work of Stephen Covey, and my book recommendation, which I'm going to sneak in here while you're not looking, is the Seven Habits of Highly Effective People.
Scott Clendaniel: 00:59:36 It talks a lot about people skills, talks a lot about problem
solving. That can be applied to data science or accounting
or physics or sociology or anything else. Focus on those types of skills. Also, take data visualization classes, not just ones that show how to visualize data, but ones that teach you to choose a visualization based on your audience and that audience's needs. Focus on that piece. What do we need to communicate?
Scott Clendaniel: 01:00:02 What do they need to be able to solve their problem, and how do I give that to them? As opposed to, "Here's my wildly cool analysis that I did, 700 pages long, that no one's ever going to read." That's not the solution to the problem.
Kirill Eremenko: 01:00:17 Gotcha. Absolutely. Let's move on to questions about the future. Snehal asks, "What will come next after advanced AI?"
Scott Clendaniel: 01:00:34 If we're not careful, we're going to run into another AI
winter. Let me tell you what I mean by that. If you look at
Gartner's Hype Cycle, where they talk about the stages of
a technology, you tend to get overly hyped expectations,
and then you end up falling off the cliff into what they call
the trough of disillusionment. We can bicker over whether
those are good names or not. But if you set people's
expectations too high, and then you don't meet them,
they don't tend to say, "Scott Clendaniel's particular
model didn't do very well."
Scott Clendaniel: 01:01:07 What they tend to say is, "I knew that AI stuff was a
bunch of hooey, and we never should have invested in it. I
don't want to do a model again. I don't want anyone
coming in here talking about statistics. I don't want to
hear about machine learning. I don't want to hear... It's
all garbage, because Scott Clendaniel's first model didn't
do well." You need to really be aware of that. 85%,
according to Gartner, of all models do not reach
production.
Scott Clendaniel: 01:01:29 Think about that for a second, 85%. Our industry currently has a 15% success rate. I don't know of too many fields that can survive that. My big concern is that unless we get back to actually solving organizational issues and fixing real problems, as opposed to, "Gee, look at my AUC. Doesn't it look great?", the future of AI is going to go into a dark period for a while.
Kirill Eremenko: 01:01:57 What about on the flip side? Jacques asks, "There's a fear
that AI could replace humans in their jobs. What would
you tell a concerned human being about that?"
Scott Clendaniel: 01:02:11 I would look at a lot of the research that's come out of RPA, robotic process automation. To the largest extent, in that research and at the conferences I've been to, people's jobs don't get replaced. In other words, people don't get replaced. They get different jobs. Meaning that if they're working in the accounting department, they stop copying stuff in Excel from spreadsheet A to spreadsheet B, and get to work on, "Do we need the spreadsheet in the first place?"
Scott Clendaniel: 01:02:37 That's not a bad thing. There was a lot of concern that AI was going to replace all kinds of people, and I haven't seen a lot of that happen yet. I don't think radiology is the first career I would jump into right at the moment, because a lot of it is being automated, but you may end up with different types of jobs that a radiologist might take on, like applying the results from lots of different analyses, from all kinds of different X-rays, to diagnosing a disease.
Scott Clendaniel: 01:03:05 I think some jobs always get replaced by technology.
There aren't a lot of buggy whip manufacturing jobs left
anymore, but I don't see AI replacing all kinds of people.
The head of Stanford's AI lab had a great quote: "We're a lot closer to discovering a smart washing machine than we are to Terminators taking over the world." I think that's true. I think we need to be careful of it, but I think people are perhaps overly concerned at this point in time that AI is going to replace everyone's job.
Kirill Eremenko: 01:03:42 Gotcha. There was a report by the World Economic Forum in 2018 that predicted that, I think by 2025 or 2022, I'm not sure exactly of the year, AI will displace 75 million jobs worldwide, whereas it will create 133 million jobs. That's a ratio of roughly 1.8 jobs created for every job displaced.
Scott Clendaniel: 01:04:07 I think that's a much better example than the one I just
gave.
Kirill Eremenko: 01:04:13 I think they're both absolutely valid. What are your thoughts on AI replacing data scientists themselves, specifically AutoML and products like DataRobot?
Scott Clendaniel: 01:04:25 I need to be careful here. I need to choose my words carefully on that one.
Kirill Eremenko: 01:04:29 We can skip that question. That's okay.
Scott Clendaniel: 01:04:32 I think that a lot of AI functions can be helped by automation. Ensembling, picking the correct algorithm, hyperparameter tuning, a lot of those will become automated, but there's a lot of room for creativity on the feature engineering side. There's a lot of room for creativity that's hard to replace. Even something as simple as ratios: models are terrible at calculating ratios. They just are.
Scott Clendaniel: 01:05:02 For example, think of a credit score. Everything I currently owe, as one input, is not a great predictor. How much total credit I have available, if you add up the credit lines from all my credit cards, is also not a very good predictor by itself. A classic algorithm might say, "Okay, let's throw them both out, because they don't have high correlation to my result." The trick is the percent utilization: out of that big pile of credit, what percentage am I using? That is hugely predictive of your credit score.
Scott Clendaniel: 01:05:35 It's things as simple as ratios that I think are going to be hard to automate away. Many functions may be assisted by automation, even in data science, but if we focus on the right skills and the problem-solving aspects, our work is less likely to be automated away, at least in the short term.
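[Editor's note: As an illustration of the kind of automation Scott expects, here is a minimal sketch of automated hyperparameter tuning using scikit-learn's GridSearchCV. The model, parameter grid, and toy data are hypothetical, chosen only for illustration.]

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a real modeling problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The search tries every parameter combination with cross-validation
# and keeps the best-scoring one.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```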
Kirill Eremenko: 01:06:00 Gotcha. Thank you, very, very cool answer. Adly asks,
"Does every business need to adopt AI?"
Scott Clendaniel: 01:06:11 No is the short answer. [inaudible 01:06:16] not every business does. I think it's silly for us to pretend that every business in the world needs AI. Every business could stand to make better decisions than it does right now, and to the extent that AI helps with that, great. To the extent that AI doesn't help with that, no. Also, some businesses don't have a lot of good data. If you don't have good data, you can't really build great models, so AI isn't going to be a particular help.
Scott Clendaniel: 01:06:45 I think that every business needs to make better
decisions, and businesses that have access to good AI
should take advantage of it. Those that don't, don't worry
about it.
Kirill Eremenko: 01:06:57 What are your thoughts on Andrew Ng's comment that AI is the new electricity? Similar to how, 100 years ago, only 50% of the U.S. was electrified and now everything uses electricity, AI will be adopted, similarly but faster, by virtually all organizations; otherwise, they'll lose out to competitive pressure. What are your thoughts on that, keeping in mind that not every business needs AI?
Scott Clendaniel: 01:07:25 Well, let's follow that example through. In the 1920s, you used to have organizations with a CEO, but it wasn't a chief executive officer. It was a chief electricity officer, whose sole responsibility was to figure out... I don't know a whole lot of organizations that are still hiring chief electricity officers. I think that better decision making, again, is the key more than AI itself. I think it's going to help more and more industries.
Scott Clendaniel: 01:07:55 I just don't think you're going to have VIKI from I, Robot making all the decisions for the planet. I think that's an overblown fear. It's going to impact more and more organizations, but we tend to swing on a pendulum from "No AI. It doesn't help anything," to "AI solves everything." The answer tends to be somewhere in the middle. Just be aware of that.
Kirill Eremenko: 01:08:25 Gotcha. Understood. One final question, this one from me. There are so many ways to structure your data science division; this is a data science management style question. One is to embed individual data scientists across different functions like sales, finance, operations, and so on. Another is to have a centralized data science team that serves all those functions. What is your preferred style and why?
Scott Clendaniel: 01:09:04 I'm going to steal this one from Harvard Business Review. That's to use a hub and spoke model. You have a central core of folks who help the rest of the organization work on data science projects. These are the folks who make sure everyone has the right tools and who help establish processes, standards, and so forth. That team is very small, and most data scientists sit in the individual groups they serve.
Scott Clendaniel: 01:09:31 Your hub supports the people in the spokes in the different departments and helps them achieve their goals. I think that is the best way to do it. It is so easy for folks in a centralized data science group to be out of touch with the needs of their clients; actually making them physically sit in that organizational structure helps solve a lot of those needs. That's the way I would do it.
Kirill Eremenko: 01:09:58 Wow. Fantastic. Thank you. Scott, it's been amazing. We've actually gone over time, but it was totally worth it. I loved these questions and your answers. Before I let you go, before we wrap up, I wanted to ask: do you have a recommendation or a piece of advice specifically for the advanced data scientists out there listening to this podcast, any parting thoughts?
Scott Clendaniel: 01:10:26 I do. That is that... I'll tell it through another story. The
first time I was ever invited to participate in an AI
conference, I went running into my co-worker's office. Her
name was Beth. I said, "Beth, it's fantastic. I've been
invited to speak at an artificial intelligence conference.
Isn't that great?" She folds her arms across her chest. She
leans back in her chair. She raises one eyebrow and says,
"Do you really feel qualified to speak at such a
conference?"
Scott Clendaniel: 01:10:57 I was like, "I did 10 seconds ago." There are a lot of folks in our industry who act like Beth. That's a bad idea. We need to be inclusive. Get down off your high horse. It's a technology. It's an area of expertise. We need to be inclusive, not exclusive. Try to be nice to people. Try to help them achieve their goals. Common manners, being polite, and listening to people are really important in any field. But if you're hoping for AI to have a big impact in your company, trying to prove how smart you are and how unsmart they are is a really bad idea.
Scott Clendaniel: 01:11:37 There are way too many of us who do that. Be inclusive. Incorporate as many people as you can. Be as helpful as you can, and stop taking the approach of, "I am some type of god of intellect because I know how to build a model." You can build a model, and so could a lot of other people, if only someone would take the time to show them how to do it. Be the person who helps bring more people into the fold, not the one explaining to everybody else why they're wrong.
Kirill Eremenko: 01:12:07 That's amazing advice. You actually walk the walk, not just talk the talk, right? That's the saying. You live by that yourself.
Scott Clendaniel: 01:12:17 Thank you.
Kirill Eremenko: 01:12:19 I look at your comments on LinkedIn, and you're always there supporting people, answering questions every time. Even in this thread, when people asked you a question, you didn't just answer and say thank you. You actually posted a little image of a written thank you, a different one every time. I can just imagine you have a whole library of these that you can use at any given time. It's really cool. Why do you do it? Why do you help the community so much?
Scott Clendaniel: 01:12:48 I think because I was treated so poorly by the experts in our field when I tried to break in. I like to joke with people that for the first half of my career, people told me I couldn't do this because I didn't have a PhD in statistics. For the second half of my career, everyone has told me I can't do it because I don't have a PhD in data science. I was like, "But some of my models seem to be working pretty well, I don't know." It's just a way of bringing more people in and being helpful, because people need encouragement.
Scott Clendaniel: 01:13:22 We've got enough people out there in the world telling
everyone else that they're wrong. I think a little bit of
kindness and support to other people goes a long way. I
think we'd be in better shape if we all just treated one another with a little more respect, a little more kindness, a little less roughness, and a little less intellectual aloofness.
Kirill Eremenko: 01:13:42 Awesome. Fantastic. Well, Scott, thank you.
Scott Clendaniel: 01:13:44 I don't want anybody else to have a three-year-old on the phone saying, "Did you pack my toys yet?" with no prospects of finding a new job.
Kirill Eremenko: 01:13:54 Thank you very much for sharing that. I think it's-
Scott Clendaniel: 01:13:57 Thank you. I really enjoyed this.
Kirill Eremenko: 01:14:00 Awesome. Me too. Tell us, how can people get in touch, follow you, connect with you?
Scott Clendaniel: 01:14:07 The best way is to follow me on LinkedIn at T.Scott Clendaniel. I can't accept all the invitation requests because I'm almost at the 30,000-connection limit; they won't allow more people in, but please follow me. If you have questions, send me a message. I can't answer everybody, but I do my best. I think I answered, gosh, over two dozen questions in the existing thread, and I will continue to try to do that.
Kirill Eremenko: 01:14:30 Fantastic. Thank you very much. You already gave your
book recommendation. Could you just remind us? I think
it was Seven Habits of Highly Effective People.
Scott Clendaniel: 01:14:38 Seven Habits of Highly Effective People by Stephen Covey,
who is no longer with us-
Kirill Eremenko: 01:14:44 Awesome. That's-
Scott Clendaniel: 01:14:44 ... but his legacy lives on, great advice in there.
Kirill Eremenko: 01:14:48 Gotcha. Wonderful. On that note, once again, thank you so much. We'll share all the links in the show notes, and everybody listening, please connect with Scott. This has been a great opportunity to have you on the podcast. Thank you for coming.
Scott Clendaniel: 01:15:07 Thank you. Take care.
Kirill Eremenko: 01:15:14 There we have it, everybody. I hope you enjoyed this conversation as much as I did. I learned a ton from this discussion. There were so many cool advanced things that I didn't know about before, and I was just blown away. Thank you so much, Scott, for coming on the show and sharing all these insights with us. Perhaps my favorite part was when we spoke about oversampling the minority class. I could feel Scott's confidence. It's quite a tricky technique to throw away a lot of your data, undersampling the majority class, in order to make sure that the positives and negatives are roughly equal.
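[Editor's note: The balancing technique described here, discarding rows from the over-represented class until the classes are roughly equal, is commonly called random undersampling of the majority class. A minimal sketch in Python with pandas follows; the data set, column names, and values are invented for illustration.]

```python
import pandas as pd

# Hypothetical imbalanced fraud data: is_fraud == 1 is the rare minority class.
df = pd.DataFrame({
    "amount":   [20, 35, 15, 900, 22, 18, 40, 1200],
    "is_fraud": [0, 0, 0, 1, 0, 0, 0, 1],
})

minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Randomly discard majority-class rows until both classes are the same size.
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle; positives and negatives are now roughly balanced.
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)
print(balanced["is_fraud"].value_counts())
```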
Kirill Eremenko: 01:15:48 It's a difficult decision to make, but from the confidence with which he spoke, it was clear that he's done this many times, and it obviously works for him. I also really liked the discussion about data science leadership: if you want to be a data science leader one day, start now, because you need soft skills as a data scientist, not just as a lead data scientist. There we go.
Kirill Eremenko: 01:16:12 As usual, you can get the show notes at
superdatascience.com/385. That's
superdatascience.com/385. There, you can get the
transcript for this episode, any materials we mentioned,
including a URL to Scott's LinkedIn. Make sure to
connect with him, or just look him up on LinkedIn. It's
T.Scott Clendaniel. He's always, always very helpful. Just
recently, he shared some amazing cheat sheets for
machine learning. Even that is worth checking out.
Kirill Eremenko: 01:16:45 I had a look at them. These were some cheat sheets shared by Stanford University. He shared them on his LinkedIn, and there are some really cool cheat sheets there, including ones on cross-validation. Check that out. As always, if you enjoyed this episode, share it with somebody, especially if you know an intermediate data scientist who's looking to become advanced, or an advanced data scientist who wants to further their skills in the space, maybe a colleague, a friend, a family member.
Kirill Eremenko: 01:17:17 Send them this episode; it's very easy to share. Send them the link superdatascience.com/385. On that note, my friends, thank you so much for being here today and for sharing this hour, or a bit more than an hour, with us. I hope to see you back here next time, where we will continue to deliver on the promise of amazing episodes with very interesting, incredible guests. Until then, happy analyzing.