SDS PODCAST
EPISODE 385:
ADVANCED DATA
TOPICS AND
PEOPLE-CENTERED
DATA SCIENCE
Kirill Eremenko: 00:00:00 This is episode number 385 with Lead Data Scientist,
Scott Clendaniel.
Kirill Eremenko: 00:00:12 Welcome to the SuperDataScience podcast. My name is
Kirill Eremenko, Data Science Coach and Lifestyle
Entrepreneur. Each week, we bring you inspiring people
and ideas to help you build your successful career in data
science. Thanks for being here today, and now let's make
the complex simple.
Kirill Eremenko: 00:00:44 Hello everybody, and welcome back to the
SuperDataScience podcast. Super excited to have you
back here on the show. Today, I've got a super amazing
treat for all of us. I just got off the call with Scott
Clendaniel. Scott is a lead data scientist at Franklin
Templeton. He has a
huge amount of experience in this space of data science
and machine learning, and he's always happy to give back
to the community. This podcast is going to be an
advanced podcast. It's specifically going to be useful to
you if you are an intermediate practitioner in data science
or an advanced practitioner in data science.
Kirill Eremenko: 00:01:18 You're interested in things like models, cross-validation,
oversampling and things like that. With that said, here
are some of the topics that you'll hear about today. You'll
hear about Scott's story and how he got into the space of
data science. We'll talk about fraud detection because it's
a big part of the financial services industry. We'll talk
about some specific examples of ways to detect fraud,
including Benford's law. We'll talk about oversampling
the minority class, the multiplicity of good models, and
the tools that Scott uses on a daily basis.
Kirill Eremenko: 00:01:51 We'll talk about data preparation techniques. Specifically,
we'll talk about target mean encoding and one-hot
encoding, what they are, which one is better, and why.
Then we'll talk about model drift, and we'll discuss why
models decay over time. Scott will give his
recommendations on how often he checks up on models.
We'll talk about population stability reports.
We'll talk about some real-world examples and cross-
validation. Then we'll cover off Scott's advice on some of
the softer skills like data science leadership, what it
means to manage data science teams, how to best
structure a data science team, the hub and spoke model
for that.
Kirill Eremenko: 00:02:28 We'll get Scott's ideas and visions for what is coming for
data science in the future. A very exciting podcast coming
up ahead, I can't wait for you to check it out. Without
further ado, I bring to you Scott Clendaniel, lead data
scientist at Franklin Templeton.
Kirill Eremenko: 00:02:51 Welcome back to the SuperDataScience podcast,
everybody. Super pumped to have you back here on the
show. Today, we've got a super special guest joining us,
Scott Clendaniel. Scott, welcome.
Scott Clendaniel: 00:03:02 Thank you so much. I'm really happy to be on.
Kirill Eremenko: 00:03:05 That's awesome. I forgot to ask you, where are you
located right now?
Scott Clendaniel: 00:03:09 I am actually in Havre de Grace, Maryland off the
Chesapeake Bay, about 45 minutes north of Washington,
DC.
Kirill Eremenko: 00:03:16 Havre de Grace, Maryland.
Scott Clendaniel: 00:03:20 Absolutely. [crosstalk 00:03:23].
Kirill Eremenko: 00:03:23 Havre de Grace is the name of the city.
Scott Clendaniel: 00:03:26 Yes. Port of mercy, I believe, is the loose translation.
Kirill Eremenko: 00:03:32 Very interesting. How did you end up there? Have you
been there for a long time?
Scott Clendaniel: 00:03:38 Actually, we just moved here. We had an opportunity
because my job allows me to work remote to be able to
change location. My wife as a children's librarian just got
a position up here near Cecil County. We just moved
here, and we're pleased as punch. It's really pretty.
Kirill Eremenko: 00:03:57 All right, tell us... Before that, where were you located
before that?
Scott Clendaniel: 00:04:03 I was actually born in Baltimore, Maryland, lived there for
a long period of time. Then I was in Delaware,
Pennsylvania, and a few years in Honolulu, Hawaii.
Kirill Eremenko: 00:04:15 Amazing. I've always wanted to go to Hawaii. What is
Honolulu like?
Scott Clendaniel: 00:04:20 Honolulu is really a fantastic city. I really loved living
there. Obviously, a lot of tourism, but the people there are
just so warm and inviting, really had a good time there, a
big fan of Hawaii.
Kirill Eremenko: 00:04:38 What island is Honolulu on?
Scott Clendaniel: 00:04:39 Oahu, which is the main island. About three quarters of
the population lives on that island. There's the total
[crosstalk 00:04:48].
Kirill Eremenko: 00:04:47 Wow. I heard there's a great restaurant on Maui, and it's
called Mama's Fish House, I think. Do you know that?
Scott Clendaniel: 00:04:56 No, I don't know that one. I haven't been there, but I
visited four of the eight islands while I lived there. Maui is
very pretty. Each island has its own personality, which is
fun.
Kirill Eremenko: 00:05:10 What are the differences in personality?
Scott Clendaniel: 00:05:10 Well, Kauai is the garden island, so it tends to be much
more laid back. It is probably the greenest of the islands.
That's great. The big island has all kinds of different
topography. It's probably the only place in the world
where you can go snow skiing and water skiing on the
same day.
Kirill Eremenko: 00:05:28 Wow.
Scott Clendaniel: 00:05:29 Each island has its own unique factors to it. Maui's a lot
of fun too. Oahu is where most people go. That's where
Waikiki is. That's probably the most popular of the group.
Kirill Eremenko: 00:05:41 Oh, fantastic. Very cool. The mountains are tall enough
for skiing?
Scott Clendaniel: 00:05:46 Only on the big island. When I say snow skiing, I'm not
talking about Aspen or Vail, Colorado. I mean, you can
forget about that, but you can at least tell people, "Hey, this
is fantastic. On the same day, I went water skiing and
snow skiing."
Kirill Eremenko: 00:06:02 Amazing. Okay. Got you. Well, Scott, it's a pleasure to
have you here. We've got a lot to go through. For those
who are listening and maybe don't know, I posted on
LinkedIn just, "Hey, Scott Clendaniel is coming to the
show in 24 hours, and post your specific asks for
advanced data science questions." In that 24 hours,
there's now 56 messages in there. Thank you very much
for taking the time to answer some of those people.
There's been a lot of cool discussions and [crosstalk
00:06:33].
Scott Clendaniel: 00:06:32 Absolutely.
Kirill Eremenko: 00:06:34 I really want to go through [crosstalk 00:06:36].
Scott Clendaniel: 00:06:35 Very few of them were from my bill collectors, which I
really appreciated.
Kirill Eremenko: 00:06:41 Gotcha. What I want to start with though, so we'll
definitely get to those, and there are some really cool
advanced machine learning questions, models and cross-
validation and things like that. Before we get there, give
us a bit of a background around yourself. Who is Scott
Clendaniel?
Scott Clendaniel: 00:07:02 Sure. Actually, I have a bit of an unusual background. My
undergraduate degree and my MBA are both in strategic
planning. They're not in either statistics or computer
science, which makes me relatively different from a lot of
other folks in the field. I was in financial services for a
long period of time,
and all the way up through vice president of consumer
lending at Bank of Hawaii. Unfortunately, I had a family
emergency where my ex-wife at that time decided to take
my son, and so I had to leave.
Scott Clendaniel: 00:07:33 She had indicated to my son who was only three years old
at that time that I had actually gone, because I was
packing his toys in Hawaii. I was like, "Why did you do
that?" Anyhow, so I got a phone call from my three-year-
old son one day. He said, "Daddy, are you done packing
my toys yet?" It was the worst thing ever. I was like,
"Gosh, I've got to figure out how to give up my career in
financial services and do something else."
Scott Clendaniel: 00:08:00 I thought to myself, "I wonder if anyone's interested in
this machine learning artificial intelligence stuff. I wonder
if I could do that." For the next 16 years, I became a
consultant. To all those folks out there who are trying to
break into data science, if I can get past that, you can get
past whatever you're facing currently. I encourage
everybody to give it a shot.
Kirill Eremenko: 00:08:23 Wow. Wow. What a story. You just packed your stuff,
gave up a vice president position at Bank of Hawaii, and
moved back to the mainland U.S. How did that go for you?
Scott Clendaniel: 00:08:36 That was rough [inaudible 00:08:38], but it opened up a
world of opportunity. It also gave me a whole new
perspective on what's important in life and what isn't, and
also gave me a lot of focus on persevering and problem
solving and how I could add value to others. It was a
tough thing to go through, but it provided a lot of
opportunities for me later on in life.
Kirill Eremenko: 00:09:03 Wow. Wow. Amazing. 16 years past. You were consulting
in this space. What happened next, and where are you
now?
Scott Clendaniel: 00:09:16 Absolutely. Morgan Stanley actually recruited me to be
their first full time data scientist in their Baltimore office.
That was great. Unfortunately, they had a situation with
some internal fraud, where... I can share this because it
was on the front page of the New York Times. Someone
walked out the front door with $11 million. They suddenly
changed my role to focus exclusively on internal fraud,
which wasn't really what I wanted to be doing at that
point in time.
Scott Clendaniel: 00:09:46 I wanted to stick with machine learning. I was recruited
away by Legg Mason, and had been there for two and a
half years. We just got purchased by Franklin Templeton,
so I've been trying to help build functions for those
organizations.
Kirill Eremenko: 00:10:02 Just to give a bit of background on Franklin Templeton,
this is one of the world's largest global investment firms.
Scott Clendaniel: 00:10:11 Yeah. Once the transaction is completed, which will
probably be about the time this podcast is released, it'll
be approximately $1.4 trillion in assets making us the
sixth largest in the world.
Kirill Eremenko: 00:10:23 Wow. That is crazy. As a lead data scientist at this
company, what is your role like? Do you actually look at
how to invest this money, or are you looking for fraud, or
is it like a broad scope of locations? What exactly do you
do?
Scott Clendaniel: 00:10:41 Most asset management firms have separate groups who
do the actual portfolios of investments, so I'm not
involved in that so much. I help work on other types of
business problems like optimizing sales, trying to help in
profiling customers and opportunities, occasionally some
small pieces of fraud detection, and actually trying to
educate the organization as a whole on best practices in
analytics, trying to make sure that we can bridge the
academic component of what's going on out in the world
versus the real world, and trying to bring people up to
speed. I do a lot of training for folks, helping them form
their own business plans and helping them build their
models. It's outreach, really.
Kirill Eremenko: 00:11:28 Gotcha, so quite a broad scope of applications, but not
specifically the investment management. Interesting. Very
interesting.
Kirill Eremenko: 00:11:40 I hope you're enjoying the podcast, and we'll get straight
back to it after this quick announcement. This
announcement is going to be a bit tough for me because it
is about my own book, so please excuse the shameless
plug. However, I do believe in it so much that I want to
get the word out there. This book is designed in a way to
get anybody and everybody up to speed with data science,
pretty much everything that is important that is needed
to get going.
Kirill Eremenko: 00:12:05 The unique proposition of this book is that it doesn't
require coding. There's a lot of books out there on data
science where you need to sit in front of the computer
and code in Python or R. This book, you simply take and
you can read it on your lap, in a car, in a plane, in your
backyard, on a couch. You can read anywhere. There is
no coding in the book. It focuses on intuition. If you've
taken our courses and you like those intuition tutorials
about how an algorithm works and why, rather than what
the code behind it is, then you're going to love this book.
Kirill Eremenko: 00:12:41 It's going to be a great way to solidify that knowledge. It
covers pretty much everything in a data science project
lifecycle from asking the right questions to data
preparation, to machine learning, to visualization and
finally presentation. Pretty much everything you need in
your career is covered. If that sounds exciting, check it
out. It's called Confident Data Skills available on Amazon.
It's the data science book with the purple cover, and
please enjoy.
Kirill Eremenko: 00:13:11 Well, on that note, I think let's dive into these questions
because there's quite a lot to go through, and I think
that'll take us-
Scott Clendaniel: 00:13:23 Let's go.
Kirill Eremenko: 00:13:23 All right, awesome. I've gone through the questions. This
was on LinkedIn. I've sorted them in order of... We're
going to start with the most advanced machine learning,
AI, deep learning stuff first.
Scott Clendaniel: 00:13:34 Fine. No pressure on me. That'll be great. Okay, go ahead.
Kirill Eremenko: 00:13:38 Here we go. Vighnesh, if I'm pronouncing that correctly,
Vighnesh asks, "Can you give us a brief about the real
world applications of data science in the investment
industry? How do you approach a particular problem in
this space?"
Scott Clendaniel: 00:13:58 Sure. I think one of the big components in any field is
actually making sure what the business problem is and
defining that first before you actually define your
modeling approach. In investment management, there are
a bunch of different problems where data science can be
applied, and also, financial services have been involved in
advanced analytics since at least the 1960s, so it's a great
field to be in. A couple of examples would be fraud
detection. How do you tell whether a particular
transaction is someone spoofing, whether they're the real
person or not?
Scott Clendaniel: 00:14:31 There's developing portfolios and deciding which assets
should be in there. You can look at time series forecasting
of how an individual investment is going to perform. One
of the problems I've been working on a lot recently is
sales optimization, so how does a financial
advisor look at the broad palette of customers and
potential customers, and figure out who should be
prioritized in terms of what their needs are and coming
up with a product recommendation on what would fit
their needs?
Kirill Eremenko: 00:15:05 Gotcha. It ties in well with what we just spoke about, that
there is a broad scope of applications and business
problems. Gotcha. What specific AI technologies have...
This is Matthew. Matthew asks, "What specific AI
technologies have changed the investment industry, and
which do you predict will shape the industry in the next
five years?"
Scott Clendaniel: 00:15:28 Sure. I think the development of additional algorithms to
be available to us has changed AI quite a bit. Deep
learning, perhaps not as much as others, but things like
XGBoost and algorithms that allow for ensembling have
really helped the industry quite a lot. Also, approaches in
terms of anomaly detection for fraud detection, they've
been huge contributors as well. Those are probably the
changes in AI that have impacted the most.
Scott Clendaniel: 00:16:00 Also, the growth of open source has made it very difficult
for organizations to say no. There was a time many, many
decades ago where people would say, "Oh, no, I'm sorry.
We can't do anything without a software package that
costs $90,000." Now, they can't say that anymore. I think
that actually had probably the biggest impact on the
growth of AI overall.
Kirill Eremenko: 00:16:23 Gotcha. Gotcha. In the next five years?
Scott Clendaniel: 00:16:27 Next five years, I think there are going to be huge
opportunities in terms of predicting credit performance
and also fraud detection. Those are extremely difficult
problems to solve, and the more advanced AI
technologies, I think, are going to continue to help in that
arena, especially fraud detection, because it keeps
changing. What appears to be fraud in one given quarter
may look very different the next quarter, because the
fraudsters are always adapting and changing their
approaches.
Scott Clendaniel: 00:16:58 So you need to have a technology that allows the models
to continue to grow over time. You can't just pick a point
in time and say, "Okay, we know what fraud is," because
it won't be the same next quarter.
Kirill Eremenko: 00:17:09 Gotcha. Before I forget, I wanted to say that Scott is
sharing his comments today on behalf of himself and not
on behalf of the organization that he works at. These are
just opinions at the end of the day, our opinions.
Scott Clendaniel: 00:17:27 It's all my fault. I want to make myself really clear.
Everything I say is my fault. No one else's.
Kirill Eremenko: 00:17:32 Thank you. Thank you, Scott. Speaking of fraud, do you
know what the size of this problem, globally or in the U.S.
is per annum?
Scott Clendaniel: 00:17:47 I don't have recent statistics on that, but it runs into the
several billions of dollars. The challenge is the fact that
you not only lose the profit on the given transaction that
would come in if it turns out to be fraudulent, but you
lose the entire dollar amount. In financial services, your
inventory is actually the dollars that you manage. If you
have a fraudulent transaction, you lose every bit of that
inventory along with any type of profit you would have
made.
Scott Clendaniel: 00:18:18 It runs into many, many billions of dollars, so it is a huge
issue. It's also really complicated because the fraud rate,
the percentage of transactions that are fraudulent, tends
to be very low, but its financial impact is ridiculously large.
It's a real class imbalance problem.
Kirill Eremenko: 00:18:36 I did a quick search. The global fraud detection and
prevention market size is valued at $17.3 billion. As you
said, lots and lots [crosstalk 00:18:51].
Scott Clendaniel: 00:18:51 That's just the market to stop the fraud. That's not the
product sales. You've got a real clear picture of how big of
an issue it is.
Kirill Eremenko: 00:18:58 That's a good point. Intuitively, I know... I've read about
one specific fraud algorithm that I could confidently
explain, and that's Benford's law. Are there any
algorithms that you can share with us?
Scott Clendaniel: 00:19:23 Sure. Actually, I'll make a recommendation of something
to be careful with. There has been a huge amount of
press about using anomaly detection for fraud detection.
That is very helpful. It does have some pretty severe
limitations though, in that a given fraudster is going to
try very hard to not look like an anomaly. In other words,
to some extent, the data is actually fighting against you.
The fraudster is trying to look as similar as they can to
the mean of any given transaction.
Scott Clendaniel: 00:19:57 They're actively fighting you to not look like an anomaly.
The problem is the false positive rate on anomaly
detection is enormous, and it's very difficult to fight
against. Just using anomaly detection, regardless of how
sophisticated the version that you're using is, tends to
have some severe limitations that you're going to come up
against, so just be aware it should be one tool in your
arsenal. It shouldn't be the be-all and end-all.
Kirill Eremenko: 00:20:28 Gotcha. Anomaly detection is one. That's fantastic. I'll
share Benford's law. Benford's law is more of an aggregate
tool. It doesn't look at individual transactions. It looks at
them as a whole. The way we were taught it at Deloitte is that if
you take a... I might be paraphrasing what they told us
back then. Again, I'm speaking from my opinion as well.
You take all the transactions on all of the dollar values on
a balance sheet of a company.
Kirill Eremenko: 00:20:59 Just take all of them. Mix them all up. Put them into a
bag, and then look at the distribution of the first... Is it
the first? No, of the first digit in all of these transactions.
What is the first digit of all of these amounts? What's the
first digit? How many number ones are there? How many
number twos are there? How many number threes are
there? The leading digit, and Benford's law says that the
distribution should be 30% ones, 17% twos, 12% threes,
9% fours.
Kirill Eremenko: 00:21:40 It drops off as you go further. It's an intuitive thing, and
that is something that is really hard to fake, right? If
you're faking a balance sheet and you're making up
numbers, you don't have that distribution in mind. You
might make your numbers look really believable, but
overall, when you take the distribution of the first digit
across all the numbers, it's not going to follow Benford's
law. That's how a qualified expert can tell that, "Hey,
there's something going on here."
Scott Clendaniel: 00:22:10 In forensic accounting, that becomes really important.
That is definitely one of the warning signs. There are a
couple of interesting things about Benford's law in
addition to what you said. I think you gave a great
explanation of it. Benford's law seems to apply even if you
change the base unit. If it's not a decimal system, if you
used a base eight system, you will tend to see very similar
patterns. The percentage of digits will change, but I just
find it amazing that even if you change the base number
that you're using, it tends to show up.
Scott Clendaniel: 00:22:45 It's great for things like reviewing the balance sheet, just
as you mentioned. What becomes tricky is the fact that
when you're dealing with consumer transactions, for
example, it doesn't apply as much for the trailing digits.
Everybody wants to charge
$1.99 or $1.95. They don't tend to charge $1.03. You
actually have the opposite problem there, so you have to
be careful. Benford's law is extremely powerful when
you're looking at a whole collection of numbers given in
one instance such as the balance sheet. It becomes much
harder when you're trying to look at individual
transactions.
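
For listeners who want to try this, here is a minimal Python sketch of the first-digit check described above (illustrative only; the sample amounts are made up):

```python
import math
from collections import Counter

def benford_expected():
    # Benford's law: P(d) = log10(1 + 1/d), giving roughly 30% ones,
    # 17.6% twos, 12.5% threes, down to about 4.6% nines.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_frequencies(amounts):
    # Scientific notation exposes the leading digit: 1203.5 -> "1.2035e+03".
    digits = [int(f"{abs(a):e}"[0]) for a in amounts if a != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

expected = benford_expected()
observed = first_digit_frequencies([1203.5, 45.10, 1.99, 310.00, 87.2, 1500.0])
for d in range(1, 10):
    print(f"digit {d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

Large, systematic gaps between the observed and expected frequencies across a whole collection of figures, such as a balance sheet, are the kind of red flag a forensic accountant looks for.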
Kirill Eremenko: 00:23:24 Awesome. Awesome. Thank you. Well, there we go. That's
two techniques in fraud detection. Speaking of fraud
detection, we have a question from, again, Vighnesh who
says, "While working on fraud detection problems, most of
the times, we come across imbalanced datasets. Can you
please put a light on how to overcome such problems or
how to resolve it?" Maybe to start off, what does it mean
that the data set is imbalanced?
Scott Clendaniel: 00:23:52 Sure. This is typically referred to as class imbalance. In
other words, if I'm trying to do a classification problem,
and let's say that I'm trying to do fraud versus non-fraud.
If you look at the distribution of how many transactions
are fraud and how many transactions are non-fraud, your
fraud rate tends to be relatively small, down into the tenths
or hundredths of a percent. The problem is if you try and
compare the fraud transactions versus the non-fraud
transactions, a lot of algorithms are going to choke on
that unless you adjust the balance of the dataset so that
you can have more of a 50/50 ratio between fraud and
non-fraud.
Scott Clendaniel: 00:24:33 Otherwise, what the model's going to do is go, "Let's see.
I've got 1,000 transactions. 990 are not fraud, and 10 are
fraud. I've got it. They're all not fraud." It's going to be
right 99% of the time. It's just going to mess everything
up, so you need to make [crosstalk 00:24:52].
Kirill Eremenko: 00:24:51 It's going to have fantastic metrics.
Scott Clendaniel: 00:24:54 Yes. That's amazing. I'm done. I just said there is no
fraud, and I'm going to go home and have lunch. That
doesn't work out too well in the real world. What you do is
you tend to over-sample what's called the minority class,
so in this case, the fraud transactions. I might take every
fraud transaction I can get into my training set. I'm going
to compare it against an approximately
equal number of non-fraud. That means that the model
can't just arbitrarily say, "Okay, everything is not fraud."
Scott Clendaniel: 00:25:27 That's the technique that I use the most. There are other
techniques that you can use, including doing all sorts of
complicated things with synthetic data or SMOTE
techniques or those types of things, but oversampling
the minority class, lots and lots of the fraud and much
fewer of the non-fraud, has been the technique that's
worked the best for me. It's also very simple.
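
As a concrete illustration of the approach Scott describes, here is a minimal pandas sketch (the DataFrame and its `is_fraud` column are assumed for the example):

```python
import pandas as pd

def balance_classes(df, label="is_fraud", seed=42):
    # Keep every fraud record (the minority class) and pair it with an
    # equally sized random sample of non-fraud records, approximating
    # the 50/50 ratio discussed above.
    fraud = df[df[label] == 1]
    non_fraud = df[df[label] == 0].sample(n=len(fraud), random_state=seed)
    # Shuffle so the model doesn't see all of one class first.
    return pd.concat([fraud, non_fraud]).sample(frac=1.0, random_state=seed)
```

Drawing the non-fraud records at random matters: it keeps the reduced sample representative of the majority class even though not every record is used.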
Kirill Eremenko: 00:25:50 What's the drawback? What's the potential danger of
using this technique?
Scott Clendaniel: 00:25:57 Part of the problem is you may not have enough fraud
cases to use in the first place. You may have such a small
number of records that you may not be able to use that in
its purest form, but you're definitely going to want to
move your sampling as close to a 50/50 ratio as you can.
Kirill Eremenko: 00:26:17 Gotcha. As long as you select the other ones, the non-
fraud ones, at random, you shouldn't have any bias in
your model, even though you didn't use all of the available
records.
Scott Clendaniel: 00:26:29 Correct. Randomization becomes really important. Some
of the experts in the group who have a classic statistical
background can talk a lot more about random versus
non-random and how there is no true random, but there
are all sorts of techniques you can use to make sure. For
most of the work that I've done, I just use random
functions in Python or SQL. That has worked pretty well for
me, and I haven't run up against [inaudible 00:26:55] situations.
Kirill Eremenko: 00:26:57 Awesome. Speaking of Python and SQL, what kind of
tools do you use in your day to day?
Scott Clendaniel: 00:27:06 Python, Spark are the most common. Because I am older
than dirt, I started doing this all the way back in 1986,
God forbid. I actually grew up using GUI tools like
SPSS, which is now owned by IBM, and Salford Systems,
which is now owned by Minitab, lots of other GUI tools.
There's actually a free one, especially if you're just starting
in the field and you don't have a developer background,
called Orange Data Mining, which is included in the
Anaconda distribution. Those are a couple of places that
you can start, but eventually, it tends to turn into a lot more
Python as a starting point, and probably Spark if I'm
using a distributed system to build.
Kirill Eremenko: 00:27:55 Gotcha. You said Orange Data Mining.
Scott Clendaniel: 00:27:58 Orange Data Mining, yeah. It is not the prettiest program
you will ever encounter. The interface looks at least about
20 years out of date. Don't be fooled by how the interface
looks, because there is a lot of power underneath it. A lot
of people get turned off. They're like, "Oh, this doesn't
look cool. This doesn't look like something that was built
for Apple. I'm not going to use it." I would say that's a
mistake. They've actually done a really good job of
creating a GUI interface to sit on top, and underneath,
it's primarily Python. [crosstalk
00:28:29].
Kirill Eremenko: 00:28:29 Is this primarily for building models or fraud detection?
Scott Clendaniel: 00:28:36 Any type of model.
Kirill Eremenko: 00:28:38 Gotcha. Awesome. Awesome.
Scott Clendaniel: 00:28:40 I've been really happy with it. People laugh at me when I
show the screenshots, but it actually works pretty well.
Kirill Eremenko: 00:28:46 Gotcha. What kind of models have you noticed work the
best with fraud detection? We've got a huge range. K-
means clustering. We've got Naïve Bayes. We've got
logistic regression. We've got XGBoost, and so on. What
would you say are your go-to models? When you have a
fraud-detection problem, what's your first, second, third
choice?
Scott Clendaniel: 00:29:12 First of all, I'm going to throw out an old theory that I
hope people look into, which is essentially called the
multiplicity of good models, which means if you have the
right data and you've prepared it the right way, all sorts
of algorithms are likely to give you a very similar, positive
result. If you haven't done the data prep correctly, you
will start to see wide variances. That being said,
ensembling techniques of any type would be my favorite,
probably the most common being XGBoost.
Scott Clendaniel: 00:29:44 Also, as a data prep technique, I recommend target
mean encoding. That can be extremely helpful. In terms
of the final technique, I always recommend ensembling
because each algorithm has its own strengths and
weaknesses. If you'll ensemble a group of different models
together, you're likely to end up with a better result. The
final model is usually logistic regression based on the
inputs of the XGBoost and other types of techniques in
the family.
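
A minimal sketch of that ensembling pattern, using scikit-learn's stacking classifier with an XGBoost base learner (xgboost is a separate install, and the choice of base models here is illustrative, not Scott's exact setup):

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

# Several base learners feed their predictions into a final logistic
# regression, echoing the setup described above.
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=200)),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # the stacker is trained on out-of-fold predictions to limit leakage
)
# Usage: stack.fit(X_train, y_train); stack.predict_proba(X_test)
```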
Kirill Eremenko: 00:30:15 Wow. Thank you, very, very detailed and advanced. This
data prep technique, I think you mentioned, target mean
encoding.
Scott Clendaniel: 00:30:21 I got really passionate about this stuff, so I may bury you
in detail, and I apologize.
Kirill Eremenko: 00:30:24 That's okay. That's okay. That's okay. I want this to be an
advanced discussion, well, advanced learning for me.
Scott Clendaniel: 00:30:33 That's fine. That's fine.
Kirill Eremenko: 00:30:34 I wanted to ask you about this data prep technique,
target mean encoding. I don't know it. I haven't heard of it
before. Could you tell us a bit about it, if you can just
explain what does it do and how [crosstalk 00:30:44]?
Scott Clendaniel: 00:30:44 Sure. Let me give you a really simple example. Let's say
that we are an auto insurance company, and we want to
test the old myth that red cars always cause
more problems than others. I've got a red car, a blue car,
a yellow car, a white car and a black car. I've got five
different car colors. I want to encode into my model this
categorical variable, so rather than use the original values
of the colors in the variable, you use a transformed
variable, so a new variable. Instead of recording the car
color there, you actually put the claim rate for each color.
Scott Clendaniel: 00:31:29 If a blue car has a 0.1% claim rate, you put 0.1 any
time it's blue. If it's red, you'll use 0.2% any time
red shows up. In other words, you convert the original
categorical input into its actual target mean. What's the
mean rate that this issue is going to come up with? You
use the transformed variable as opposed to the original
variable. It tends to work better than one-hot encoding,
which you would usually use. Also, you can use it in
pretty much any algorithm, including algorithms that
only take numerical inputs like the original XGBoost.
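
To make that concrete, here is a tiny pandas sketch of target mean encoding for the car-color example (the numbers are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "white", "blue", "red", "white"],
    "claim": [1, 0, 0, 0, 0, 1, 0],
})

# Replace each category with the mean of the target for that category.
color_means = df.groupby("color")["claim"].mean()
df["color_encoded"] = df["color"].map(color_means)
# red -> 0.667, blue -> 0.0, white -> 0.0

# Caution: in practice, compute the means on training data only (ideally
# out-of-fold), or the encoding leaks the target into your features.
```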
Kirill Eremenko: 00:32:10 Wow. Wow. That's awesome. I hadn't heard of that one.
I've heard of one-hot encoding, but still, do you mind
refreshing my memory on that, please?
Scott Clendaniel: 00:32:18 Sure. One-hot encoding, so for each of our five car colors,
we're going to take the original variable, and we're going
to move it out of the dataset. We're going to create five
new variables. One variable is going to be, "Was the car
color white, zero, one?" If it's white, you can put one. If it
wasn't white, put zero. Then you have a second column
that says, "Was the car color blue?" The third variable
says, "Was the car color red?"
Scott Clendaniel: 00:32:42 You can run into issues with that, in that you can end
up with a sea of categories, and you can overload your
model with way too many inputs.
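
Here is the equivalent one-hot sketch in pandas (illustrative; `drop_first=True` keeps one fewer column than the number of categories, which also addresses the dummy variable trap raised next):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "white", "red", "black"]})

# Each remaining category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(one_hot.head())
```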
Kirill Eremenko: 00:32:55 Gotcha. Gotcha. You also got to be careful of the dummy
variable trap there, right?
Scott Clendaniel: 00:33:00 Absolutely.
Kirill Eremenko: 00:33:00 You have to have one less than the original number of
categories. Gotcha. Very interesting. Thank you. That's
very exciting. Let's move on. Now, let's talk a bit about
models. Sonam, Sonam, I hope I'm pronouncing it right,
asks, "What are the parameters to look out for and or test
to perform that indicate model drift while monitoring a
production AI or machine learning model?" To start off,
what is model drift?
Scott Clendaniel: 00:33:33 Model drift is a silly name, but basically, it means that
the performance is going to drift over time. All models
tend to decay over time, and different models decay at
different rates. There are a couple of things that I'd
suggest with this. Number one, assume that your model
is going to decay over time. Don't guess it's going to
decay. Just assume it's going to decay, and set up a
schedule of when you're going to retrain any model you
have.
Scott Clendaniel: 00:33:59 That's the first component. Plan for obsolescence before
you put the first model in. That's very important. The
second thing is to use what's called a population stability
report, which sounds like some bizarre sociology
experiment, but essentially, what it's doing is to say, "My
data that I started off with in one period, how similar does
it look to data that I'm looking at right now? Are those
two populations similar or not?"
Scott Clendaniel: 00:34:27 When the population stability report comes in and says,
"Hey, wait a minute, your data starts to look different,"
you definitely need to retrain your model, and that the
world that the model is trying to represent has changed.
Therefore, the accuracy of its predictions has changed.
Just assume it's going to happen. I get irritated with folks
who just want to go out there and say, "I have created the
world's greatest model, but it's based on data from seven
years ago. And I don't know why it's performing poorly."
Scott Clendaniel: 00:34:54 Come on, seriously? When you put in your model, test it
constantly, and use that population stability report to
keep your eye open to see whether the world's changed.
Kirill Eremenko: 00:35:04 Gotcha. A couple of questions here. First one will be,
"Why..." This might be a naive question, but I'm just
tempted to ask, "Why do models decay over time? Why do
they never get better? Why is it that way?"
Scott Clendaniel: 00:35:21 Trick question. They decay over time because they
represent the world as they knew it, and the world
changes. The model does its absolute best to represent
the world as it saw it in your training set. When the world
changes, the type of data that shows up in your training
set may drift. Let's pick an easy example. Inflation, prices
tend to rise over time for lots and lots of things. If your
original model was based on prices from three years ago,
the prices now look different, so the model needs to adapt
to reflect that change to come up with representation of
what happens today.
Scott Clendaniel: 00:36:02 That's why all models tend to drift. However, that isn't
necessarily a bad thing in that you may have learned
more information over time. You may have a larger data
set to look at. You may have found new variables that you
didn't see before. You may have new algorithms you want
to test out. Actually in the end, you can end up with a
model that is more accurate than the last model was at
its peak. Over time you actually can get better. That's one
of the things that's exciting to me.
Kirill Eremenko: 00:36:32 Once you update it, of course, right? If you leave it alone,
it's not going to get better.
Scott Clendaniel: 00:36:37 No. I wish.
Kirill Eremenko: 00:36:39 Gotcha. In your answers on LinkedIn, it was really
inspiring to see that you went through and answered
everybody. You do a huge service for data science.
Scott Clendaniel: 00:36:52 Tried. If I missed anybody, send me a message.
Kirill Eremenko: 00:36:55 Awesome. You mentioned that six months is your magic
number to look at. Why is that?
Scott Clendaniel: 00:37:05 It is completely arbitrary. I'll tell you why. If you try and
make it annual, it tends not to happen. In the real world,
organizations are like, "We don't have time. We did it last
year," or whatever. If you make it six months, you keep it
top of mind with everybody. Six months is the outer limit,
and then if the population stability report says, "Wait a
minute, the world looks different," or the performance of
the model tends to fall off... You've got a fraud model, and
your fraud rate keeps inching up.
Scott Clendaniel: 00:37:36 With either of those two events, you look at it in a shorter
period of time. But if you actually build that into the calendar on
the front end, and also explain to your stakeholders,
"Model drift is a thing, and you need..." Modeling is a
process. It's a process of learning, and it's adapting to
change as information changes. Just bake that into the
schedule from the beginning, and you're going to be in a
lot better shape.
Kirill Eremenko: 00:38:00 Been there. I was in an organization once where they
didn't look at the model. Some consultants delivered a
segmentation model. They did... oh, no, a prediction model,
prediction in terms of who will churn, who won't churn.
Nobody looked at it for 18 months. When we had to look at
it...
Scott Clendaniel: 00:38:21 Ouch.
Kirill Eremenko: 00:38:22 Its accuracy dropped from 78% or something like that
down to 49, so it was better to flip a coin.
Scott Clendaniel: 00:38:32 Well, let me throw out a made-up word for everyone.
That's nonstop optimization. Instead of optimization, if
you're constantly nonstop trying to optimize, you call it
nonstop optimization. Senior management loves phrases.
I'm kidding. But if you take on that theory that I am
always going to be improving my model, that it's an
organic process, that it's something that you put in as a
regular business process as opposed to a snapshot,
one-time event that's going to fix all our problems, I think
you'll be in better shape.
Kirill Eremenko: 00:39:06 Great. The population stability report, what do you look
at there? Do you look at means, distributions? What are
the [crosstalk 00:39:13] or maybe-
Scott Clendaniel: 00:39:14 It'll actually give you an indicator on a scale of zero to one
of how much things have changed.
Kirill Eremenko: 00:39:21 Is it like a library in Python?
Scott Clendaniel: 00:39:24 Yeah. For example, if incomes on the original data set
were $62,000, and now it's $140,000, something's wrong.
It also helps you to figure out if your data stream has
been corrupted. In other words, let's say the model works
fine when it has accurate data, but somehow, something
has happened, and now the data isn't as good as it was
before, you can then go in and say, "Okay, wait a minute.
Something's wrong. The population looks different from
what it was before."
Scott Clendaniel: 00:39:54 Maybe we've got a problem with the database. Maybe
we've got a problem in how the data was collected. Maybe
we've got a metric versus English units issue that
suddenly popped up.
Kirill Eremenko: 00:40:06 Gotcha. Can anybody go and download this population
stability report, or does everybody need to build their own
version of it?
Scott Clendaniel: 00:40:14 Yes. The formulas are out there. I can't recite them off the
top of my head, but if you just type "population stability
report" into Google, it can give you a walkthrough of that.
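
One common way to compute such a report is the population stability index (PSI). A minimal sketch, assuming two numeric arrays holding the baseline and current values of a feature:

```python
import numpy as np

def psi(baseline, current, n_bins=10):
    # Bin edges come from the baseline (training-time) distribution.
    edges = np.unique(np.percentile(baseline, np.linspace(0, 100, n_bins + 1)))
    # Clip current values so outliers still land in the end bins.
    current = np.clip(current, edges[0], edges[-1])
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(current, bins=edges)[0] / len(current)
    expected = np.clip(expected, 1e-6, None)  # avoid log(0) on empty bins
    actual = np.clip(actual, 1e-6, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

A frequently quoted rule of thumb: below 0.1 is stable, 0.1 to 0.25 is worth watching, and above 0.25 usually means the population has shifted enough to retrain.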
Kirill Eremenko: 00:40:22 Awesome. Fantastic. Speaking of data collection, there
was a question from Santosh: "How do we check stability
and consistency in the process before using the data it
generated for model building?" I understood there was a
bit of a mix-up: he meant one thing, and you first answered
another question, and then answered the second one. Let's
start with the first one. The first one was like, "How do we
check for stability and consistency in terms of the data
collection process?"
Scott Clendaniel: 00:40:52 Sure. Usually, this is done further upstream, before
we get the data. This is usually done by the folks who
are doing your data ingestion or your ETL process on the
data upfront. The point he's making is extremely
valuable. It's as simple as garbage in garbage out. If the
data you're putting into your model has flaws in it, your
model isn't going to work. One of the things is actually sit
down with the people who are the stewards of that data.
Scott Clendaniel: 00:41:20 How is this collected? How often? How long have we been
keeping track of this? This is also one of the reasons why
data visualization is so important. See if the data makes
any sense before you start loading it into your model. His
point, I think, was data scientists are so excited. They
want to have a model. They want to have results. They
want to use their area of expertise. They want to pull the
algorithm out of the quiver and start shooting.
Scott Clendaniel: 00:41:47 That is a problem if you haven't checked the data
consistency upfront. I will give you a real-world example
from a client who shall not be mentioned, where in the
original data stream, there was some type of corruption
when data was migrated from one database to the
next. If it was a dollar amount, they literally
physically typed the dollar sign in the value. If there was
a comma, they type the comma. If there was a decimal,
they type the decimal.
Scott Clendaniel: 00:42:16 At other points in time, this wasn't true, so it was just the
raw number. This is why you really need to understand
your data. To his point, you need to make sure that data
seems to make some type of sense. Sit down with the
steward of that data to make sure that you understand
what you're dealing with before you get too far down the
pipe.
Kirill Eremenko: 00:42:34 Awesome. Gotcha. Then you also talked about, in your
answers, something that intrigued me. You said that if for
some reason the data is corrupt, then cross-validation of
the model should also fail. Basically, as I understood, we
could probably use cross-validation as an indication of
whether there are problems [crosstalk 00:42:58].
Scott Clendaniel: 00:42:57 Absolutely.
Kirill Eremenko: 00:42:59 Tell us a bit more about that.
Scott Clendaniel: 00:43:01 The great thing about models and cross-validation is it's
virtually impossible to come up with a great cross-
validated model based on bad data, because it just won't
work.
Kirill Eremenko: 00:43:14 What is cross-validation?
Scott Clendaniel: 00:43:15 When you're doing cross-validation, you're taking the
original data set, and you're dividing it into what they call
folds. Let's say we're going to take your dataset. We're
going to break it into five folds. You build the first model
on four of the folds, and you leave out the fifth to test on.
On the second one, you may use folds two through five,
and you test it on the first. You want to make sure that those
results look very similar.
Scott Clendaniel: 00:43:38 If they don't, you've got a problem in the model, so you
either average the results of the folds, or what I tend to do
is definitely do that first part, but also go back and say,
"What is weird about that one fold that doesn't seem to be
working very well? Why is the performance off for this
version versus that version?"
Kirill Eremenko: 00:43:56 Gotcha. Is the data in the folds randomized? Before you
select the folds, you randomize them.
Scott Clendaniel: 00:44:03 Absolutely.
Kirill Eremenko: 00:44:04 Gotcha. Just to recap, let's say we have, I don't know,
500,000 records. Then you break it down into five groups
of 100,000 each. In the first version, you train
the model on the first four groups of the data, and then
the fifth one, you test it. In the second version of the
model, you train it on, say, the first, second, third, and
fifth groups. Then you test it on the fourth. Then you
train it on the first, second, fourth, and fifth, and you test
it on the third.
Kirill Eremenko: 00:44:38 You're always shifting this window. Ideally, you should be
getting the same results throughout, similar results.
Scott Clendaniel: 00:44:48 They should be really similar. Also, your final model
should take the average of each prediction.
Kirill Eremenko: 00:44:54 Gotcha.
Scott Clendaniel: 00:44:56 There are some folks who just use cross-validation for
testing hyperparameters and things like that. If
production allows me to be able to use all five models and
take the average of the scores, that's what I like to do.
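
A minimal scikit-learn sketch of the five-fold scheme just described, assuming a feature matrix X and binary labels y as numpy arrays (the model choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def five_fold_aucs(X, y, seed=42):
    # shuffle=True randomizes the records before the folds are cut,
    # and stratification keeps the fraud rate similar in every fold.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return scores  # inspect the average and how much the folds differ

# Usage: aucs = five_fold_aucs(X, y); print(np.mean(aucs), np.std(aucs))
```

If one fold's score stands apart from the others, that is the cue Scott mentions to go back and ask what is weird about that slice of the data.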
Kirill Eremenko: 00:45:15 If you have corrupted data, or errors in the data, and you
randomize the data before you make the five folds,
wouldn't the corrupted records distribute equally across
the five folds, and therefore the model would still perform
identically, but still there would be-
Scott Clendaniel: 00:45:37 Well, it would be identically terrible performance. You're
not just looking to see if it differs among the five. If my
area under the curve is 51%, I've got a problem. I need to
be really aware of that. To your point, yes, they will look
similar, and they will look similarly awful.
Kirill Eremenko: 00:45:59 Gotcha. Gotcha. Could you do the cross-validation
without randomizing first? Then you would more likely
have one of those folds that is definitely
underperforming.
Scott Clendaniel: 00:46:12 No, I would not recommend that, because you lose...
The real advantage of the randomization is the fact that it
eliminates or greatly reduces the chance that you're just
looking at a few records that are off, or that you've got
outliers or that type of thing.
Kirill Eremenko: 00:46:29 Gotcha. Gotcha. Awesome. Thank you. Next one was...
This is an interesting question. This one is more about
complexity of machine learning models. What is your
experience... This is a question from Desmond. "What is
your experience with the complexity of machine learning
models and alpha, or outperformance? Do the most
complex models, XGBoost, neural nets, yield the highest
alpha, or are there other factors that yield higher alpha?
For example, type of data, feature engineering, et cetera."
First of all, what is alpha?
Scott Clendaniel: 00:47:14 Well, I'll tell you what. I am going to do what all good
interviewees do. When they don't have the pure
understanding of the answer, they change the question.
I'm going to treat this as a question on overfitting as
opposed to the pure definition of alpha. I'm going to
regard overperformance as a function of overfitting the
model, which means that it's basically memorizing the
data. As I said, I am older than dirt. When I went to
school and studied math, they used to do this weird thing
where they would give you the answers to the odd-
numbered questions in the back of the book, but they
wouldn't tell you the answers for the even-numbered
questions.
Scott Clendaniel: 00:47:54 What would happen is if you just tried to memorize the
answers, you'd only be 50% right. Models tend to be very
greedy. If they get the chance to memorize the answers,
they will do it. What you want to do is to make sure that
it's very difficult for the model to actually do that.
Otherwise, what it's learned is the answers to the
specific records that you gave it, as opposed to
identifying the true patterns that can be applied to other
folks. If I went and I got a suit, and it was completely
custom fit, but for somebody else, it's not going to fit me
very well.
Scott Clendaniel: 00:48:32 It's overfitted to the wrong person. I want to make sure
that my model generalizes well. I want to make sure that
my model applies to all kinds of different people.
Overfitting is a big problem. The more complex the model,
such as deep learning, if I've got a 600-layer deep learning
neural network, and I've got 10,000 records, I've got a
problem, because in many cases, it's going to try and
memorize the data itself as opposed to learn the patterns.
The algorithm can be a component.
Scott Clendaniel: 00:49:03 You can also have data leakage issues, where it's
memorizing the answer because it's actually included in
the original data set. So in answer to the question:
algorithms can be a problem, the data can be a problem.
There are all kinds of things that can lead you down that
path. The way to get around that is to use very robust
validation methods, including cross-validation, to try and
eliminate the possibility of models actually memorizing
the answers in the back of the book.
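
One simple way to catch that memorization, sketched here with scikit-learn on synthetic data (the deliberately unconstrained tree is just to make the effect visible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# An unlimited-depth tree is free to memorize its training data.
model = DecisionTreeClassifier(max_depth=None, random_state=42)
model.fit(X, y)

train_acc = model.score(X, y)  # often near 1.0: the "answers in the back of the book"
cv_acc = cross_val_score(model, X, y, cv=5).mean()  # the honest, held-out estimate

# A large gap between train_acc and cv_acc signals overfitting.
print(f"train accuracy: {train_acc:.3f}, cross-validated accuracy: {cv_acc:.3f}")
```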
Kirill Eremenko: 00:49:35 Wow. Wow. That's fantastic. I just got transfixed on your
answer. [crosstalk 00:49:43].
Scott Clendaniel: 00:49:43 Oh, sorry.
Kirill Eremenko: 00:49:44 That's awesome. Thank you. Definitely an important point
to look at. At what stage would you say people should
keep that in mind when they're building a model?
Scott Clendaniel: 00:50:01 From the beginning. As data scientists, we tend to be
really tempted to jump in and build the model. I'm a
model builder, so I want to build a model. That
might not be the first thing that you do. You really have
to have a solid design in terms of what your model
building process is going to be. Coming up with the
validation strategy should be one of the first things you
do because you've got to segment your data out into
what's going to be test, what's going to be training, what's
going to be validation or cross validation before you start
the modeling process, or you've already contaminated the
experiment, so to speak.
Scott Clendaniel: 00:50:39 In your process design, before you sit down, that should
definitely be a component that you look at.
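
A minimal sketch of carving the data up before any modeling begins (split sizes and the synthetic data are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Set the test set aside first, so nothing you do later can contaminate it.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Then split what remains into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)
```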
Kirill Eremenko: 00:50:47 Gotcha. How much time should be spent on designing the
process versus implementing the model?
Scott Clendaniel: 00:50:54 Well, if you come up with a standardized process in terms
of how you select variables and your randomization, you
can actually bake it into the process. Once you do it once,
you shouldn't have to repeat it a whole bunch of times,
but you should definitely be very robust on your first
model and see what components you can reuse.
Randomization should be something you should be able
to do in every model in a specific script. You shouldn't
have to reinvent the wheel every time.
Kirill Eremenko: 00:51:21 Gotcha. The follow-up question from Santosh was, he
said, "Give an example." Let's say I want to build a
forecasting model for daily pizza sales, and say I have
data for the past year.
Scott Clendaniel: 00:51:36 After my own heart, pizza sales.
Kirill Eremenko: 00:51:40 My question is, unless the processes that drive pizza
sales are consistent, we can't rely on the data. For
example, there was a change in employees every month,
a change in tools being used and so on. On the other
hand, if the pizza store started just a few weeks ago, it
might not have reached this stage because... The main
thing is, how do we know? You have data for a whole
year, but then there are changes in the process,
employees or how we do sales and so on. Can we still
use that for modeling?
Scott Clendaniel: 00:52:15 Sure. What I'll say to that is you're absolutely right, very
important point, but it is very rare that a modeler is ever
going to have a perfect data set to start with. The trick is,
"What is my decision-making process now, and can I
improve it with the model, and if so, by how much?" You
never get to perfect. You can try to get to a perfect
representation of the system for your pizza prediction, but
every organization has employee turnover. Every
organization has some of those elements in play.
Scott Clendaniel: 00:52:50 That's why you also have to be very careful with your
validation strategy to make sure that your model holds up
on data it's never seen before. That's why I keep banging
that drum. The question is, "Is there something that I can
learn from the data that I have right now using standard
validation procedures?" If I can, and I increase the
performance of my decisions, if I make 10% better
decisions, does that help my business? If it does, remember
the Teddy Roosevelt quote: "Do what you can, with what
you have, where you are."
Scott Clendaniel: 00:53:26 Make sure that your client, whoever is going to be using
your model, understands. This is what I think I know.
This is what I don't think I know. This is how much I
think I can improve performance with the model over
where you are right now. Then you work with a client to
say, "Is it worth the effort? Is the juice worth the squeeze,
so to speak?" That's why you need to work with your
client as opposed to just turning this into some academic
exercise off in an ivory tower.
Kirill Eremenko: 00:53:52 Nice. Fantastic. Scott, thank you. Those were all the
advanced questions. We're moving on to the world-
Scott Clendaniel: 00:54:02 Okay.
Kirill Eremenko: 00:54:03 You did well. You did really well. We're moving on to more
soft skills and predictions and forecasts for the future. A
question from Muhammad, "What is the difference
between a data scientist and leading a data science
division?" Basically, what is the difference in skills
required to be a data scientist and to lead a data science
division?
Scott Clendaniel: 00:54:33 To lead a data science division, I think you need the skills
of a data scientist plus a couple of other things. One is
strategy: making sure that you're setting the appropriate
goals across all modeling projects, not just your individual
modeling projects. The other is management: you need to
be able to work with people to coach them to get the
absolute best performance that they can achieve, not just
what you can do best and what you're working on right
now.
Kirill Eremenko: 00:55:01 What about the people skills required for a data scientist,
such as communication and presentation? Are they
different, and how are they different from the
management people skills required for a lead data
scientist?
Scott Clendaniel: 00:55:15 I don't think that they are different. I think the problem is
people skip over them altogether. The quality of
management in general in most organizations can be
somewhat appalling, not just for people who manage data
scientists, but for people who manage any type of group. I
think it's a real chink in the armor for all types of
organizations. I think the difference would be that you
may communicate to data scientists in their own
language.
Scott Clendaniel: 00:55:43 You must be able to speak their language and be able to
establish their trust and be able to work with them to get
them to their highest performance. If you were dealing
with an accounting team, you need to be able to speak
the accounting language to be able to help them reach
their highest performance. It's not so much different
skills. It's applying the right skills to the type of team you
have, I think.
Kirill Eremenko: 00:56:07 Gotcha. I know you're passionate about leading data
science teams. Why are you passionate about that?
Scott Clendaniel: 00:56:15 Because I think so much can be done outside of just the
algorithm, and I think there has been such a push,
especially in the past 10 years, on the type of algorithm
you use. The algorithm isn't necessarily what's going to lead
you to the best performance. I'm going to steal a story
from Stephen Covey. He said to pretend you had a
bunch of folks, and they're trying to cut a trail through
the jungle. They're like, "Okay, we're going to have Fred in
the front of the group because he's really good at dealing
with a machete, and we're going to have Michelle. She is
the absolute expert at machete sharpening, and she's
really good at that part, and such and so forth and
everything else."
Scott Clendaniel: 00:56:53 The leader is the one who climbs the ladder up and
shouts down, "Wrong jungle." You got to be able to
change. You need to be able to figure out if you're in the
right jungle or not. A lot of managers are not terribly good
at that, and so you need to have that holistic view in
addition to the expertise of data scientists.
Kirill Eremenko: 00:57:17 Fantastic. What advice do you have for data scientists or
advanced data scientists who want to become leaders or
data science managers?
Scott Clendaniel: 00:57:32 It's understanding a fact that all data scientists need to
come to grips with, and that is that data science is not
about data. It's about people. Let me explain what I mean
by that. You're trying to solve people's problems. You're
trying to help people. You're trying to communicate with
people. Whether you're in accounting or data science or
medicine or nuclear physics or social work or anything
else, it's about people. A lot of people come into our field
unfortunately, because they don't really like dealing with
people. They like numbers more.
Scott Clendaniel: 00:58:04 It needs to be a blend. At the end of the day, you're
always trying to help people meet their needs. Data
scientists use data and algorithms and techniques to be
able to achieve that goal, but the goal isn't different. The more you understand people, communication, storytelling, visualization, the so-called soft skills, the better you'll be able to grease the wheels, get people to where they need to be, and solve their problems in the best possible way.
Scott Clendaniel: 00:58:35 You can't skimp on that and lead a data science team effectively.
Kirill Eremenko: 00:58:42 Love it, so not just data science leadership, but data
science itself is about people. If you want to become a data science leader one day, then start now. As a data scientist, start honing your people skills.
Scott Clendaniel: 00:58:59 Absolutely.
Kirill Eremenko: 00:59:00 What's the recommendation? How does somebody go about it? There aren't many online courses on people skills, though there are a lot on technical skills. Where do you learn the people skills?
Scott Clendaniel: 00:59:10 It depends on where you look. I'm actually going to push back on that one a little bit. We in the data science community like to read our data science blogs, and if you focus only there, that's all you're going to find. There are all kinds of resources. I'll tell you, my particular favorite is the work of Stephen Covey, and my book recommendation, which I'm going to sneak in here while you're not looking, is the Seven Habits of Highly Effective People.
Scott Clendaniel: 00:59:36 It talks a lot about people skills, talks a lot about problem
solving. That can be applied to data science or accounting
or physics or sociology or anything else. Focus on those types of skills. Also, take data visualization classes, not just ones that show how to visualize data, but ones that teach you to choose a visualization based on your audience and that audience's needs. Focus on that piece. What do we need to communicate?
Scott Clendaniel: 01:00:02 What do they need to be able to solve their problem, and how do I give that to them? As opposed to, "Here's my wildly cool analysis that I did, 700 pages long, that no one's ever going to read." That's not the solution to the problem.
Kirill Eremenko: 01:00:17 Gotcha. Absolutely. Let's move on to questions about the future. Snehal asks, "What will come next after advanced AI?"
Scott Clendaniel: 01:00:34 If we're not careful, we're going to run into another AI
winter. Let me tell you what I mean by that. If you look at
Gartner's Hype Cycle, where they talk about the stages of
a technology, you tend to get overly hyped expectations,
and then you end up falling off the cliff into what they call
the trough of disillusionment. We can bicker over whether
those are good names or not. But if you set people's
expectations too high, and then you don't meet them,
they don't tend to say, "Scott Clendaniel's particular
model didn't do very well."
Scott Clendaniel: 01:01:07 What they tend to say is, "I knew that AI stuff was a
bunch of hooey, and we never should have invested in it. I
don't want to do a model again. I don't want anyone
coming in here talking about statistics. I don't want to
hear about machine learning. I don't want to hear... It's
all garbage, because Scott Clendaniel's first model didn't
do well." You need to really be aware of that. 85%,
according to Gartner, of all models do not reach
production.
Scott Clendaniel: 01:01:29 Think about that for a second, 85%. Our industry currently has a 15% success rate. I don't know of too many fields that can survive that. My big concern is that unless we get back to actually solving organizational issues and fixing real problems, as opposed to, "Gee, look at my AUC. Doesn't it look great?", the future of AI is going to go into a dark period for a while.
Kirill Eremenko: 01:01:57 What about on the flip side? Jacques asks, "There's a fear
that AI could replace humans in their jobs. What would
you tell a concerned human being about that?"
Scott Clendaniel: 01:02:11 I would look at a lot of the research that's come out of RPA, robotic process automation. To the largest extent, in that research and at the conferences I've been to, people's jobs don't get replaced. In other words, people don't get replaced. They get different jobs. Meaning that if they're working in the accounting department, they stop copying stuff in Excel from spreadsheet A to spreadsheet B, and get to work on, "Do we need the spreadsheet in the first place?"
Scott Clendaniel: 01:02:37 That's not a bad thing. There was a lot of concern that AI was going to replace all kinds of people, and I haven't seen a lot of that happen yet. I don't think radiology is the first career I would jump into right at the moment, because a lot of it is being automated, but you may end up with different types of jobs that a radiologist might take on, like applying the results from lots of different analyses, from all kinds of different X-rays, to diagnosing a disease.
Scott Clendaniel: 01:03:05 I think some jobs always get replaced by technology.
There aren't a lot of buggy whip manufacturing jobs left
anymore, but I don't see AI replacing all kinds of people.
The head of Stanford's AI lab had a great quote: "We're a lot closer to discovering a smart washing machine than we are to Terminators taking over the world." I think that's true. I think we need to be careful of it, but I think people are perhaps overly concerned at this point in time that AI is going to replace everyone's job.
Kirill Eremenko: 01:03:42 Gotcha. There was a report by the World Economic Forum in 2018 that predicted that, I think by 2025 or 2022, I'm not sure exactly of the year, AI will displace 75 million jobs worldwide, whereas it will create 133 million jobs. That's a ratio of roughly 1.8 jobs created for every job displaced.
Scott Clendaniel: 01:04:07 I think that's a much better example than the one I just
gave.
Kirill Eremenko: 01:04:13 I think they're both absolutely valid. What are your thoughts on AI replacing data scientists themselves, specifically AutoML and products like DataRobot?
Scott Clendaniel: 01:04:25 I need to be careful here. I need to choose my words carefully on that one.
Kirill Eremenko: 01:04:29 We can skip that question. That's okay.
Scott Clendaniel: 01:04:32 I think that a lot of AI functions can be helped by automation. Ensembling, picking the correct algorithm, hyperparameter tuning, a lot of those will become automated, but there's a lot of room for creativity on the feature engineering side. There's a lot of room for creativity that's hard to replace. Even something as simple as ratios: models are terrible at calculating ratios. They just are.
Scott Clendaniel: 01:05:02 For example, think of a credit score. Everything I currently owe, as one input, is not a great predictor. How much total credit I have available, if you add up the credit lines from all my credit cards, is also not a very good predictor by itself. A classic algorithm might say, "Okay, let's throw them both out, because they don't have high correlation to my result." The trick is the percent utilization: out of that big pile of credit, what percentage am I using? That is hugely predictive of your credit score.
Scott Clendaniel: 01:05:35 It's things as simple as ratios that I think are going to be hard to automate away. Many functions may be assisted by automation, even in data science, but if we focus on the right skills and the problem-solving aspects, our work is less likely to be automated away, at least in the short term.
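[Editor's note: As an illustration of the kind of automation Scott expects, here is a minimal sketch of automated hyperparameter tuning using scikit-learn's GridSearchCV. The model, parameter grid, and toy data are hypothetical, chosen only for illustration.]

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a real modeling problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# The search tries every parameter combination with cross-validation
# and keeps the best-scoring one.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```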
Kirill Eremenko: 01:06:00 Gotcha. Thank you, very, very cool answer. Adly asks,
"Does every business need to adopt AI?"
Scott Clendaniel: 01:06:11 No is the short answer. [inaudible 01:06:16] not every business does. I think it's silly for us to pretend that every business in the world needs AI. Every business could stand to make better decisions than it does right now, and to the extent that AI helps with that, great. To the extent that AI doesn't help with that, no. Also, some businesses don't have a lot of good data. If you don't have good data, you can't really build great models, so AI isn't going to be a particular help.
Scott Clendaniel: 01:06:45 I think that every business needs to make better
decisions, and businesses that have access to good AI
should take advantage of it. Those that don't, don't worry
about it.
Kirill Eremenko: 01:06:57 What are your thoughts on Andrew Ng's comment that AI is the new electricity? Similar to how, 100 years ago, only 50% of the U.S. was electrified and now everything uses electricity, AI will be adopted, similarly but faster, by virtually all organizations; otherwise, they'll lose out to competitive pressure. What are your thoughts on that, keeping in mind that not every business needs AI?
Scott Clendaniel: 01:07:25 Well, let's follow that example through. In the 1920s, you used to have organizations with a CEO, but it wasn't a chief executive officer. It was a chief electricity officer, whose sole responsibility was to figure out... I don't know a whole lot of organizations that are still hiring chief electricity officers. I think that better decision making, again, is the key more than AI itself. I think it's going to help more and more industries.
Scott Clendaniel: 01:07:55 I just don't think you're going to have VIKI from I, Robot making all the decisions for the planet. I think that's an overblown fear. It's going to impact more and more organizations, but we tend to swing on a pendulum from "No AI. It doesn't help anything," to "AI solves everything." The answer tends to be somewhere in the middle. Just be aware of that.
Kirill Eremenko: 01:08:25 Gotcha. Understood. One final question, this one from me. There are so many ways to structure your data science division; this is a data science management style question. One is to embed individual data scientists across different functions like sales, finance, operations, and so on. Another is to have a centralized data science team that serves all those functions. What is your preferred style and why?
Scott Clendaniel: 01:09:04 I'm going to steal this one from Harvard Business Review. That's to use a hub and spoke model. You have a central core of folks who help the rest of the organization work on data science projects. These are the folks who make sure everyone has the right tools and who help establish processes, standards, and so forth. That team is very small, and most data scientists sit in the individual groups they serve.
Scott Clendaniel: 01:09:31 Your hub supports the people in the spokes in the different departments and helps them achieve their goals. I think that is the best way to do it. It is so easy for folks in a centralized data science group to be out of touch with the needs of their clients; actually making them physically sit in that organizational structure helps solve a lot of those needs. That's the way I would do it.
Kirill Eremenko: 01:09:58 Wow. Fantastic. Thank you. Scott, it's been amazing. We've actually gone over time, but it was totally worth it. I loved these questions and your answers. Before I let you go, before we wrap up, I wanted to ask: do you have a recommendation or a piece of advice specifically for the advanced data scientists out there listening to this podcast, any parting thoughts?
Scott Clendaniel: 01:10:26 I do. That is that... I'll tell it through another story. The
first time I was ever invited to participate in an AI
conference, I went running into my co-worker's office. Her
name was Beth. I said, "Beth, it's fantastic. I've been
invited to speak at an artificial intelligence conference.
Isn't that great?" She folds her arms across her chest. She
leans back in her chair. She raises one eyebrow and says,
"Do you really feel qualified to speak at such a
conference?"
Scott Clendaniel: 01:10:57 I was like, "I did 10 seconds ago." There are a lot of folks in our industry who act like Beth. That's a bad idea. We need to be inclusive. Get down off your high horse. It's a technology. It's an area of expertise. We need to be inclusive, not exclusive. Try to be nice to people. Try to help them achieve their goals. Common manners, being polite, and listening to people are really important in any field. But if you're hoping for AI to have a big impact in your company, trying to prove how smart you are and how unsmart they are is a really bad idea.
Scott Clendaniel: 01:11:37 There are way too many of us who do that. Be inclusive. Incorporate as many people as you can. Be as helpful as you can, and stop taking the approach of, "I am some type of god of intellect because I know how to build a model." You can build a model, and so could a lot of other people, if only someone would take the time to show them how to do it. Be the person who helps bring more people into the fold, not the one explaining to everybody else why they're wrong.
Kirill Eremenko: 01:12:07 That's amazing advice. You actually walk the walk, not just talk the talk, right? That's the saying. You live by that yourself.
Scott Clendaniel: 01:12:17 Thank you.
Kirill Eremenko: 01:12:19 I look at your comments on LinkedIn, and you're always there supporting people, answering questions every time. Even in this thread, when people asked you a question, you didn't just answer and say thank you. You actually posted a little image of a written thank you, a different one every time. I can just imagine you have a whole library of these that you can use at any given time. It's really cool. Why do you do it? Why do you help the community so much?
Scott Clendaniel: 01:12:48 I think because I was treated so poorly by the experts in our field when I tried to break in. I like to joke with people that for the first half of my career, people told me I couldn't do this because I didn't have a PhD in statistics. For the second half of my career, everyone has told me I can't do it because I don't have a PhD in data science. I was like, "But some of my models seem to be working pretty well, I don't know." It's just a way of bringing more people in and being helpful, because people need encouragement.
Scott Clendaniel: 01:13:22 We've got enough people out there in the world telling
everyone else that they're wrong. I think a little bit of
kindness and support to other people goes a long way. I
think we'd be in better shape if we all just treated one another with a little more respect, a little more kindness, a little less roughness, and a little less intellectual aloofness.
Kirill Eremenko: 01:13:42 Awesome. Fantastic. Well, Scott, thank you.
Scott Clendaniel: 01:13:44 I don't want anybody else to have a three-year-old on the phone saying, "Did you pack my toys yet?" with no prospects of finding a new job.
Kirill Eremenko: 01:13:54 Thank you very much for sharing that. I think it's-
Scott Clendaniel: 01:13:57 Thank you. I really enjoyed this.
Kirill Eremenko: 01:14:00 Awesome. Me too. Tell us, how can people get in touch, follow you, connect with you?
Scott Clendaniel: 01:14:07 The best way is to follow me on LinkedIn at T.Scott Clendaniel. I can't accept all the invitation requests because I'm almost at the 30,000-connection limit; they won't allow more people in, but please follow me. If you have questions, send me a message. I can't answer everybody, but I do my best. I think I answered, gosh, over two dozen questions in the existing thread, and I will continue to try to do that.
Kirill Eremenko: 01:14:30 Fantastic. Thank you very much. You already gave your
book recommendation. Could you just remind us? I think
it was Seven Habits of Highly Effective People.
Scott Clendaniel: 01:14:38 Seven Habits of Highly Effective People by Stephen Covey,
who is no longer with us-
Kirill Eremenko: 01:14:44 Awesome. That's-
Scott Clendaniel: 01:14:44 ... but his legacy lives on, great advice in there.
Kirill Eremenko: 01:14:48 Gotcha. Wonderful. On that note, once again, thank you so much. We'll share all the links in the show notes, and everybody listening, please connect with Scott. This has been a great opportunity to have you on the podcast. Thank you for coming.
Scott Clendaniel: 01:15:07 Thank you. Take care.
Kirill Eremenko: 01:15:14 There we have it, everybody. I hope you enjoyed this conversation as much as I did. I learned a ton from this discussion. There were so many cool advanced things that I didn't know about before, and I was just blown away. Thank you so much, Scott, for coming on the show and sharing all these insights with us. Perhaps my favorite part was when we spoke about oversampling the minority class. I could feel Scott's confidence. It's quite a tricky technique to throw away a lot of your data, undersampling the majority class, in order to make sure that the positives and negatives are roughly equal.
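[Editor's note: The balancing technique described here, discarding rows from the over-represented class until the classes are roughly equal, is commonly called random undersampling of the majority class. A minimal sketch in Python with pandas follows; the data set, column names, and values are invented for illustration.]

```python
import pandas as pd

# Hypothetical imbalanced fraud data: is_fraud == 1 is the rare minority class.
df = pd.DataFrame({
    "amount":   [20, 35, 15, 900, 22, 18, 40, 1200],
    "is_fraud": [0, 0, 0, 1, 0, 0, 0, 1],
})

minority = df[df["is_fraud"] == 1]
majority = df[df["is_fraud"] == 0]

# Randomly discard majority-class rows until both classes are the same size.
majority_down = majority.sample(n=len(minority), random_state=42)

# Recombine and shuffle; positives and negatives are now roughly balanced.
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)
print(balanced["is_fraud"].value_counts())
```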
Kirill Eremenko: 01:15:48 It's a difficult decision to make, but from the confidence with which he spoke, it was clear that he's done this many times, and it obviously works for him. I also really liked the discussion about data science leadership: if you want to be a data science leader one day, start now, because you need soft skills as a data scientist, not just as a lead data scientist. There we go.
Kirill Eremenko: 01:16:12 As usual, you can get the show notes at
superdatascience.com/385. That's
superdatascience.com/385. There, you can get the
transcript for this episode, any materials we mentioned,
including a URL to Scott's LinkedIn. Make sure to
connect with him, or just look him up on LinkedIn. It's
T.Scott Clendaniel. He's always, always very helpful. Just
recently, he shared some amazing cheat sheets for
machine learning. Even that is worth checking out.
Kirill Eremenko: 01:16:45 I had a look at them. These were some cheat sheets shared by Stanford University. He shared them on his LinkedIn, and there are some really cool cheat sheets there, including ones on cross-validation. Check that out. As always, if you enjoyed this episode, share it with somebody, especially if you know an intermediate data scientist who's looking to become advanced, or an advanced data scientist who wants to further their skills in the space, maybe a colleague, a friend, a family member.
Kirill Eremenko: 01:17:17 Send them this episode; it's very easy to share. Send them the link superdatascience.com/385. On that note, my friends, thank you so much for being here today and for sharing this hour, or a bit more than an hour, with us. I hope to see you back here next time, where we will continue to deliver on the promise of amazing episodes with very interesting, incredible guests. Until then, happy analyzing.