http://www.superdatascience.com/1

Kirill: This is episode number one with ex-chemical engineer and

now data science wizard, Ruben Kogel.

Welcome to the SuperDataScience podcast. My name is Kirill

Eremenko, data science coach and lifestyle entrepreneur.

And each week, we bring inspiring people and ideas to help

you build your successful career in data science. Thanks for

being here today and now let’s make complex simple.

(background music plays)

Hello and welcome to the very first episode of the

SuperDataScience podcast. I can’t explain how excited I am

to finally get the show off the ground. I’ve had this idea for

literally months now and finally today we’re kicking things

off. And what is the show all going to be about?

The show is going to be about inviting the most inspiring data

scientists in the world and talking to them about what they

do, what their background is, what they’ve learned in their

past on the data science journey and what they can share,

what insights, what tools and methodologies they can share

with us and what we can learn together from them. So, I’m

very excited that you’re here from the very start that you’re

listening to the very first episode. Thank you so much for

being part of this journey. I’m sure that together we’re going

to learn a lot.

And I’m very glad that this very first episode, we kicked off on

a very high note, I spoke with Ruben Kogel who’s a data

scientist at Udemy. So if you’re not familiar with Udemy it’s

the biggest online educational platform in the world.

Currently there are over eleven million students learning

through Udemy so if you haven’t checked them out, definitely

check them out. Their courses cover basically anything that you


could potentially imagine and personally I’m an instructor in

Udemy as well. I have over twenty courses there and near

fifty thousand students. So it’s a great learning platform and

Ruben Kogel is one of the head data scientists in one of the

divisions in Udemy. So, a division that works on content and

content marketing and Ruben shared some very powerful

insights about what he does on a daily basis at Udemy and

moreover how he transferred his chemical engineering

background into a data science skill set: how, in taking an

MBA, he specifically selected his subjects in such a way

that he was able to learn more about data science and get

into that field, making the jump from chemical engineering

to data science through his MBA.

He also talked about communicating insights and how

important it is in the data science role. Also we discussed lots

of other topics such as identifying problems when you’re a

data scientist. How important that is. How to combine data

science and product strategy. That is something that Ruben

does on a daily basis and you’ll learn more about that. And

so that’s a very powerful skill to have, especially if you’re

working in the start-up space, with companies that are

predominantly in Silicon Valley or any other kind of

location where you’ve got this start-up culture. Then we also

spoke about managing a team of data scientists. Ruben

has quite a bit of experience with that and had some tips, so if

you’re a manager in data science or in the analytics space,

you can pick up some good tips from there.

And also, we talked about managing the inflow of requests,

which is valuable to any data scientist: how to manage

the inflow of requests. Ruben gave us an example of

his approach with a Trello board. We talked about mentors in


data science and of course, we talked about lots of different

analytics tools. We talked about (inaudible) SQL and an add-on

which I don’t personally know about. We talked about Wagon,

we talked about R versus Python. So the (inaudible) question:

which one is better which is more preferable and we talked

about many many other things in this podcast so I’m sure

you’re going to enjoy it. And we even touched on self-serve

analytics, a growing space in the field of data science. So

without further ado, I bring to you Ruben Kogel of

Udemy.com and enjoy.

(background music)

Kirill: Hey guys, welcome to the Podcast. I’ve got Ruben Kogel here

from Udemy. Super excited about this very first episode.

Ruben, welcome! How are you going?

Ruben: Thank you! Thanks for having me over. I’m doing great.

Kirill: Awesome. It’s great to hear you and for those of you who

don’t know, I met Ruben for the first time when I was in San

Francisco a couple of months ago for the first Udemy live

conference. And it was pretty exciting and he ran some great

presentations. But just so that everybody gets up to

speed, Ruben, tell us a bit about what you do. What’s

your title in the company and what exactly do you do on your

own?

Ruben: Yeah sure. So, I’m the senior manager of analytics and

strategy in Udemy and basically what I do is that I help the

content team which is the team that looks at the courses and

the content in Udemy in order to figure out what is in the

catalogue, what’s in the selection of courses, what’s the

quality of courses, how can we improve our catalogue to


make our students happier. In practice, a lot of my work

has to do with - are we measuring the right thing, are we

measuring satisfaction, is that data available so that the

people who are in charge of bringing in courses know which

courses are good and how do we measure also the selection

of courses so that we can slowly build a better and better

catalogue for our students.

Kirill: Awesome! That’s pretty cool so you apply Data Science

techniques in order to measure those kinds of metrics. Is

that correct?

Ruben: Totally. I mean data science comes in different parts of my

job. So there’s the more basic part, which is

instrumentation. At the start of any data endeavor,

you wanna make sure you’re measuring the right things and

that the data is available. So for me it means: are we

measuring student satisfaction correctly and is that data

available so that people in the company can access it and

make decisions on it. There’s another level at which I use

data science, when broader questions

come in, and that’s what I call ad-hoc analysis.

Someone might ask me what happens if we remove some of

the low-quality courses from the platform. Can we predict

the impact on satisfaction or revenue? So that’s the kind of

question where you need to know the structure of the

problem but also come up with some predictions and use

some techniques to evaluate what the impact would be,

with maybe some confidence intervals. And there’s

another type of question that can be: well, we know that

there’s a variety of courses and they all have different quality

scores, and we wanna know what it is about these courses that’s

driving these quality scores. You know, is it the audio

quality, the video quality, or the instructor’s delivery? So in

that case, you know data science comes in to play both in

terms of structuring the problem but also running some sort

of a statistical analysis to extract the importance of the

different variables and come back with an answer saying like

– “well I think this variable is the most important and you

know if you move that variable by like 1% that will have an

impact on student ratings by that much.” So that’s

another example.
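
To make the last example concrete, here is a minimal Python sketch of the kind of driver analysis Ruben describes: estimating how much a course rating moves when one quality variable moves. The scores, variable names and the `fit_line` helper are all hypothetical, invented for illustration; nothing here is Udemy data or Udemy's actual method. It fits one simple least-squares line per variable, whereas a real analysis would more likely fit all variables jointly.

```python
# Hypothetical per-course scores on a 1-5 scale; none of this is Udemy data.
audio_quality = [1, 2, 3, 4, 5]
video_quality = [2, 1, 4, 3, 5]
student_rating = [2.5, 3.0, 3.5, 4.0, 4.5]


def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


audio_slope, audio_intercept = fit_line(audio_quality, student_rating)
video_slope, video_intercept = fit_line(video_quality, student_rating)

# The slope is the "if you move that variable by this much,
# ratings move by that much" statement from the conversation.
for name, slope in [("audio", audio_slope), ("video", video_slope)]:
    print(f"+1 point of {name} quality ~ {slope:+.2f} rating points")
```

Comparing slopes only makes sense here because both hypothetical variables share the same scale; with mixed units one would standardise them first.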

Kirill: That’s pretty cool. So, kind of two main types: one is

measuring existing metrics and how you can tweak them to

improve the experience of students, and the other is where

you’re doing kind of behavioral

analytics and predictive behavioral analytics to see how you

can change things so that the future experience is going to be

better. That’s pretty cool in my view and it’s great that you

get to do both parts in your role as a data scientist at Udemy.

Ruben: Yeah, it’s very very cool. Actually what I really like about my

role here is it’s really the interface between data science and

sort of strategy and product, because I get to work on data sets

and I get to do a lot of data analysis, but at the end of the day

the people I talk to are the VP of content or, you know, the

director of course acquisition, people who are, you

know, in the weeds. They are doing the business

and I get to give them recommendations, I get to influence,

you know, their decisions or influence the product with data, so

it’s this really cool interface between data and business.

Kirill: Definitely and that’s what I also found that the most

interesting and the most impactful roles and careers happen

on the verge of two fields whether it’s for example you could

take like Physics and chemistry or biology and chemistry but


those are very exaggerated examples. But even in data science,

one thing is just to do analytics, and a whole other thing is to

do the analytics and at the same time convey the findings

and work with those people that use analytics. And just on

that as well, how do you find conveying such complicated

analytics to your stakeholders, like you said, the VP of

content? Are there any certain approaches that you use or any

tips you can give to our listeners on communicating these

findings to senior stakeholders in the company?

Ruben: You know I’ll start by saying that any analysis that you do is

useless and irrelevant if you’re not able to communicate the

findings to people. So, communication is a huge part of being

a successful data scientist. There are two

rules that I try to use when I communicate things. One is:

translate technical concepts into something that people can

understand, so you don’t just talk about confidence

(inaudible). You can talk about maybe your confidence in the

data, or you can say, “Well, I think the predicted

revenue will be between those two

bounds.” You don’t have to say you’re ninety-five percent

confident because that doesn’t add a lot of value but at the

same time you also wanna convey precision in your

communication and make sure that you don’t throw in

answers like “yeah, I think we should do A because...”. I think

it’s important that you convey, like, “Well, I looked at the data

and if we do A, we can lift revenue by this amount, and

maybe there’s a little bit of an interval in terms of what I

think we can lift revenue by, but this is what the analysis is

showing.” So it’s this balance between being precise and

showing that you’ve done your homework and showing

that this is rooted in data, but at the same time translating

it into layman’s terms and stripping away any technicalities that

don’t add a lot of value to your message.

Kirill: Totally agree. I’ve got a classic example on that, how often

stakeholders especially senior stakeholders are very skeptical

about sample sizes for instance right so if you run some

analysis on a sample size of 157 or 300 they might be prone

to saying actually we want a sample size of 10,000 but you

as a data scientist know that that sample size that you ran is

significant, statistically significant. So it is important to

convey these findings in a way that, when you are confident

in them yourself, you don’t have to go into all of that

detail to explain exactly the methodology behind it but

actually just convey the confidence in the way you present,

the way you position your analysis so totally agree on that

one.
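
As a rough illustration of the sample-size point, the Python sketch below computes the approximate 95% margin of error for a proportion at the sample sizes mentioned above. It assumes a simple random sample, the normal approximation, and a worst-case proportion of 50%; the `margin_of_error` helper and those inputs are assumptions added for illustration and are not from the conversation. Whether a given sample is large enough still depends on the size of the effect being measured.

```python
import math

Z_95 = 1.96  # normal-approximation multiplier for a 95% interval


def margin_of_error(proportion, n, z=Z_95):
    """Half-width of the approximate confidence interval for a proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / n)


# Worst case is a 50% proportion, which maximises the variance p * (1 - p).
for n in (157, 300, 10_000):
    moe = margin_of_error(0.5, n)
    print(f"n={n:>6}: 50% +/- {moe * 100:.1f} points")
```

This is also one way to phrase a result for stakeholders in the spirit of Ruben's advice: state the range the estimate falls in rather than the statistical machinery behind it.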

The interesting thing that I’d like to ask you and probably a

lot of our listeners would be curious about is – can you tell

us a bit of your background. So how did you come into being

a data scientist and progress, of course you progress further

in your career and now you’re the head of analytics in that

department but originally how did you get into data science

because this is quite a new field and back in the day it wasn’t

being taught in university so how did you get here and what

are the steps?

Ruben: It’s a very interesting question because I had a very

meandering path. I didn’t start at all in data. My background

was in applied physics and then I switched to material

science and I was a chemical engineer for many many years.

I was dealing with data but not the type of analysis that you

do when you are a data scientist. It was a lot rougher and a lot

less sophisticated. What happened is that I actually was

looking to transition to something different. I went to

business school, and you don’t typically think of business

school as a place where you learn data science, but I had the

opportunity to take two statistics courses and one data

mining course, and I suddenly fell in love with the field.

The more I learned about it, the more I was really

intrigued, and so I really learned the theory in business

school. Then I had this opportunity to come work at

Udemy and apply some of my knowledge, and that’s how I

started my career. So it is pretty recent and it was also

like a pretty stark contrast from what I was doing before.

Kirill: Definitely. That’s a great jump and like you say, one wouldn’t

expect that you would learn all the necessary data skill at

business school but I guess you picked the right subjects

and that’s a great testament to how lucrative the data science

field is and tell us the skills that you developed as a chemical

engineer or working in that field. Is there any way to leverage

them currently because data science comes from all the

different areas? Some people come from acting classes, some

people can come from economics or finance but coming from

a chemistry background is there any skills or any particular

mind sets that you can share with us that you can leverage

with your current work as a data scientist?

Ruben: Yeah. Totally. Surprisingly, it is not the data skills that I

leverage from my chemical engineering background. It’s more

the problem-solving thinking and communication skills. I think

any engineering work has a heavy component of

troubleshooting and problem solving and so I was doing a lot

of that in my job and it really forced me to come up with like

a very systematic approach to breaking down a problem:

coming up with hypotheses, testing those hypotheses,

and being extremely systematic, organized and

structured about my thinking. So I definitely learned that

from my engineering background. The other thing that I

learned is, as I mentioned, communication skills. As an

engineer especially like in my position, I was doing a lot of

like account management. I had also to translate a lot of very

complex technical experiments and results into something

that the business people could understand and the ability to

summarize complex concepts and notions into a few, you

know, whether it’s a slide or whether it’s an email, you know,

a prepared draft that really sums up the insights, is very

important. That’s something that I learned in my engineering

job and something that has helped me also in my current

career.

Kirill: Sounds really cool and definitely the problem solving skills

that’s a very valuable thing to have when you’re dealing with

data science challenges but from what you’re saying I

gathered that your communication skills had played a very

significant role in terms of your success and how would you

recommend because – my thinking is that a lot of times when

people are starting out into the field of data science now that

it’s getting more and more popular, sometimes they can get

pigeonholed into certain roles where they are

performing some analytics but they don’t have the exposure

to go and share their insights with stakeholders. They were

just performing certain sql queries, or certain analytical

procedures but they don’t actually have a chance to

communicate insights. So what would your suggestion be,

off the top of your head, for people or our listeners who

might be in that situation to somehow start developing those

communication skills nevertheless.


Ruben: Yeah. I think there are two ways you can do that – one is,

even in your day-to-day interaction with customers. The way I

think of analytics or data science is, you know, you have

customers inside the company whether they’re technical

customers or business customers, people ask you questions

or ask you to perform data analysis to give them an answer

so whenever you interact with your customers, you can

always go the extra mile and structure your answer not

just as the output of a regression analysis or as a SQL query

but like you could try and explain like why you think this is

the right thing to do, what that means in practice, what

would be your recommendation. So always push the

results past just the technical output and package them

in a way that shows that you thought about the implications

and the meaning of the analysis so that’s one thing.

The second thing is I also like always recommend people and

that’s true of people who work for me or people who might be

working in other positions: whenever you are given a

problem, it’s always a good practice to try and dig in to

exactly the problem the person is trying to solve. Because

oftentimes, someone will leave comments like “hey can you

pull this data or hey can you build a dashboard for that or

can you do an analysis around this”, but really, they are

trying to solve a problem and they may not tell you what

problem they are trying to solve. So if you engage with that

person or customer and try to really get to the bottom of what

problem they are trying to solve, all of a sudden you are

starting a conversation around what they are trying to solve,

what you can bring besides just the requested analysis, and how

you can help them frame their problem. So all of a sudden

you are engaged in communication, you are engaged in

expressing a problem, breaking it down into some

more elemental data problems and coming back with a

solution that addresses the real underlying problem. So

that’s one way in which you can like push the boundaries of

your current job and really expand into delivering like more

value that is built upon this communication and built upon

just like understanding the underlying problem. So that’s

number one.

I think number two also is generally speaking, analysts have

like this unique ability to look into the data, come up with

some insights that no one else in the organization has and so

as an analyst, you also have the opportunity to maybe start

addressing problems that other people may have also or

some people may not have thought about and you can create

value by being a little proactive about what you think we

should be looking into, what you think would be useful

analysis with some useful insights for other teams. So there’s

also that opportunity because no one else can really look into

the data you are the only person who has access to the data

and also can bring out the insights.

Kirill: Beautiful. Love it. Especially the - what is exactly your

problem that is one of the best skills to have as a data

scientist to help people identify the problem because they

often come just for data instead of actually identifying the

problem. And regardless of what level you are at as a data

scientist or data analyst eventually that skill is the one that

will push you forward and combining it with what you said

about the proactive approach that’s where you become the

doctor for the organization and you walk around and you

measure what’s going wrong and how you can help fix it. So

yeah definitely agree with those two. Those were some of the


very powerful skills to have. And you mentioned that you are

in charge of a few people. Can you tell us more about that?

How many people are you in charge of and how do you find

managing a team of analysts or data scientists? What

are the challenges that you face on a day to day basis?

Ruben: Right now my team is down to one person. In its

heyday, we had a bigger team, but you know, things move

quite quickly in Silicon Valley. So I have one person

reporting to me and I’m hiring at least one person at the moment. Well,

the thing that I find most challenging in terms of

managing and growing analysts or data scientists is that

oftentimes there’s this tension between trying to please your

internal customers and trying to make as many people happy

as possible in the shortest amount of time, and trying to

build long-term value and work on longer-term projects.

The way I think of like data scientist is like you’re the end of

the chain. There’s a lot of people in the organization you

know they manage project, they ask other people to do things

and you know eventually when you put all the contributions

together you come up with a product but when you are in

data science, you don’t ask other people to do things or it’s

very rare like usually, you’re the last person that they ask

and so there’s a lot of people coming to you and you end up

having a ton of requests. Managing the inflow of requests

and managing the different things you’re working on can

be very challenging. Oftentimes, new analysts tend to

gravitate towards the more short-term projects.

Whatever the last person asked them to do, they’re like “oh

yeah yeah, I’ll do it right away”, to the detriment of

working on the bigger, longer-term and more impactful projects.

So that’s one of the challenges.


Kirill: That’s definitely true and that kind of flows into the art of

saying no to people who will come to you as a data scientist

especially when you’re running a department and when you

have a few successes you’ll find even more of the other

departments from the company and people and stakeholders

coming to you with requests. So how do you say no? How do

you tell people that, “hey, your project is really cool and I’d

love to work on it but at the same time I’ve got other

commitments”? What are your tips around that?

Ruben: You know you don’t really say no. The truth is (inaudible)

you make it very clear what your priorities are, and one of the

tools that we use here at Udemy is a Trello board. Trello is

a productivity tool; it’s like virtual post-its. Essentially it

enables you to show a public board of what you’re working

on, what the stages of the different projects are, who the

stakeholders of the different projects are, and if someone asks you

to do an analysis you can say, “OK, no problem, can you just

record it on my board?”, and they quickly realize that their card

is, you know, one of 50 other cards and that there’s actually a

prioritization process for the team to start working on it. So

that’s one way in which we’re working.

The second part is, really what you wanna do is, you

don’t wanna work on each individual request and solve them.

What you wanna build is a scalable infrastructure. You wanna

build scalable analytics. What that means is, instead of

answering the same question over and over again, or instead

of pulling data for everybody in the company, you build

self-serve tools. You build tables, dashboards, web UIs that

enable people to access the data that they need so that they

don’t ask you to do the data pulls and the data analysis


anymore. So you can free up some of your bandwidth to work

on the more interesting project.

Kirill: Awesome. Love it. Self-serve analytics is becoming a

more and more popular concept in the world of data science,

specifically for that reason: to free up data scientists and to

empower the end user to do their own analysis. And so just on that,

what are the tools that you use at Udemy, if you can disclose

this, for the self-serve analytics side of things?

Ruben: The basis of the self-serve analytics is creating a set of

summary tables that have all the relevant information that

cover ninety percent of the use cases. What I mean by that is,

for example, (inaudible) questions that come in at Udemy are like

“Oh, do you have a list of courses that have been published

this month?” or “What is the total revenue of courses that were

published last month?” or “Can you look up a particular

instructor and see how many enrolments or reviews she has

in her courses?” So there’s a limited set of questions that

come up over and over again, and instead of building queries

every time you need to extract the information, you

just build one or two tables that have all that

information summarized, so people can access the data

directly by either querying the table with SQL or looking

it up in a dashboard. So in practice, what we do is we

have all our data flow into Redshift by

Amazon, and that lets you build tables on top of raw

tables, so we have what we call summary tables built

upon the raw ones. They can also power dashboards; we use

the Chartio UI, and people can use Chartio to extract

whatever information they need.
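
The summary-table idea can be sketched in a few lines of Python. This is an illustration only: an in-memory SQLite database stands in for the warehouse, and the table names, column names and rows (`courses`, `purchases`, `course_summary`) are hypothetical rather than Udemy's actual schema. The point is the pattern: aggregate the raw tables once, then let the recurring questions hit the small pre-built table.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE courses (course_id INTEGER PRIMARY KEY, instructor TEXT, published_month TEXT);
    CREATE TABLE purchases (course_id INTEGER, revenue REAL);
    INSERT INTO courses VALUES (1, 'alice', '2016-08'), (2, 'bob', '2016-09');
    INSERT INTO purchases VALUES (1, 10.0), (1, 15.0), (2, 20.0);

    -- Summary table: one pre-aggregated row per course, built on the raw tables.
    CREATE TABLE course_summary AS
    SELECT c.course_id, c.instructor, c.published_month,
           COUNT(p.course_id) AS enrollments,
           COALESCE(SUM(p.revenue), 0) AS total_revenue
    FROM courses c
    LEFT JOIN purchases p ON p.course_id = c.course_id
    GROUP BY c.course_id, c.instructor, c.published_month;
""")

# Recurring questions become simple lookups against the summary table.
summary_rows = conn.execute(
    "SELECT course_id, instructor, published_month, enrollments, total_revenue "
    "FROM course_summary ORDER BY course_id"
).fetchall()

revenue_2016_09 = conn.execute(
    "SELECT SUM(total_revenue) FROM course_summary WHERE published_month = ?",
    ("2016-09",),
).fetchone()[0]

print(summary_rows)
print(revenue_2016_09)
```

In a real setup the summary table would be refreshed on a schedule, and a dashboard tool would read from it instead of from the raw tables.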


Kirill: Okay, so Redshift and Chartio. Yep, those are the two tools.

Very nice answer. You use Amazon AWS for storage of

data, correct?

Ruben: Uhuh!

Kirill: And how did you find that? Has that been a recent

transition for your organization or has that always been the

case?

Ruben: No, it’s been a transition. About a year and a half ago we used

to have only our own traditional MySQL server, and then we

started exploring Redshift. (inaudible) Redshift and we saw

that it was much more powerful in terms of doing analytics,

because a typical MySQL database is optimized for writing

but not, honestly, for creating new tables. Redshift is

optimized for doing a lot of joins, a lot of analysis and reading

data. So, we’ve been using Redshift for like the last year and

a half and we’ve been scaling the size of our cluster as our

data grows and as our analytics team grows and it’s been

serving us pretty well actually.

Kirill: I’ve heard a lot of comments about that, that in AWS you can

scale with your needs, and that’s one of the biggest advantages

of AWS: as your organization grows and needs more

capacity and power, you can scale, and you don’t have to

purchase that in advance. And also there are month-to-month

or other types of plans there. Very convenient. Usually the

only setback that organizations have commented on

is for organizations that deal with customer data, client-facing

data, such as Udemy. They would have

maybe certain regulatory issues with outsourcing the storage

of the data to the cloud or to external systems like AWS. Did

you face any of that when you were making the transition?


Ruben: Not really in the sense that the data is still secure and

actually in the United States the only data that is being

heavily regulated is health data. So if you have health data, there are

health regulations that make it more difficult to work with

the cloud, or you have to work with certified vendors that can

really handle the regulations about how you

handle health data. For the rest, AWS serves a large range of

start-ups and they all have sensitive data, but you know, they

are set up to handle the sensitive data correctly, so there’s not

much concern around that. The only thing is that sometimes

you want data to be used for web apps, in which case, you

know, reading from Redshift is not the best way to populate

fields in your website or web app, so you might need a second

representation of the data that is faster to read based on calendar

storage.

Kirill: So we’ve started delving into the tools. We’ve touched on

management tools like Trello, which I found pretty

interesting: how you get somebody to post a post-it in Trello

on your board and then they realize that “hey, there are

projects that are going to be prioritized and at some point yours

might not be done very quickly”. And now we’ve moved on to

Redshift. What are the other tools that you use on a day-to-day

basis in your analytics role?

Ruben: The two technologies that I use, you know, all the time are SQL, or

in this case PostgreSQL, which is the basis for Redshift, and R.

So I do much of my analysis in R. Basically that’s the

language that I learned and I am very familiar with it. Some

other people at Udemy use Python. It really varies. But I

think both of the tools are very convenient and flexible,

and those are the tools of the data scientist nowadays.

Either Python or R: they have the packages and they are

open source. Each has a very active community, so those

are typical for data analysis. In terms of the tool that I

use where I actually write my SQL, there’s a really neat

company called Wagon. They have a beta product, but they

right now have the best SQL editor, especially for

PostgreSQL. It’s really neat because you can organize

your queries into different folders, everything is always saved

on the cloud so nothing is lost, and it has a neat

interface and it’s very easy to use.

Kirill: Awesome, so that was Wagon?

Ruben: Wagon, yeah.

Kirill: How do you spell that?

Ruben: W-A-G-O-N

Kirill: Beautiful. So I haven’t heard of that one before. Something to

definitely check out. And yeah. Very interesting how your

organization has a split between R and Python. I guess in the

start-up world it’s more common, but in larger

organizations that have been around for a while and have a lot of legacy

behind them, usually analysts don’t have that luxury of being

able to choose between the two. Definitely R is something

that a lot of our listeners are interested in, so could

you maybe share a couple of packages or a couple of

techniques that you most commonly use when you’re coding

in R?

Ruben: Yeah totally so I’m gonna probably disappoint you and your

listeners. I don’t use like a lot of different packages and

advanced techniques. I tend to stick to like basic R in fact I

don’t even run like others do. I just use like basic R, I don’t

use ggplot. I just use like the basic R graphics. I think


part of the reason is I don’t see R as a way to produce

extremely sophisticated reports or analysis. I use R mostly to

extract the information that I need and run like the basics,

the basic analysis so the typical thing that I would do in R is

like I would just import a CSV file, read it, do a little

cleaning, although I usually prefer to do my cleaning in SQL

because I don’t think you should be doing the cleaning at all.

I think really the cleaning should occur upstream and you

only launch R to do some data exploration, you know create

some graphs and do some statistical analysis. Typically, I

would run regression analysis I use to eliminate to do lots of

regression, I use (inaudible) to use random model or

understand the relative importance of the different features

in your model, so that’s kinda like my usage of R.

Kirill: Wonderful. I totally agree with that. R is definitely a very

powerful tool and every analyst, every data scientist has the

right to use it in the way they prefer, and you’re kind of using R

with a very lean approach, which can be very powerful as well.

The question – the million dollar question would be R versus

Python. What are your thoughts and why did you end up

picking R?

Ruben: My answer might disappoint you again. Like I never really

chose between the two. I started with R and it fits all my

needs. I spoke to a couple of people who use Python and I

quickly realized that for the type of analysis that I was doing, you

know offline analysis, building offline model, trying to

understand the drivers of a particular metric, Python would

be more complex and would not add a lot of value. What I

mean by that is in my day-to-day job, I don’t build actual, you

know, data products. I don’t try to deploy a predictive model

onto our infrastructure and therefore I don’t need it to be in


Python. All I do is download some data, generally from

Redshift, and I try to build some model to understand what’s

driving this metric up or down, and for this type of use case,

from what I read, R is the simplest and most direct way

of analyzing the data, and it’s also the language that I

know.

Kirill: Beautiful. That is definitely also the case in many situations:

you start off with one and then you just keep going with that

one. If it suits your needs, what’s the point of changing?

Totally appreciate that. Alright. So that’s really cool and we

went into some detail on the tools you use and some

techniques. I love that part of the conversation. Let’s move on to

some of the softer stuff. For instance, I

can see that you’ve changed and made this transition to data

science from chemistry and you’ve never looked back. You

were powering through it and progressing in your career,

growing other data scientists on your team and acting as a

mentor. But along the way did you have any influences that

helped you become a data scientist and persevere in this career,

maybe some mentors that you might have had, or

hobbies or some life-changing events or even some articles or

something that has really influenced you and helped you

along this career path as a data scientist?

Ruben: There was a big learning curve for me. I’d never worked as

a data scientist and in fact I didn’t really have a mentor at

Udemy, so I had to figure out a lot of things by myself or

ask people outside and so on. I really encourage

people to seek out mentors. In my case I had a good friend of

mine who had a bit of a head start on me. He started doing

data science five years before me. He is someone who has

strong opinions but also he’s very thoughtful about the


different approaches and choices that he was making, and I

often had conversations with him. We had a

regular coffee where we exchanged on technologies and

techniques. I found this interaction with him very helpful. I

went to different conferences. I would say all of those

conferences were helpful, but the one that I really liked was the

Airbnb one in San Francisco, the conference called OpenAir,

which I really appreciated. I also followed some blogs and

newsletters. There’s a newsletter that I really like called

datascienceweekly.org and it’s a collection of interesting

articles, about 10-12 interesting articles every week. I don’t

read all of them. I just pick the ones that seem interesting,

and if I get past the first paragraph and it’s really

interesting, then I read it, and slowly I build a

catalogue of tips, sort of thoughts that I think are very useful. And

in particular, I can’t remember if it was through this newsletter or

maybe something like LinkedIn online, I found someone

posting an old article from Leo Breiman, the guy who

created random forests, that he published in 2001, which

really resonated with me. So I would really encourage anyone

who is considering data science or starting in data science or

someone who’s even more advanced in their career to read

this article, because for me it really expressed exactly the

way I feel about the tension between statistics and machine

learning and the tension I feel between building explanatory

models and predictive models, and he does that in a very

compelling and neat way. So the article is called Statistical

Modeling: The Two Cultures by Leo Breiman, which was published

in Statistical Science in 2001. I find it the best read to

really get some grounding in data science.


Kirill: Wow. Wonderful. I haven’t heard of that article but

will definitely check it out. Sounds like a great read, and so

that’s by Leo Breiman. Will definitely include that in the show

notes. Next question I would have is if you could share with us

any recent wins in data science, wins that you’ve had in your

department at Udemy.

Ruben: Two of them come to mind. My team has been responsible for

building the new spam filter on Udemy, so that’s the filter

that determines whether a review is trustworthy or not, and

the old spam filter was built on a set of rules. So it was not

even a naïve Bayes, it was just a set of rules that totally made

sense at the time, but over time, you know, people learned how to

circumvent the rules. They learned the logic of the filter and

there was a lot of non-trustworthy, spammy content on

Udemy. My team tackled this problem and we were able to

improve the accuracy of the spam filter by a factor of 8X, so it’s a

big win for the team but mostly for the company as a whole.

That was one of them. The second one that comes to mind

was more of an ad hoc analysis that we did a couple of weeks

ago. So you are an instructor on Udemy. You know that we

changed the pricing strategy and so there was a resulting

change in student behavior, and my team looked at how final

price, list price, discount, all of these influence the

decision, and whether we could build a model of how students

would react to different pricing strategies. So it

was a very simple model (inaudible), nothing really

sophisticated or complex, and it had the ability to explain the

data that we had observed in the past. So for that reason it

was powerful, both because it had explanatory power and

also because it was built simple, so people could understand what it

meant.


Kirill: That’s very cool, that 8X accuracy spam filter. That’s a

significant improvement, and the model for behavior of different

content, that’s also very interesting. As you say, I’m an

instructor on Udemy, I can see the back end of these things

and how they run, and I can definitely see how

changes come into play and how the platform is growing, so

it’s very exciting to know that you’re behind all of those

changes. That’s really cool!

What’s your favorite thing about being a data scientist?

What’s that one thing that excites you to get up and go to

work in the morning and excites you to do your work and

what is that thing that drives you to keep going?

Ruben: I think the most exciting thing is the intellectual challenge.

It’s the fact that you are always facing new and unsolved

problems and you are the person that’s being asked to solve

those problems. And sometimes it’s just a data problem.

Sometimes it’s more complicated than that. Sometimes you

have to structure the problem and come up with the right

data questions, solve them and give the answers. That

constant intellectual challenge is really what drives me.

Kirill: And for our listeners, from your perspective, from what

you’ve seen, what you see currently in the field of data

science, how you’ve seen it evolve since you’ve joined the

ranks. Where do you think this field is going? What do you

think our listeners should prepare for in the future? What

should they focus on? What skills should they develop or

what techniques should they think about, or generally

where do you think this whole field is going?

Ruben: Yeah. That’s a good question. It’s hard to really predict where

this is going and to give a generalized view. I can mention a


few trends and I can also mention a few areas where I

think people can really make a difference and

add value. So right now there’s a lot of talk around data

platforms. So it’s not just that you have a data infrastructure,

you’re mining the data and you’re building some models to

draw some insights, but there’s also this idea that in order to

operate an efficient data team and an efficient analytics team you

need to have a better platform that enables data scientists to

deploy experiments, run experiments, quickly build and

validate models. So there’s that aspect of building the

pipelines and the workflows that enable people to scale

analytics, basically, so that’s one direction where things are

going, and obviously this also goes in the direction of a

little bit of specialization. It used to be that the engineer

who could build a database would also extract

data, run some statistical analysis and show the results to

the head of marketing, and now more and more you see roles

being a little bit more specialized. Now you have people who

specialize in data warehouses and data infrastructure,

you have people who specialize in data platforms, people who

specialize in algorithms, people who specialize in the

analytics part of data science. And so I think it’s important

to understand all of these roles, and

that’s definitely one of the trends in the industry. In parallel,

what I would recommend is, I think it’s very important

for people to develop technological expertise because that has

a lot of value. If you’re able to code in Python, in R, and

maybe even, you know, throw in a little bit of Java, and you

understand all of these technologies, it’s extremely powerful

but at the same time, I would warn people against becoming

too attached to certain statistical techniques or certain

machine learning techniques. There’s always gonna be people


who specialize in deep learning and recurrent neural

(inaudible). You know, unless you’re one of those people, you

don’t really need to go in that direction. You also don’t

necessarily need to learn all the different algorithms. I think

it’s more important to understand what the different

techniques are doing and really, deeply, have a deep

understanding of statistics and how you use different things

in different cases, and have the ability to learn and then apply

the right model or the right technique to the right problem.

So it’s more important to be able to map the right

technique to the right problem rather than know every

possible technique and algorithm that exists out

there.

Kirill: Very, very powerful advice there. I’ll just sum it up for all the

listeners and just for myself as well. You are observing a

trend that data science is becoming more mature as a type of

industry, type of work, and therefore some roles are becoming

more specialized. So I guess it’s a good idea for analysts

and aspiring data scientists to start to look out for what they

are most interested in and eventually end up doing

something that they are passionate about in this field, and

also that developing a deep technological expertise in a broad

range of tools and techniques is very important because you

don’t want to get stuck just using that one technique,

because this field is constantly evolving and you want to be

able to adapt and learn new skills on the fly, as they say. I

think that’s some very powerful advice and I know you’ve

recommended already a great article by the sounds of it by

the creator of random forests. Is there any book, one book

that you could recommend to our listeners, so that if they had

the time to pick something up and improve their data science


careers and skills, what would be the one book that you

think they should read?

Ruben: That’s a good question, ’cause I never actually learned data

science from a book. I learned it in school, I learned it by

doing or by looking at websites and forums. However,

there is one book that sort of influenced my thinking

around data analysis and really, you know, cemented my

ideas around causation, correlation, how you can torture

data to see certain things, and how that thing might be

wrong, and how you can look at the same data sets and how

you can come up with conclusions. And that book is actually

The Signal and the Noise by Nate Silver. It’s a popular book

and it’s a book that is obviously not very technical, but I think the

ideas in that book are extremely powerful, and again it’s this

idea that you know data science is not just a set of

techniques and tools, it’s also a way to think about the

problem and if you don’t think correctly about the problem, it

doesn’t matter that you have the techniques you’re gonna

end up with like wrong insights and conclusions. It is

important to have a deep understanding of what you are

trying to achieve, how you approach a problem, how

you look at it correctly, and then use the right

techniques and tools. So for that reason, I would recommend

The Signal and the Noise by Nate Silver.

Kirill: Lovely. Haven’t read the book myself. Definitely that’s going

onto my list of books that I’m picking up in the near future.

Signal and the Noise by Nate Silver.

That has been a pleasure, Ruben. For our listeners, where

can they find you? How can they contact you, follow you on

any social media, any websites, where can they get in touch

with you?


Ruben: I think the easiest is getting in touch with me on

LinkedIn. I’m working on a blog and a website. It’s not out

there yet but the link will definitely be on my LinkedIn

profile. So that’s one of the best ways.

Kirill: Definitely and I will also include the link in the show notes as

well. If there are any updates I will include those in the show

notes as well. Thank you very much, Ruben. Really

appreciate you coming on the show and sharing your

insights. I think that this has been a great, wonderful

and insightful conversation. Thank you so much for coming

along.

Ruben: Yeah. It was my pleasure. Thanks for inviting me.

Kirill: So there you go guys. That was Ruben Kogel. I hope you

enjoyed the show. You can get the show notes at

www.superdatascience.com/1. There you can also leave me

or Ruben a comment in the comment section at the very

bottom. Ask a question or tell us what you thought. Also if you

did enjoy the show, make sure to share it with your friends

and work colleagues so that you can help us spread the word

about the SuperDataScience podcast and I look forward to

seeing you next time. Until then, happy analyzing.


