Kirill: This is episode number one with ex-chemical engineer and
now data science wizard, Ruben Kogel.
Welcome to the SuperDataScience podcast. My name is Kirill
Eremenko, data science coach and lifestyle entrepreneur.
And each week, we bring inspiring people and ideas to help
you build your successful career in data science. Thanks for
being here today and now let’s make complex simple.
(background music plays)
Hello and welcome to the very first episode of the
SuperDataScience podcast. I can’t explain how excited I am
to finally get the show off the ground. I’ve had this idea for
literally months now and finally today we’re kicking things
off. And what is the show going to be all about?
The show is going to be about inviting the most inspiring data
scientists in the world and talking to them about what they
do, what their background is, what they’ve learned in their
past on the data science journey and what they can share,
what insights, what tools and methodologies they can share
with us and what we can learn together from them. So, I’m
very excited that you’re here from the very start that you’re
listening to the very first episode. Thank you so much for
being part of this journey. I’m sure that together we’re going
to learn a lot.
And I’m very glad that this very first episode, we kicked off on
a very high note, I spoke with Ruben Kogel who’s a data
scientist at Udemy. So if you’re not familiar with Udemy it’s
the biggest online educational platform in the world.
Currently there are over eleven million students learning
through Udemy so if you haven’t checked them out, definitely
check them out. They have courses on basically anything that
you could potentially imagine, and personally I’m an instructor
on Udemy as well. I have over twenty courses there and nearly
fifty thousand students. So it’s a great learning platform, and
Ruben Kogel is one of the head data scientists in one of the
divisions at Udemy, a division that works on content and
content marketing. Ruben shared some very powerful
insights about what he does on a daily basis at Udemy and,
moreover, how he transferred his chemical engineering
background into a data science skill set: how, in taking an
MBA, he specifically selected his subjects in such a way
that he was able to learn more about data science and get
into that field, and so made the jump from chemical
engineering to data science through his MBA.
He also talked about communicating insights and how
important it is in the data science role. Also we discussed lots
of other topics such as identifying problems when you’re a
data scientist. How important that is. How to combine data
science and product strategy. That is something that Ruben
does on a daily basis and you’ll learn more about that. And
that’s a very powerful skill to have, especially if you’re
working in the start-up space, whether the company is
in Silicon Valley or any other kind of
location where you’ve got this start-up culture. Then we also
spoke about managing a team of data scientists. Ruben
has had quite a bit of experience around that, so if
you’re a manager in data science or in the analytics space,
you can pick up some good tips from there.
And also, we talked about managing the inflow of requests,
and that is valuable to any data scientist: how to manage
the inflow of requests. Ruben gave us an example of
his approach with a Trello board. We talked about mentors in
data science and, of course, we talked about lots of different
analytics tools. We talked about (inaudible) SQL and an add-on
which I don’t personally know about. We talked about Wagon,
we talked about R versus Python, so the (inaudible) question of
which one is better, which is preferable. And we talked
about many, many other things in this podcast, so I’m sure
you’re going to enjoy it. We even touched on self-serve
analytics, a growing space in the field of data science. So
without further ado, I bring to you Ruben Kogel of
Udemy.com. Enjoy.
(background music)
Kirill: Hey guys, welcome to the Podcast. I’ve got Ruben Kogel here
from Udemy. Super excited about this very first episode.
Ruben, welcome! How are you going?
Ruben: Thank you! Thanks for having me over. I’m doing great.
Kirill: Awesome. It’s great to hear you and for those of you who
don’t know, I met Ruben for the first time when I was in San
Francisco a couple of months ago for the first Udemy live
conference. It was pretty exciting and he ran some great
presentations. But just so that everybody gets up to
speed, Ruben, tell us a bit about what you do. What’s
your title in the company and what exactly do you do in your
role?
Ruben: Yeah, sure. So, I’m the senior manager of analytics and
strategy at Udemy, and basically what I do is help the
content team, which is the team that looks at the courses and
the content on Udemy, figure out what is in the
catalogue, what’s in the selection of courses, what’s the
quality of the courses, and how we can improve our catalogue
to make our students happier. In practice, a lot of my work
has to do with: are we measuring the right thing, are we
measuring satisfaction, is that data available so that the
people who are in charge of bringing in courses know which
courses are good, and how do we also measure the selection
of courses so that we can slowly build a better and better
catalogue for our students.
Kirill: Awesome! That’s pretty cool so you apply Data Science
techniques in order to measure those kinds of metrics. Is
that correct?
Ruben: Totally. I mean, data science comes into different parts of
my job. There’s the more basic part, which is
instrumentation. Just like at the start of any data endeavor,
you wanna make sure you’re measuring the right things and
that the data is available. So for me it means: are we
measuring student satisfaction correctly, and is that data
available so that people in the company can access it and
make decisions on it? There’s another level at which I use
data science, when broader questions
come in, and that’s what I call ad-hoc analysis.
Someone might ask me what happens if we remove some of
the low-quality courses from the platform. Can we predict the
overall impact on satisfaction or revenue? That’s the kind of
question where you need to know the structure of the
problem but also come up with some predictions and use
some techniques to evaluate what the impact would be,
with maybe some confidence intervals. And there’s
another type of question that can be: well, we know that
there’s a variety of courses and they all have different quality
scores, and we wanna know what it is about these courses
that’s driving these quality scores. You know, is it the audio
quality, the video quality, the instructor’s delivery? In
that case, data science comes into play both in
terms of structuring the problem and in running some sort
of statistical analysis to extract the importance of the
different variables and come back with an answer saying,
“Well, I think this variable is the most important and, you
know, if you move that variable by like 1%, that will have an
impact on overall student ratings of this much.” So that’s
another example.
Kirill: That’s pretty cool. So, kind of two main types: one is
measuring existing metrics and how you can tweak them to
improve the experience of students, and the other is where
you’re doing kind of behavioral
analytics and predictive behavioral analytics to see how you
can change things so that the future experience is going to be
better. That’s pretty cool in my view, and it’s great that you
get to do both parts in your role as a data scientist at Udemy.
Ruben: Yeah, it’s very, very cool. Actually, what I really like about
my role here is that it’s really the interface between data
science and sort of strategy and product, because I get to work
on data sets and I get to do a lot of data analysis, but at the
end of the day the people I talk to are the VP of content or,
you know, the director of course acquisition, people who are,
you know, in the weeds. They are doing the business
and I get to give them recommendations. I get to influence,
you know, their decisions, or influence the product with data,
so it’s this really cool interface between data and business.
Kirill: Definitely, and that’s what I’ve also found: the most
interesting and the most impactful roles and careers happen
on the border of two fields. For example, you could
take physics and chemistry, or biology and chemistry, but
those are very exaggerated examples. Even in data science,
one thing is just to do the analytics, and a whole other thing
is to do the analytics and at the same time convey the findings
and work with the people that use the analytics. And just on
that as well, how do you find conveying such complicated
analytics to your stakeholders, like you said, the VP of
content? Are there any certain approaches that you use or any
tips you can give to our listeners on communicating these
findings to senior stakeholders in the company?
Ruben: You know, I’ll start by saying that any analysis that you do
is useless and irrelevant if you’re not able to communicate the
findings to people. So communication is a huge part of being
a successful data scientist. There are two
rules that I try to use when I communicate things. One is to
translate technical concepts into something that people can
understand, so you don’t just talk about confidence
(inaudible). You can talk about maybe your confidence in the
data, or you can say, “Well, I think, you know, the
predicted revenue will be between those two
bounds.” You don’t have to say you’re ninety-five percent
confident, because that doesn’t add a lot of value. But at the
same time you also wanna convey precision in your
communication and make sure that you don’t throw in
answers like, “Yeah, I think we should do A because...” I think
it’s important that you convey, like, “Well, I looked at the data
and if we do A, we can lift revenue by this amount, and
maybe there’s a little bit of an interval in terms of what I
think we can lift revenue by, but this is what the analysis is
showing.” So it’s this balance between being precise,
showing that you’ve done your homework and showing
that this is rooted in data, and at the same time translating
it into layman’s terms and stripping away any technicalities
that don’t add a lot of value to your message.
Kirill: Totally agree. I’ve got a classic example of that: how often
stakeholders, especially senior stakeholders, are very skeptical
about sample sizes. For instance, if you run some
analysis on a sample size of 157 or 300, they might be prone
to saying, “Actually, we want a sample size of 10,000,” but you
as a data scientist know that the sample you ran is large
enough for the result to be statistically significant. So it is
important to convey these findings in a way that, when you are
confident in them yourself, you don’t have to go into all of that
detail to explain exactly the methodology behind it, but
actually just convey the confidence in the way you present,
the way you position your analysis. So totally agree on that
one.
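To make the sample-size point concrete, here is a minimal sketch (not something discussed in the episode) that computes the approximate 95% margin of error for a surveyed proportion at the three sample sizes Kirill mentions. It assumes simple random sampling and the normal approximation; the function name margin_of_error and the worst-case proportion p = 0.5 are illustrative assumptions.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes simple random sampling and the normal approximation.
    p = 0.5 is the worst case (widest interval); z = 1.96 is the
    two-sided 95% critical value of the standard normal.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Compare the sample sizes mentioned in the conversation.
for n in (157, 300, 10_000):
    points = margin_of_error(n) * 100
    print(f"n={n:>6}: about +/- {points:.1f} percentage points")
```

Under these assumptions the margin shrinks with the square root of the sample size, so whether 157 responses are enough depends on how small an effect the stakeholder needs to detect.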
The interesting thing that I’d like to ask you, and probably a
lot of our listeners would be curious about, is: can you tell
us a bit about your background? How did you come into being
a data scientist? Of course, you’ve progressed further
in your career and now you’re the head of analytics in that
department, but originally how did you get into data science?
Because this is quite a new field and back in the day it wasn’t
being taught in university, so how did you get here and what
were the steps?
Ruben: It’s a very interesting question because I had a very
meandering path. I didn’t start in data at all. My background
was in applied physics, and then I switched to materials
science and I was a chemical engineer for many, many years.
I was dealing with data, but not the type of analysis that you
do when you are a data scientist. It was a lot rougher and a lot
less sophisticated. What happened is that I was
looking to transition to something different. I went to
business school, and you don’t typically think of a business
school as a place where you learn data science, but I had the
opportunity to take two statistics courses and one data
mining course, and I suddenly fell in love with the field.
The more I learned about it, the more I was
intrigued. So I really learned the theory in business
school, and then I had this opportunity to come work at
Udemy and apply some of my knowledge, and that’s how I
started my career. So it is pretty recent, and it was also
a pretty stark contrast from what I was doing before.
Kirill: Definitely. That’s a great jump and, like you say, one wouldn’t
expect that you would learn all the necessary data skills at
business school, but I guess you picked the right subjects,
and that’s a great testament to how lucrative the data science
field is. Tell us about the skills that you developed as a
chemical engineer or working in that field. Is there any way to
leverage them currently? Because data scientists come from all
different areas. Some people come from acting classes, some
people come from economics or finance. But coming from
a chemistry background, are there any skills or any particular
mindsets that you can share with us that you leverage
in your current work as a data scientist?
Ruben: Yeah, totally. Surprisingly, it is not the data skills that I
leverage from my chemical engineering background. It’s more
the problem-solving and communication skills. I think
any engineering work has a heavy component of
troubleshooting and problem solving, and I was doing a lot
of that in my job. It really forced me to come up with
a very systematic approach to breaking down a problem,
coming up with hypotheses, testing those hypotheses,
and being extremely systematic, organized and
structured in my thinking. So I definitely learned that
from my engineering background. The other thing that I
learned is, as I mentioned, communication skills. As an
engineer, especially in my position, I was doing a lot of
account management. I also had to translate a lot of very
complex technical experiments and results into something
that the business people could understand. The ability to
summarize complex concepts and notions into, you
know, whether it’s a slide or whether it’s an email, a
prepared draft that really sums up the insights is very
important. That’s something that I learned in my engineering
job and something that has served me well in my current
career.
Kirill: Sounds really cool, and definitely the problem-solving skills
are a very valuable thing to have when you’re dealing with
data science challenges. But from what you’re saying, I
gather that your communication skills have played a very
significant role in your success. My thinking is that a lot of
times when people are starting out in the field of data science,
now that it’s getting more and more popular, they can get
pigeonholed into certain roles where they are
performing some analytics but they don’t have the exposure
to go and share their insights with stakeholders. They are
just performing certain SQL queries or certain analytical
procedures, but they don’t actually have a chance to
communicate insights. So what would your suggestion be,
off the top of your head, for people or our listeners who
might be in that situation, to somehow start developing those
communication skills nevertheless?
Ruben: Yeah. I think there are two ways you can do that. One is
in your day-to-day interaction with customers. I
think of analytics or data science as, you know, having
customers inside the company. Whether they’re technical
customers or business customers, people ask you questions
or ask you to perform data analysis to give them an answer.
So whenever you interact with your customers, you can
always go the extra mile and structure your answer not
just as the output of a regression analysis or a SQL query;
you could try and explain why you think this is
the right thing to do, what that means in practice, what
your recommendation would be. So always push the
results past just the technical output and package them
in a way that shows that you thought about the implications
and the meaning of the analysis. So that’s one thing.
The second thing is, I also always recommend to people, and
that’s true of people who have worked for me or people who
might be working in other positions, that whenever you are
given a problem, it’s always good practice to try and dig into
exactly the problem the person is trying to solve. Because
oftentimes, someone will come with a request like, “Hey, can
you pull this data?” or “Hey, can you build a dashboard for
that?” or “Can you do an analysis around this?” But really, they
are trying to solve a problem, and they may not tell you what
problem they’re trying to solve. So if you engage with that
person or customer and try to really get to the bottom of what
problem they are trying to solve, all of a sudden you are
starting a conversation around what they are trying to solve,
what you can bring besides just the analysis, and how
you can help them frame their problem. So all of a sudden
you are engaged in communication, you are engaged in
expressing a problem, breaking it down into some
more elemental data problems and coming back with a
solution that addresses the real underlying problem. So
that’s one way in which you can push the boundaries of
your current job and really expand into delivering more
value that is built upon this communication and built upon
just understanding the underlying problem. So that’s
number one.
I think number two is that, generally speaking, analysts have
this unique ability to look into the data and come up with
some insights that no one else in the organization has. So
as an analyst, you also have the opportunity to maybe start
addressing problems that other people may have, or that
some people may not have thought about, and you can create
value by being a little proactive about what you think we
should be looking into, what you think would be a useful
analysis with some useful insights for other teams. So there’s
also that opportunity, because no one else can really look into
the data; you are the only person who has access to the data
and can also bring out the insights.
Kirill: Beautiful. Love it. Especially the “what exactly is your
problem?” part. That is one of the best skills to have as a data
scientist: helping people identify the problem, because they
often just come asking for data instead of actually identifying
the problem. Regardless of what level you are at as a data
scientist or data analyst, eventually that skill is the one that
will push you forward. And combining it with what you said
about the proactive approach, that’s where you become the
doctor for the organization: you walk around and you
measure what’s going wrong and how you can help fix it. So
yeah, definitely agree with those two. Those are some
very powerful skills to have. And you mentioned that you are
in charge of a few people. Can you tell us more about that?
How many people are you in charge of and how do you find
managing a team of analysts or data scientists? What
are the challenges that you face on a day-to-day basis?
Ruben: Right now my team is down to one person. In its
heyday, we had a bigger team, but you know, things move
quite quickly in Silicon Valley. So I have one person
reporting to me and I’m hiring at least one more at the
moment. Well, the thing that I find most challenging in terms of
managing and growing analysts or data scientists is that
oftentimes there’s this tension between trying to please your
internal customers, trying to make as many people happy
as possible in the shortest amount of time, and trying to
build long-term value and work on longer-term projects.
The way I think of a data scientist is that you’re at the end of
the chain. There are a lot of people in the organization who,
you know, manage projects. They ask other people to do things
and, you know, eventually when you put all the contributions
together you come up with a product. But when you are in
data science, you don’t ask other people to do things, or it’s
very rare. Usually, you’re the last person that they ask,
and so there are a lot of people coming to you and you end up
having a ton of requests. Managing the inflow of requests
and managing the different things you’re working on
can be very challenging. Oftentimes new analysts tend to
gravitate towards the more short-term projects.
They tell the last person who asked them to do something, “Oh
yeah, yeah, I’ll do it right away,” to the detriment of
working on the bigger, longer-term and more impactful
projects. So that’s one of the challenges.
Kirill: That’s definitely true, and that kind of flows into the art of
saying no to people who come to you as a data scientist,
especially when you’re running a department. When you
have a few successes, you’ll find even more of the other
departments in the company, people and stakeholders,
coming to you with requests. So how do you say no? How do
you tell people, “Hey, your project is really cool and I’d
love to work on it, but at the same time I’ve got other
commitments”? What are your tips around that?
Ruben: You know, you don’t really say no. The truth is (inaudible)
you make it very clear what your priorities are, and one of the
tools that we use here at Udemy is a Trello board. Trello is
a productivity tool; it’s like virtual post-its. Essentially it
enables you to show a public board of what you’re working
on, what the stages of the different projects are, and who the
stakeholders of the different projects are. If someone asks you
to do an analysis you can say, “OK, no problem, can you just
record it on my board?” and they quickly realize that their card
is, you know, one of 50 other cards and that there’s actually a
prioritization process before the team starts working on it. So
that’s one way in which we work.
The second part is that, really, you don’t wanna work on
each of the individual requests one at a time.
What you wanna build is a scalable infrastructure. You wanna
build scalable analytics. What that means is, instead of
answering the same question over and over again, or instead
of pulling data for everybody in the company, you build
self-serve tools. You build tables, dashboards, web UIs that
enable people to access the data that they need so that they
don’t ask you to pull the data and do the analysis
anymore. So you can free up some of your bandwidth to work
on the more interesting projects.
Kirill: Awesome. Love it. Self-serve analytics is becoming a
more and more popular concept in the world of data science,
specifically for that reason: to free up data scientists and to
empower the end users to do their own analysis. So just on that,
what are the tools that you use at Udemy, if you can disclose
this, for the self-serve analytics side of things?
Ruben: The basis of self-serve analytics is creating a set of
summary tables that have all the relevant information and
cover ninety percent of the use cases. What I mean by that is,
for example, (inaudible) questions that come up at Udemy are
like, “Oh, do you have a list of courses that have been published
this month?” or “What is the total revenue of courses that were
published last month?” or “Can you look up a particular
instructor and see how many enrolments or reviews she has
on her courses?” So there’s a limited set of questions that
come up over and over again, and instead of building queries
every time you want to extract the information, you
just build one or two tables that have all that
information summarized, so people can access the data
directly by either querying the table with SQL or looking
it up in a dashboard. In practice, what we do is we
have all our data flow into (inaudible) by
Amazon, and that lets you build tables on top of the raw
tables. So we have what we call summary tables built
upon the raw ones. They can also power dashboards built with
the Chartio UI, and people can use Chartio to extract
whatever information they need.
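The summary-table idea Ruben describes can be sketched roughly as follows. This is an illustration only: it uses Python's built-in sqlite3 module as a stand-in for the warehouse, and every table name, column name and data row (courses, enrolments, course_summary and so on) is hypothetical rather than Udemy's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the analytics warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical raw tables with made-up rows.
    CREATE TABLE courses (
        course_id INTEGER PRIMARY KEY,
        instructor TEXT,
        published_month TEXT
    );
    CREATE TABLE enrolments (course_id INTEGER, revenue REAL);

    INSERT INTO courses VALUES
        (1, 'alice', '2016-08'),
        (2, 'alice', '2016-09'),
        (3, 'bob',   '2016-09');
    INSERT INTO enrolments VALUES
        (1, 10.0), (1, 15.0), (2, 20.0), (3, 12.0), (3, 8.0), (3, 5.0);

    -- Summary table built once on top of the raw tables, one row per course.
    CREATE TABLE course_summary AS
    SELECT c.course_id,
           c.instructor,
           c.published_month,
           COUNT(e.course_id)          AS enrolment_count,
           COALESCE(SUM(e.revenue), 0) AS revenue
    FROM courses c
    LEFT JOIN enrolments e ON e.course_id = c.course_id
    GROUP BY c.course_id, c.instructor, c.published_month;
    """
)


def revenue_for_month(month):
    """Total revenue of courses published in the given month."""
    row = conn.execute(
        "SELECT SUM(revenue) FROM course_summary WHERE published_month = ?",
        (month,),
    ).fetchone()
    return row[0]


def enrolments_for_instructor(instructor):
    """Total enrolments across all of an instructor's courses."""
    row = conn.execute(
        "SELECT SUM(enrolment_count) FROM course_summary WHERE instructor = ?",
        (instructor,),
    ).fetchone()
    return row[0]


print(revenue_for_month("2016-09"))
print(enrolments_for_instructor("alice"))
```

The recurring questions from the conversation (revenue by publication month, enrolments per instructor) then become one-line lookups against a single table, which is what lets a dashboard or a non-specialist answer them without a custom query.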
Kirill: Okay, so Redshift and Chartio. Yep. Those are the two tools.
A very nice answer. You use Amazon’s AWS for storage of
data, correct?
Ruben: Uhuh!
Kirill: And how did you find that? Has that been a recent
transition for your organization or has that always been the
case?
Ruben: No, it’s been a transition. About a year and a half ago we
had only our own traditional MySQL server, and then we
started exploring Redshift. (inaudible) Redshift and we saw
that it was much more powerful in terms of doing analytics,
because a typical MySQL database is optimized for writing
but not, honestly, for creating new tables. Redshift is
optimized for doing a lot of joins, a lot of analysis and reading
data. So we’ve been using Redshift for the last year and
a half, and we’ve been scaling the size of our cluster as our
data grows and as our analytics team grows, and it’s been
serving us pretty well, actually.
Kirill: I’ve heard a lot of comments about that, that in AWS you can
scale with your needs, and that’s one of the biggest advantages
of AWS: as your organization grows and needs more
capacity and power, you can scale, so you don’t have to
purchase that in advance, and there are also month-to-month
or other types of plans there. Very convenient. Usually the
only setback that organizations commonly mention
is for organizations that deal with customer data, client-facing
data, such as Udemy. They would have
maybe certain regulatory issues with outsourcing the storage
of the data to the cloud or to external systems like AWS. Did
you face any of that when you were making the transition?
Ruben: Not really, in the sense that the data is still secure, and
actually in the United States the only data that is
heavily regulated is health data. So if you have that sort of
health regulation, it makes it more difficult to work with
the cloud, or you have to work with certified vendors that can
really handle the regulation about how you
handle health data. For the rest, AWS serves a large range of
start-ups and they all have sensitive data, but you know, they
are set up to handle the sensitive data correctly, so there’s not
much concern around that. The only thing is that sometimes
you want data to be used for web apps, in which case, you
know, reading from Redshift is not the best way to populate
fields in your website or web app, so you might need a second
representation of the data that is faster to read than one
based on columnar storage.
Kirill: So we’ve started delving into the tools. We started with
management tools like Trello, which I found pretty
interesting: how you get somebody to post a post-it in Trello
on your board and then they realize, “Hey, there are
projects that are going to be prioritized and at some point mine
might not be done very quickly,” and now we’ve moved on to
Redshift. What are the other tools that you use on a day-to-day
basis in your analytics role?
Ruben: The two technologies that I use, you know, all the time are
SQL, or in this case PostgreSQL, which is the basis for Redshift,
and R. I do much of my analysis in R. Basically that’s the
language that I learned and I am very familiar with it. Some
other people at Udemy use Python. It really varies. But I
think both tools are very convenient and flexible,
and those are the tools of the data scientist nowadays.
Both Python and R have packages, they’re
open source, and each has a very active community, so those
are typical for data analysis. In terms of the tool that I
use where I actually write my SQL, there’s a really neat
company called Wagon. They have a beta product, but
right now they have the best SQL editor, especially for post
(inaudible) SQL. It’s really neat because you can organize
your queries into different folders, everything is always saved
in the cloud so nothing is lost, and it has a neat
interface and it’s very easy to use.
Kirill: Awesome, so that was Wagon?
Ruben: Wagon, yeah.
Kirill: How do you spell that?
Ruben: W-A-G-O-N
Kirill: Beautiful. I haven’t heard of that one before. Something to
definitely check out. And yeah, very interesting how your
organization has a split between R and Python. I guess in the
start-up world it’s more common, but in larger
organizations that have been around for a while and have a lot
of legacy behind them, analysts usually don’t have that luxury
of being able to choose between the two. R is definitely
something that a lot of our listeners are interested in. Could
you maybe share a couple of packages or a couple of
techniques that you most commonly use when you’re coding
in R?
Ruben: Yeah, totally. I’m probably gonna disappoint you and your
listeners. I don’t use a lot of different packages and
advanced techniques. I tend to stick to basic R. In fact, I
don’t even run it like others do. I just use basic R. I don’t
use ggplot; I just use the basic R graphics. I think
part of the reason is I don’t see R as a way to produce
extremely sophisticated reports or analysis. I use R mostly to
extract the information that I need and run
the basic analysis. So the typical thing that I would do in R is
just import a CSV file, read it, and do a little
cleaning, although I usually prefer to do my cleaning in SQL
because I don’t think you should be doing the cleaning in R at
all. I think the cleaning should really occur upstream and you
only launch R to do some data exploration, you know, create
some graphs and do some statistical analysis. Typically, I
would run regression analysis; I do lots of
regression. I use (inaudible) to fit a random forest model to
understand the relative importance of the different features
in your model. So that’s kinda my usage of R.
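As a rough illustration of the "which variable matters most" question, the sketch below fits an ordinary least-squares regression on standardized variables and compares the resulting coefficients. It is not Ruben's workflow: he works in R, whereas this uses only the Python standard library, and it uses standardized regression coefficients rather than a tree-based importance measure. The data are synthetic, and the feature names (audio, video, delivery) and the effect sizes used to generate the ratings are assumptions made up for the example.

```python
import random
import statistics


def standardize(values):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def standardized_coefficients(features, target):
    """OLS on standardized columns via the normal equations X'X b = X'y.

    Because every column is centered, no intercept term is needed.
    """
    names = list(features)
    cols = [standardize(features[name]) for name in names]
    y = standardize(target)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return dict(zip(names, solve(xtx, xty)))


# Synthetic course data: delivery is given the largest true effect.
random.seed(0)
n = 500
audio = [random.gauss(0, 1) for _ in range(n)]
video = [random.gauss(0, 1) for _ in range(n)]
delivery = [random.gauss(0, 1) for _ in range(n)]
rating = [
    0.3 * a + 0.1 * v + 0.6 * d + random.gauss(0, 0.3)
    for a, v, d in zip(audio, video, delivery)
]

coefs = standardized_coefficients(
    {"audio": audio, "video": video, "delivery": delivery}, rating
)
for name, value in sorted(coefs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>8}: {value:+.2f}")
```

Each coefficient reads as "a one standard deviation change in this feature moves the rating by this many standard deviations", which is the kind of statement that can be translated for a business audience. With correlated features or non-linear effects this simple measure can mislead, which is one reason to prefer a tree-based importance measure.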
Kirill: Wonderful. I totally agree with that. R is definitely a very
powerful tool and every analyst, every data scientist has the
right to use it in the way they prefer. You’re kind of using R
with a very lean approach, and that can be very powerful as
well. The question, the million-dollar question, would be R
versus Python. What are your thoughts and why did you end up
picking R?
Ruben: My answer might disappoint you again. I never really
chose between the two. I started with R and it fits all my
needs. I spoke to a couple of people who use Python and I
quickly realized that for the type of analysis that I was doing,
you know, offline analysis, building offline models, trying to
understand the drivers of a particular metric, Python would
be more complex and would not add a lot of value. What I
mean by that is, in my day-to-day job, I don’t build actual, you
know, data products. I don’t try to deploy a predictive model
onto our infrastructure, and therefore I don’t need it to be in
Python. All I do is download some data, generally from
Redshift, and try to build some model to understand what’s
driving this metric up or down. For this type of use case,
from what I’ve read, R is the simplest and most direct way
of analyzing the data, and it’s also the language that I
know.
Kirill: Beautiful. Definitely, that is also the case in many
situations: you start with one and then you just keep going with
that one. If it suits your needs, what’s the point of changing?
Totally appreciate that. Alright, so that’s really cool, and we
went into some detail on the tools you use and some
techniques. I love that part of the conversation. Let’s move on
to some of the softer stuff. For instance, I
can see that you’ve made this transition to data
science from chemistry and you’ve never looked back. You’ve
been powering through it and progressing in your career,
growing other data scientists on your team and acting as a
mentor. But along the way, did you have any influences that
helped you persevere in this data science career?
Maybe some mentors that you might have had, or
hobbies, or some life-changing events, or even some articles,
something that has really influenced you and helped you
along this career path as a data scientist?
Ruben: There was a big learning curve for me. I had never worked as
a data scientist, and in fact I didn’t really have a mentor
at Udemy, so I had to figure out a lot of things by myself or
ask people outside. So I really encourage people to seek out
mentors. In my case I had a good friend of mine who had a bit
of a head start on me. He started doing data science five
years before me. He is someone who has strong opinions, but
he’s also very thoughtful about the different approaches and
choices that he was making, and I often, you know, had
conversations with him. We had a regular coffee where we
exchanged ideas on technologies and techniques. I found this
interaction with him very helpful. I went to different
conferences. I would say all of those conferences were
helpful, but the one that I really liked was the Airbnb one
in San Francisco, the conference called OpenAir, which I
really appreciated. I also followed some blogs and
newsletters. There’s a newsletter that I really like called
datascienceweekly.org, and it’s a collection of about 10-12
interesting articles every week. I don’t read all of them. I
just pick the ones that seem interesting, and, you know, if I
get past the first paragraph and it’s really interesting,
then I read it. Then slowly I built a catalogue of tips and
thoughts that I think are very useful. In particular, I can’t
remember if it was through this newsletter or maybe something
like LinkedIn online, I found someone posting an old article
from Leo Breiman, the guy who created random forests, that he
published in 2001, which really resonated with me. So I would
really encourage anyone who is considering data science or
starting in data science, or even someone who’s more advanced
in their career, to read this article, because for me it
really expressed exactly the way I feel about the tension
between statistics and machine learning, and the tension I
feel between building explanatory models and predictive
models, and he does that in a very compelling and neat way.
So the article is called Statistical Modeling: The Two
Cultures by Leo Breiman, which was published in Statistical
Science in 2001. I find it the best thing to read to really
get some grounding in data science.
Kirill: Wow. Wonderful. I haven’t heard of that article but will
definitely check it out. Sounds like a great read, and so
that’s by Leo Breiman. Will definitely include that in the
show notes. The next question I would have is if you could
share with us any recent wins in data science, wins that
you’ve had in your department at Udemy.
Ruben: Two of them come to mind. My team has been responsible for
building the new spam filter on Udemy, so that’s the filter
that determines whether a review is trustworthy or not. The
old spam filter was built on a set of rules. So it was not
even a naïve Bayes, it was just a set of rules that totally
made sense at the time, but over time, you know, people
learned how to circumvent the rules. They learned the logic
of the filter, and there was a lot of non-trustworthy, spammy
content on Udemy. My team tackled this problem and we were
able to improve the accuracy of the spam filter by a factor
of 8, so it’s a big win for the team but mostly for the
company as a whole. That was one of them. The second one that
comes to mind was more an ad hoc analysis that we did a
couple of weeks ago. You are an instructor on Udemy, so you
know that we changed the pricing strategy, and so there was a
resulting change in student behavior. My team looked at final
price, list price, discount, how all of these influence the
decision, and whether we could build a model of how students
would react to different pricing strategies. It was a very
simple model, (inaudible), nothing really sophisticated or
complex, and it had the ability to explain the data that we
had observed in the past. For that reason it was powerful,
both because it had explanatory power and because it was
built simple so people could understand what it meant.
Kirill: That’s very cool, that 8X accuracy improvement in the spam
filter. That’s a significant improvement, and the model for
behavior of different content, that’s also very interesting.
As you say, I’m an instructor on Udemy, so I can see the back
end of these things and how they run, and I can definitely
see how changes come into play and how the platform is
growing, so it’s very exciting to know that you’re behind all
of those changes. That’s really cool!
What’s your favorite thing about being a data scientist?
What’s that one thing that excites you to get up and go to
work in the morning and excites you to do your work and
what is that thing that drives you to keep going?
Ruben: I think the most exciting thing is the intellectual challenge.
It’s the fact that you are always facing new and unsolved
problems and you are the person that’s being asked to solve
those problems. And sometimes it’s just a data problem.
Sometimes it’s more complicated than that. Sometimes you
have to structure the problem and come up with the right
data questions, solve them and give answers. That constant
intellectual challenge is really what drives me.
Kirill: And for our listeners, from your perspective, from what
you’ve seen, what you see currently in the field of data
science, and how you’ve seen it evolve since you joined the
ranks, where do you think this field is going? What do you
think our listeners should prepare for in the future? What
should they focus on? What skills should they develop, what
techniques should they think about, or generally, where do
you think this whole field is going?
Ruben: Yeah. That’s a good question. It’s hard to really predict
where this is going in a generalized view. I can mention a
few trends, and I can also mention a few areas where I think
people can really make a difference and add value. Right now
there’s a lot of talk around data platforms. So it’s not just
that you have a data infrastructure, you’re mining the data
and you’re building some models to get some insights, but
there’s also this idea that in order to operate an efficient
data team, an efficient analytics team, you need to have a
better platform that enables data scientists to deploy
experiments, run experiments, and quickly build and validate
models. So there’s that aspect of building the pipelines and
the workflows that enable people to scale analytics,
basically. That’s one direction where things are going, and
obviously this also goes in the direction of a little bit of
specialization. It used to be that the engineer who could
build a database would also extract data, run some
statistical analysis and show the results to the head of
marketing. Now, more and more, those roles are becoming a
little bit more specialized. You have people who specialize
in data warehousing and data infrastructure, people who
specialize in data platforms, people who specialize in the
algorithms, and people who specialize in the analytics part
of data science. So I think it’s important to understand all
of these roles, and that’s definitely one of the trends in
the industry. In parallel, what I would recommend is, I think
it’s very important for people to develop technological
expertise, because that has a lot of value. If you’re able to
code in Python, in R, and maybe even, you know, throw in a
little bit of Java, and you understand all of these
technologies, it’s extremely powerful. But at the same time,
I would warn people against becoming too attached to certain
statistical techniques or certain machine learning
techniques. There’s always gonna be people who specialize in
deep learning and recurrent neural (inaudible), and, you
know, unless you’re one of those people, you don’t really
need to go in that direction. You also don’t necessarily need
to learn all the different algorithms. I think it’s more
important to understand what the different techniques are
doing, to have a really deep understanding of statistics and
how you use different things in different cases, and to have
the ability to learn and then apply the right model or the
right technique to the right problem. So it’s more important
to be able to map the right technique to the right problem
than to know every possible technique and algorithm that
exists.
Kirill: Very, very powerful advice there. I’ll just sum it up for all
the listeners and for myself as well. You are observing a
trend that data science is becoming more mature as a type of
industry, a type of work, and therefore some roles are
becoming more specialized. So I guess it’s a good idea for
analysts and aspiring data scientists to start to look out
for what they are most interested in and eventually end up
doing something that they are passionate about in this field.
And also, developing deep technological expertise in a broad
range of tools and techniques is very important, because you
don’t want to get stuck just using that one technique. This
field is constantly evolving and you want to be able to adapt
and learn new skills on the fly, as they say. I think that’s
some very powerful advice. I know you’ve already recommended
a great article, by the sounds of it, by the creator of
random forests. Is there one book that you could recommend to
our listeners, if they had the time to pick something up and
improve their data science careers and skills? What would be
the one book that you think they should read?
Ruben: That’s a good question, because I never actually learned data
science from a book. I learned it in school, I learned it by
doing or by looking at websites and forums. However, there is
one book that sort of influenced my thinking around data
analysis and really, you know, cemented my ideas around
causation, correlation, how you can torture data to see
certain things, how that thing might be wrong, and how you
can look at the same data sets and come up with conclusions.
And that book is actually The Signal and the Noise by Nate
Silver. It’s a popular book and it’s a book that is obviously
not very technical, but I think the ideas in that book are
extremely powerful. Again, it’s this idea that, you know,
data science is not just a set of techniques and tools, it’s
also a way to think about the problem, and if you don’t think
correctly about the problem, it doesn’t matter that you have
the techniques, you’re gonna end up with wrong insights and
conclusions. It is important to have a deep understanding of
what you are trying to achieve, how you approach a problem,
how you look at it correctly, and then use the right
techniques and tools. So for that reason, I would recommend
The Signal and the Noise by Nate Silver.
Kirill: Lovely. I haven’t read the book myself. That’s definitely
going onto my list of books that I’m picking up in the near
future. The Signal and the Noise by Nate Silver.
That has been a pleasure, Ruben. For our listeners, where
can they find you? How can they contact you or follow you on
social media or any websites? Where can they get in touch
with you?
Ruben: I think the easiest is getting in touch with me on
LinkedIn. I’m working on a blog and a website. It’s not out
there yet but the link will definitely be on my LinkedIn
profile. So that’s one of the best ways.
Kirill: Definitely, and I will also include the link in the show
notes. If there are any updates I will include those in the
show notes as well. Thank you very much, Ruben. Really
appreciate you coming on the show and sharing your
insights. I think that this has been a great day. Wonderful
and insightful conversation. Thank you so much for coming
along.
Ruben: Yeah. It was my pleasure. Thanks for inviting me.
Kirill: So there you go guys. That was Ruben Kogel. I hope you
enjoyed the show. You can get the show notes at
www.superdatascience.com/1. There you can also leave me
or Ruben a comment in the comment section at the very
bottom. Ask a question or tell us what you thought. Also if you
did enjoy the show, make sure to share it with your friends
and work colleagues so that you can help us spread the word
about the SuperDataScience podcast and I look forward to
seeing you next time. Until then, happy analyzing.