Data Science Accelerator Program

GoDataDriven, proudly part of the Xebia Group

How we teach

Every session we teach is interactive. Each month we give one interactive lecture and one hackersession, and both require students to code. The main distinction is that the lectures focus more on theory, whereas the hackersessions focus more on getting your hands dirty with code. The hackersessions are meant to be fun and engaging while giving the students a lot of freedom. We often notice that the hackersessions end up being the most educational part.

Lecture 1: RStudio stack

The first session is meant to reintroduce programming by introducing students to the RStudio stack. It immediately teaches them the new dplyr syntax and introduces them to a proper work environment. The session concludes with a visit to GitHub, where we show students how to create shareable documents.

- intro to programming
- explain the concept of a dataframe
- R ggplot
- R dplyr
- rmarkdown
- ChickWeight
- git

hackersession: webscraping with R

- rvest + html
- type casting
- ggplot
- dplyr

Use R to scrape some video game websites and figure out which Heroes of the Storm character makes the most sense. When this is done, we scrape Funda and task people with finding the best house in their home town.

Lecture 2: Simulation

The second session focuses on a review of statistics and probability. We keep the math light and invest much time in simulation exercises.

• explain basic probability distributions (normal)
• what is wrong with the mean of a distribution?
• what is a correlation?
• PCA/covariance
• explain simulation cases (birthday problem, casino)
• explain Bayesian thinking
• basic hypothesis testing
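To give a feeling for what a simulation exercise can look like, here is a minimal sketch of the birthday problem in Python. The function names, the number of trials and the assumption of 365 equally likely birthdays are our own choices for illustration, not part of the course material.

```python
import random


def birthday_collision_probability(group_size, trials=20_000, seed=0):
    """Estimate the chance that at least two people in a group share a birthday."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        birthdays = [rng.randrange(365) for _ in range(group_size)]
        # A shared birthday exists when the set of days is smaller than the group.
        if len(set(birthdays)) < group_size:
            hits += 1
    return hits / trials


def exact_probability(group_size):
    """Closed-form answer, used here only to check the simulation."""
    p_all_distinct = 1.0
    for k in range(group_size):
        p_all_distinct *= (365 - k) / 365
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for n in (10, 23, 50):
        simulated = birthday_collision_probability(n)
        print(n, round(simulated, 3), round(exact_probability(n), 3))
```

Comparing the simulated estimate with the closed-form value is a quick way to see how many trials a simulation needs before it can be trusted.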

hackersession: bandit problem

The students get an online and an offline version of the bandit problem, and we turn it into a game. We give the students keys and ask them to figure out which banner is best.
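One possible strategy for the banner game is epsilon-greedy. The sketch below is only an illustration under our own assumptions: the click rates are made up, the banners are simulated locally instead of being served through keys, and students are free to pick a completely different approach.

```python
import random


def epsilon_greedy(click_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Show banners one at a time and learn which one gets clicked most.

    click_rates holds the true (hidden) click probability of each banner;
    the values are invented for this illustration.
    """
    rng = random.Random(seed)
    shows = [0] * len(click_rates)
    clicks = [0] * len(click_rates)
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick a random banner (also done until every banner was shown once).
            arm = rng.randrange(len(click_rates))
        else:
            # Exploit: pick the banner with the best observed click rate so far.
            arm = max(range(len(click_rates)), key=lambda i: clicks[i] / shows[i])
        shows[arm] += 1
        if rng.random() < click_rates[arm]:
            clicks[arm] += 1
    return shows, clicks


if __name__ == "__main__":
    shows, clicks = epsilon_greedy([0.05, 0.10, 0.30])
    for banner, (s, c) in enumerate(zip(shows, clicks)):
        print(f"banner {banner}: shown {s} times, clicked {c} times")
```

The parameter `epsilon` controls the trade-off: a higher value spends more rounds exploring, a lower value commits earlier to the banner that currently looks best.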

Lecture 3: IPython stack

During this session we explore the Python stack. Python is an all-purpose language that does more than just data science. We first cover how to write concise, elegant code before we delve further into how to do data science with it.

• command line
• notebooks
• python
• jupyter
• pandas + numpy
• matplotlib

hackersession: build a flask app with pandas

The idea is to build a website that you can query, with pandas as the backend.

Lecture 4: Linear Models

In this session students are exposed to the theory behind classical linear models as well as more modern machine learning models. The focus is on understanding how these models work and getting a feeling for when to use which model. We show how to run all the models in both R and Python so people understand that you can work independently of the language.

• reminder of statistics + assumptions

• how to measure models: training + test

• linear regression

• logistic regression

• tree models
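As a small illustration of measuring a model on a training and a test set, the sketch below fits a one-feature linear regression in plain Python. The synthetic data, the 75/25 split and all function names are assumptions made for this example; in the lecture the same idea would normally be shown with the usual R or Python libraries.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared difference between predictions and observed values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


# Synthetic data: y = 2 + 3x plus Gaussian noise (invented for the example).
rng = random.Random(42)
data = [(i / 10, 2 + 3 * (i / 10) + rng.gauss(0, 1)) for i in range(200)]

train, test = train_test_split(data)
intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
test_mse = mean_squared_error([x for x, _ in test], [y for _, y in test], intercept, slope)
print(f"intercept={intercept:.2f} slope={slope:.2f} test MSE={test_mse:.2f}")
```

The model only ever sees the training rows; the error on the held-out rows is what tells us how well it generalises.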

hackersession: automation in regression + codereview

Build your own automation script that applies many algorithms to many datasets and benchmarks them through brute force. You can use a tool that automates this, like caret, or just build it yourself. The goal is to find the most robust algorithm.

Lecture 5: Optimisation Science

Any self-respecting course on data science should spend at least a day on the science of optimisation. Operational research is the backbone of many machine learning algorithms as well as a tool to automate decision making. Today is the day we talk about decision engineering.

• the maths behind optimisation

• closed form optimisation

• gradient descent

• linear programming

• stochastic gradient descent

• heuristic/genetic approach
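To make the gradient descent topic concrete, here is a minimal sketch in Python. The objective function, the learning rate and the number of steps are chosen purely for illustration; they are not taken from the course material.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=200):
    """Repeatedly step against the gradient, starting from `start`."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


# Example objective (an assumption for this sketch):
# f(x, y) = (x - 3)^2 + 2 * (y + 1)^2, which has its minimum at (3, -1).
def grad_f(point):
    x, y = point
    return [2 * (x - 3), 4 * (y + 1)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print([round(v, 4) for v in minimum])
```

For this quadratic a closed-form solution exists as well, which makes it a convenient case for checking that the iterative method lands in the same place.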

hackersession: TSP

We'll play a game: whoever gets the best travelling salesman solution wins. Pick any tool you like and solve the problem within a day. We give multiple instances of TSP so that students learn that not every algorithm works all the time.
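One possible starting point for the game is a greedy tour that is then polished with 2-opt. The sketch below is just one heuristic among many, written under our own assumptions: cities are random points in the unit square and distances are Euclidean.

```python
import math
import random


def tour_length(tour, cities):
    """Total length of the closed tour that visits the cities in the given order."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))


def nearest_neighbour(cities, start=0):
    """Greedy construction: always travel to the closest unvisited city."""
    unvisited = set(range(len(cities))) - {start}
    tour = [start]
    while unvisited:
        here = cities[tour[-1]]
        nxt = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def two_opt(tour, cities):
    """Local search: reverse a segment whenever that shortens the tour."""
    best = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, cities) < tour_length(best, cities) - 1e-12:
                    best, improved = candidate, True
    return best


# Random instance invented for the example.
rng = random.Random(1)
cities = [(rng.random(), rng.random()) for _ in range(30)]
greedy = nearest_neighbour(cities)
polished = two_opt(greedy, cities)
print(round(tour_length(greedy, cities), 3), round(tour_length(polished, cities), 3))
```

2-opt only finds a local optimum, which is exactly why different instances can favour different algorithms.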

Lecture 6: Non-Linear Models

Although very popular, linear models often fail. The main reason is that not every dataset is linearly separable, and in this lecture we delve deeply into this.

• linear separability problem
• support vector machines
• factorization machines
• neural network
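The linear separability problem can be seen with a tiny experiment: a perceptron, which draws a single straight decision boundary, learns AND perfectly but cannot learn XOR. The sketch below is our own illustration in Python; the training loop and its parameters are assumptions, not course code.

```python
def train_perceptron(samples, epochs=100, learning_rate=1):
    """Classic perceptron rule on 2-D inputs with labels 0 or 1."""
    w0, w1, bias = 0, 0, 0
    for _ in range(epochs):
        for (x0, x1), label in samples:
            predicted = 1 if w0 * x0 + w1 * x1 + bias > 0 else 0
            error = label - predicted
            # Only misclassified samples (error != 0) move the boundary.
            w0 += learning_rate * error * x0
            w1 += learning_rate * error * x1
            bias += learning_rate * error
    return w0, w1, bias


def accuracy(samples, weights):
    """Fraction of samples on the correct side of the learned line."""
    w0, w1, bias = weights
    correct = sum((1 if w0 * x0 + w1 * x1 + bias > 0 else 0) == label
                  for (x0, x1), label in samples)
    return correct / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND accuracy:", accuracy(AND, train_perceptron(AND)))
print("XOR accuracy:", accuracy(XOR, train_perceptron(XOR)))
```

No straight line separates the two XOR classes, so no amount of extra training helps; the model itself has to become non-linear.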

hackersession: automation in regression + codereview

Build your own automation script that applies many algorithms to many datasets and benchmarks them through brute force. You can use a tool that automates this, like caret, or just build it yourself. The goal is to find the most robust algorithm. At the end of this

Lecture 7: Clustering & Ensemble

In this session we discuss two distinct but important methods in machine learning: clustering and ensemble models. Clustering groups unlabelled data so that we can predict phenomena without having labels. Ensemble models combine multiple models to create a better one. A random forest is an example of an ensemble model, but it is easy to create your own.

• hierarchical clustering

• kmeans clustering

• HMM clustering

• normalisation

• ensemble theory

• random forests
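As an illustration of k-means clustering, here is a minimal sketch of Lloyd's algorithm in plain Python. The two Gaussian blobs, the random initialisation and the function name are assumptions made for this example; in practice a library implementation would be used.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Two made-up blobs around (0, 0) and (10, 10).
rng = random.Random(3)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
          + [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)])
centroids, clusters = kmeans(points, k=2)
print([tuple(round(c, 2) for c in centroid) for centroid in centroids])
```

Because the distance is computed on raw coordinates, features on very different scales would dominate the result, which is where normalisation comes in.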

hackersession: outperforming ensembles

We will repeat the exercise on the basic algorithms dataset, but now it is your job to build an ensemble that outperforms a random forest. More difficult datasets will also be handed out; the goal is to have the students realise that just looking at the data is equally important.

Lecture 8: Natural Language

During this session the students learn the basics of text mining and NLP algorithms. Text is a very different data structure, with many use cases that differ from what we usually see. The end goal of this session is to explain how to make a language detector with basic techniques.

• cleaning text

• regular expressions

• nltk

• tf/idf

• bayesian filter

• word2vec

• clustering documents
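To show what tf/idf does, here is a minimal sketch in plain Python. It uses one common variant (term frequency times log(N / document frequency)); other weighting schemes exist, and the example sentences and the whitespace tokenisation are our own simplifications.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(documents)
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents invented for the example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
for doc, scores in zip(docs, tf_idf(docs)):
    top = max(scores, key=scores.get)
    print(f"{doc!r}: most distinctive term = {top}")
```

A word that occurs in every document, such as "the" here, gets weight zero, while words that are unique to one document stand out.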

hackersession: Markov Generator of Lyrics

We will scrape different websites containing song lyrics and then try to train a Markov chain to create random sentences. We will also have other texts available for the students to play with.
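A first-order Markov chain over words can be sketched in a few lines of Python. The placeholder text below is invented for this example (the session itself uses scraped lyrics), and the function names are our own.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=12, seed=0):
    """Walk the chain from `start`, sampling each next word from the observed followers."""
    rng = random.Random(seed)
    sentence = [start]
    # Stop at the length limit or when the last word has no known follower.
    while len(sentence) < max_words and chain.get(sentence[-1]):
        sentence.append(rng.choice(chain[sentence[-1]]))
    return " ".join(sentence)


# Placeholder text; not real lyrics.
lyrics = "we sing all night and we dance all day and we sing all day"
chain = build_chain(lyrics)
print(generate(chain, "we"))
```

Because followers are stored with repetition, a word that often follows another is also sampled more often, so no explicit probabilities are needed.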

Lecture 9: Time series

The problem of prediction in data science becomes different if time is involved. During this session we discuss how to benefit from taking a time series approach and what the common methods are, and we try to create models that can change over time so that they fit a real-time setting.

• lag variables

• log transform

• autocorrelation

• moving average (window models)

• moving variance

• arma/arima

• real time algorithms
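Two of the topics above, the moving average and autocorrelation, fit in a short Python sketch. The toy series and the function names are assumptions for illustration only.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def autocorrelation(series, lag):
    """Sample autocorrelation: how strongly the series resembles itself `lag` steps back."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum((series[i] - mean) * (series[i - lag] - mean) for i in range(lag, n))
    return covariance / variance


# Toy series with a period of 4 (values invented for the example).
series = [1, 3, 5, 3] * 6
print(moving_average(series, 4)[:3])
print(round(autocorrelation(series, 4), 3), round(autocorrelation(series, 2), 3))
```

A window equal to the period smooths the seasonal pattern away completely, while the autocorrelation is strongly positive at the period and negative at half the period.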

hackersession: stock prediction

We're going to play a game: whoever predicts the stock market best wins. Use whatever method you like, but you can only pick a portfolio of at most 3 stocks.

Lecture 10: Visualisation

Being able to communicate data clearly is important for a data scientist. The goal of this session is to explain what makes good visualisations informative and bad visualisations feel like clutter. We will also discuss how to make interactive visualisations with d3 and how to connect them to your own backend.

• review of ggplot2 + Tufte theory
• interactivity with Shiny
• basics of front end webdev
• d3
• connecting frontend + backend

hackersession: building a custom dashboard

We will give you four interesting datasets and leave it up to you to turn a dataset into an interesting app.

Lecture 11: Making things scale

So far we have only discussed how to handle files that fit on one computer. In this session we discuss Hadoop and Spark and how to use them to handle big datasets. We focus more on the Spark API because it is more relevant for data scientists, and we spend a significant amount of time explaining when to treat something as a big data problem and when you want to avoid it.

• when to refer to big data

• bootstrapping techniques

• hadoop ecosystem/tools (briefly)

• big data spark syntax

hackersession

We let the students datamine a large dataset on a cluster that they will have

• provisioning

• R syntax

• python syntax

Lecture 12: The group determines

There are many possible advanced topics that could be discussed, but we would like to leave the subject of the last lecture open. Preferably the students will reach consensus on a new technology (there will surely be one by then). Otherwise, one of the following subjects can be chosen:

advanced topics

• feature creation

• computer vision

• bayesian graphical models

• neo4j vs sql vs nosql

• deep learning

• ethical considerations

• legal considerations

• julia

hackersession

For the last week, each student can work on any project. We are there to help them with anything.

End Goal

After the course the following tasks should be a no-brainer for students:

• get basic insights out of a .csv within a day, even if it is a dirty dataset, with either Python or R. This includes things like outlier detection and type casting

• when given a clean dataset, be able to run three different algorithms in a day with train/test for regression, clustering or classification

• recognise when a dataset is too big to handle

• be able to estimate when a project will take a week (shiny app) vs a few months (django app)

• be able to attend a PyData conference and understand half the talks to the degree that they can summarize them in a few sentences

• turn a .csv file into a dashboard as a microservice with an API within a day
