A data scientist's study plan

Having fun with stats, maths and games in life!

Adjunct, MoT and CS&E Department

Tandon School of Engineering

New York University

1/1/2016

Raman Kannan

A Study Plan to become a practicing data scientist!

Outline for having fun with stats, maths and games in life


Contents

Introduction

Basics: Khan Academy

why now, perfect storm

advances for computing hardware, networking, tools for communication

introduction to data

sample/population

iid

bias

Relationship

univariate regression

multivariate

logistic regression

Linear Algebra

matrices: identity, square, rectangular, symmetric

operations: transpose, inversion, decomposition

roots, positive definiteness, eigenvalues

Cholesky, principal components, singular value decomposition

Applications

analytics

descriptive

predictive

prescriptive

learning and intelligence need big data 3V

dimensionality reduction

unsupervised learning

clustering

supervised

classification

measures of classification: TP, TN, FP, FN, accuracy, precision, sensitivity

semi-supervised, hybrid

network, hidden, feedback, self-correcting

deep learning, Boltzmann Machine, Markov Chain

Information Retrieval: Entropy, Gain

Introduction

To paraphrase Einstein, the problem of the "qualified labor" shortfall cannot be solved if we continue with the same mentality that created it. We need to be disruptive. There is no need for a university or college degree or any formal structure. Mathematics and analytics are a universal, basic language, and anyone (returning veterans, dropouts) can become proficient if they are willing to be disruptive like Gates or Zuckerberg. With that hope, this document attempts to lay out a path to becoming a practicing data scientist.

It could at first be daunting. But don't be intimidated! Even though I respect Malcolm Gladwell, I have to encourage you to ignore his 10,000-hour rule. Anyone with passion, determination and discipline can become a data scientist in far fewer than 10,000 hours: maybe 6 months, at approximately 4 hours per day * 5 days per week * 4 weeks per month * 6 months = 480 hours. All this material is basic and mostly intuitive, and it lurks in the subconscious realm of the cognitive apparatus, even that of monkeys, dogs, leopards and of course human beings. Otherwise, we could not catch a ball, a frisbee or prey. I assure you none of this involves string theory, Riemann surfaces, Hilbert spaces or the Tychonoff embedding theorem. We already do so much of this subconsciously; we just have to transfer it to the conscious realm.

Let us go!

Basics: Khan Academy

why now, perfect storm

advances for computing hardware, networking, tools for communication

introduction to data

operational filter: transactional vs master data

domain filter: what values can it hold?

categorical/qualitative (nominal,ordinal)

numerical/quantitative (interval, ratio)

Statistics Refresher

sample/population

iid

bias

randomness

outlier, anomaly, Bonferroni test

sample means, convergence to population mean

CLT: central limit theorem

LLN: law of large numbers

Benford's law: small leading digits

measures of central tendency, moments

mean (median, mode), variance (standard deviation), skew, kurtosis

comovement, relationship: correlation, covariance

distributions: normal (Gaussian), Poisson, uniform
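
The convergence of sample means listed above can be tried out in a few lines. What follows is only a minimal sketch: the choice of Uniform(0, 1) draws, the sample sizes and the helper name sample_mean are assumptions made for illustration, not part of the study plan.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_mean(n):
    """Mean of n iid draws from Uniform(0, 1); the population mean is 0.5."""
    return statistics.fmean(random.random() for _ in range(n))


# Law of large numbers: the sample mean drifts toward 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 4))

# Central limit theorem: means of many samples of size 30 cluster around 0.5
# with a spread close to sigma / sqrt(30), where sigma^2 = 1/12 for Uniform(0, 1).
means = [sample_mean(30) for _ in range(5_000)]
observed_sd = statistics.stdev(means)
expected_sd = (1 / 12) ** 0.5 / 30 ** 0.5
print(round(observed_sd, 4), round(expected_sd, 4))
```

Plotting a histogram of means would show the familiar bell shape, even though the underlying draws are uniform.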

probability

basic properties,

certainty, uncertainty,

impossibility,

knowable,

unknowables,

known unknowables,

unknown unknowables

counting/frequentist

discrete, conditional, joint probabilities

Bayesian probability

continuous probability
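
Conditional, joint and Bayesian probabilities fit together in one small calculation. The sketch below uses made-up numbers (a 1% prior, a 95% hit rate and a 5% false alarm rate); they are assumptions chosen only to make the arithmetic visible.

```python
from fractions import Fraction

# Hypothetical numbers, chosen for illustration only.
p_d = Fraction(1, 100)                 # P(D): prior probability of the condition
p_pos_given_d = Fraction(95, 100)      # P(+ | D): test is positive when D holds
p_pos_given_not_d = Fraction(5, 100)   # P(+ | not D): false alarm rate

# Joint probabilities: P(A and B) = P(A | B) * P(B)
p_pos_and_d = p_pos_given_d * p_d
p_pos_and_not_d = p_pos_given_not_d * (1 - p_d)

# Total probability of a positive result
p_pos = p_pos_and_d + p_pos_and_not_d

# Bayes' rule: P(D | +) = P(+ and D) / P(+)
p_d_given_pos = p_pos_and_d / p_pos
print(p_d_given_pos, float(p_d_given_pos))
```

Under these assumed numbers a positive result still leaves the condition unlikely, because the prior is so small.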

Relationship

regression

parametric

nonparametric

independent vs dependent variables

dependent also known as response

independent aka regressors,predictors

univariate regression

linear relationship y=mx+c

quality of the relationship, goodness of fit

p-value, null hypothesis, R-squared

assumptions

autocorrelation

multicollinearity

heteroskedasticity (nonconstant variance)

tests of normality

tests of randomness

transformation

mixtures

standard normal

lognormal
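
The linear relationship y = mx + c and its goodness of fit can be worked by hand with ordinary least squares. This is a minimal sketch on five invented data points (an assumption, not data from this plan); it estimates m and c and reports R-squared, and leaves out p-values and the assumption checks listed above.

```python
import statistics

# Hypothetical data, roughly y = 2x + 1 with a little noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

mean_x = statistics.fmean(x)
mean_y = statistics.fmean(y)

# Least squares estimates: slope m = Sxy / Sxx, intercept c = mean_y - m * mean_x
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
m = sxy / sxx
c = mean_y - m * mean_x

# Goodness of fit: R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(round(m, 2), round(c, 2), round(r_squared, 4))
```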

multivariate

logistic regression

odds ratio
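
One way to read the odds ratio in logistic regression: raising the predictor by one unit multiplies the odds by exp(slope). The sketch below checks this numerically. The intercept and slope are hypothetical values, not fitted to any data.

```python
import math


def logistic(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients, assumed for illustration only.
intercept, slope = -1.5, 0.8

p_at_1 = logistic(intercept + slope * 1)
p_at_2 = logistic(intercept + slope * 2)

# Ratio of the odds at x = 2 to the odds at x = 1
odds_ratio = odds(p_at_2) / odds(p_at_1)
print(odds_ratio, math.exp(slope))
```

The two printed numbers agree up to floating-point error, which is why logistic coefficients are usually reported as exp(coefficient).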

Linear Algebra

matrices: identity, square, rectangular, symmetric

operations: transpose, inversion, decomposition

roots, positive definiteness, eigenvalues

Cholesky, principal components, singular value decomposition
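
Several of these ideas can be seen together on one small matrix. The sketch below takes a hypothetical 2 x 2 symmetric matrix (an assumption for illustration), computes its eigenvalues from the trace and determinant, checks positive definiteness, and builds its Cholesky factor by hand. In practice a library such as NumPy would do this work.

```python
import math

# Hypothetical symmetric matrix, chosen for illustration only.
A = [[4.0, 2.0],
     [2.0, 3.0]]


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


assert A == transpose(A)  # symmetric: equal to its own transpose

# Eigenvalues of a 2 x 2 matrix are the roots of
# lambda^2 - trace * lambda + det = 0
trace = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(trace ** 2 - 4 * det)
eigenvalues = ((trace - disc) / 2, (trace + disc) / 2)

# A symmetric matrix is positive definite when all eigenvalues are positive.
is_positive_definite = min(eigenvalues) > 0

# Cholesky decomposition: A = L * L^T with L lower triangular
l11 = math.sqrt(A[0][0])
l21 = A[1][0] / l11
l22 = math.sqrt(A[1][1] - l21 ** 2)
L = [[l11, 0.0],
     [l21, l22]]
reconstructed = matmul(L, transpose(L))

print(eigenvalues, is_positive_definite)
print(L)
```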

Applications

analytics

descriptive

predictive

prescriptive

learning and intelligence need big data: the 3 Vs (volume, velocity, variety)

dimensionality reduction

unsupervised learning

clustering

supervised

classification

measures of classification: TP, TN, FP, FN, accuracy, precision, sensitivity
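
The measures of classification all come from four counts. A minimal sketch, with counts that are invented for illustration:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp = 40  # true positives: predicted yes, actually yes
tn = 45  # true negatives: predicted no, actually no
fp = 5   # false positives: predicted yes, actually no
fn = 10  # false negatives: predicted no, actually yes

total = tp + tn + fp + fn

accuracy = (tp + tn) / total     # share of all predictions that are correct
precision = tp / (tp + fp)       # share of predicted positives that are real
sensitivity = tp / (tp + fn)     # share of real positives that are found (recall)

print(accuracy, precision, sensitivity)
```

Accuracy alone can mislead when one class is rare, which is why precision and sensitivity are reported alongside it.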

semi-supervised, hybrid

network, hidden, feedback, self-correcting

deep learning, Boltzmann Machine, Markov Chain

Information Retrieval

Entropy

Gain
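
Entropy and gain can be computed directly from label counts. The sketch below assumes that "Gain" means information gain, the drop in entropy after splitting a set of labels. The labels, the split and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: an even mix, split into two mostly pure groups.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

gain = information_gain(parent, [left, right])
print(entropy(parent), gain)
```

An even two-class mix has an entropy of 1 bit; the purer the groups after a split, the larger the gain.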

References (to be continued)

KhanAcademy.com

http://tutors4you.com/probabilitytutorial.htm

http://www.mathportal.org/linear-algebra/vectors/dot-product.php

http://www.stat.berkeley.edu/~brill/Stat153/tstests.pdf

http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

http://singhal.info/ieee2001.pdf Introduction to Information Retrieval

http://www.cs.columbia.edu/~gravano/Qual/Papers/singhal.pdf

http://times.cs.uiuc.edu/course/410/note/mle.pdf




Acknowledgements

To all those who have taught me everything I have learned in life, starting with my mother.
