Having fun with stats, maths and games in life!
Adjunct, MoT and CS&E Department
Tandon School of Engineering
New York University
1/1/2016
Raman Kannan
A Study Plan to become a practicing data scientist!
Outline for having fun with stats, maths and games in life
Contents
Introduction
Basics: Khan Academy
why now, perfect storm
advances for computing hardware, networking, tools for communication
introduction to data
sample/population
iid
bias
Relationship
univariate regression
multivariate
logistic regression
Linear Algebra
matrices, identity, square, rectangular, symmetric
operations: transpose, inversion, decomposition
roots, positive definiteness, eigenvalues
Cholesky, principal components, singular value decomposition
Applications
analytics
descriptive
predictive
prescriptive
learning and intelligence need big data 3V
dimensionality reduction
unsupervised learning
clustering
supervised
classification
measures of classification: TP, TN, FP, FN, accuracy, precision, sensitivity
semi-supervised, hybrid
network, hidden, feedback, self-correcting
deep learning, Boltzmann machine, Markov chain
Information Retrieval: Entropy, Gain
Introduction
To paraphrase Einstein, the shortfall of "qualified labor" cannot be solved if we continue with
the same mentality that created it. We need to be disruptive. There is no need for a university or
college degree, or for any formal structure. Mathematics and analytics are a universal, basic
language, and anyone (returning veterans, dropouts) can become proficient if they are willing to be
disruptive like Gates or Zuckerberg. With that hope, this document attempts to lay out a path to
becoming a practicing data scientist.
It could be daunting at first. But don't be intimidated! Even though I respect Malcolm Gladwell, I
have to encourage you to ignore his 10,000-hour rule. Anyone with passion, determination and
discipline can become a data scientist in far fewer than 10,000 hours: perhaps 6 months, at
approximately 4 hours per day * 5 days per week * 4 weeks per month * 6 months = 480 hours. That is
because all of this material is basic and mostly intuitive, and it already lurks in the subconscious
realm of the cognitive apparatus, even that of monkeys, dogs, leopards and, of course, human beings.
Otherwise, we could not catch a ball, a frisbee or prey. I assure you none of this involves string
theory, Riemann surfaces, Hilbert spaces or the Tychonoff embedding theorem. We already do so much
of this subconsciously; we just have to transfer it to the conscious realm.
Let us go!
Basics: Khan Academy
why now, perfect storm
advances for computing hardware, networking, tools for communication
introduction to data
operational filter: transactional vs master data
domain filter: what values can it hold?
categorical/qualitative (nominal, ordinal)
numerical/quantitative (interval, ratio)
Statistics Refresher
sample/population
iid
bias
randomness
outlier, anomaly, Bonferroni test
sample means, convergence to population mean
CLT: central limit theorem
LLN: law of large numbers
Benford's law: small leading digits
measures of central tendency, moments
mean (median, mode), variance (standard deviation), skew, kurtosis
comovement, relationship: correlation, covariance
distributions: normal (Gaussian), Poisson, uniform
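The sample-mean items above (convergence to the population mean, LLN, CLT) can be tried out with a few lines of Python. This is only a sketch: the Uniform(0, 1) population, the seed and the sample sizes are arbitrary choices for illustration, not part of the study plan.

```python
import random
import statistics


def sample_mean(n, rng):
    """Mean of n independent draws from Uniform(0, 1)."""
    return sum(rng.random() for _ in range(n)) / n


rng = random.Random(42)  # fixed seed so the run is repeatable

# Law of large numbers: the sample mean approaches the population mean 0.5
# as the sample size grows.
lln_means = {n: sample_mean(n, rng) for n in (10, 1_000, 100_000)}

# Central limit theorem: means of many samples of size n cluster around 0.5
# with spread close to sigma / sqrt(n), where sigma^2 = 1/12 for Uniform(0, 1).
n = 50
means = [sample_mean(n, rng) for _ in range(5_000)]
observed_sd = statistics.stdev(means)
predicted_sd = (1 / 12) ** 0.5 / n ** 0.5

print(lln_means)
print(observed_sd, predicted_sd)
```

Plotting a histogram of `means` should show the familiar bell shape even though the underlying draws are uniform.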
probability
basic properties,
certainty, uncertainty,
impossibility,
knowable,
unknowables,
known unknowables,
unknown unknowables
counting/frequentist
discrete, conditional, joint probabilities
Bayesian probability
continuous probability
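As a small worked example of conditional, joint and Bayesian probability, the sketch below applies Bayes' theorem to a screening test. The function name and all three input numbers are made up for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Joint probabilities: P(positive and condition), P(positive and no condition).
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    # The marginal P(positive) is the sum of the two joint probabilities.
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with an apparently accurate test, a positive result here implies only about a 16% chance of having the condition, because the condition is rare to begin with.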
Relationship
regression
parametric
nonparametric
independent vs dependent variables
dependent also known as response
independent aka regressors, predictors
univariate regression
linear relationship y=mx+c
quality of the relationship, goodness of fit
p-value, null hypothesis, R-squared
assumptions
autocorrelation
multicollinearity
heteroskedasticity (nonconstant variance)
tests of normality
tests of randomness
transformation
mixtures
standard normal
lognormal
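The univariate items above (the line y = mx + c and R-squared as a goodness-of-fit measure) can be sketched with ordinary least squares in plain Python. The helper name `fit_line` and the five data points are invented for illustration; p-values and the assumption checks listed above are not covered here.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, c, 1 - ss_res / ss_tot


# Made-up data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
m, c, r2 = fit_line(xs, ys)
print(m, c, r2)
```

An R-squared close to 1 says the line explains most of the variation in y; it says nothing about whether the regression assumptions hold.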
multivariate
logistic regression
odds ratio
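Odds, log-odds and the odds ratio are the vocabulary of logistic regression: the model is linear in the log-odds, and the exponential of a coefficient is the odds ratio for a one-unit change in that predictor. The sketch below only illustrates the definitions; the two probabilities are made up.

```python
import math


def odds(p):
    """Odds in favour of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Log-odds, the quantity logistic regression models linearly."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical event probabilities in two groups.
p_group_a, p_group_b = 0.8, 0.5
odds_ratio = odds(p_group_a) / odds(p_group_b)
print(odds_ratio)
```

Here the odds are roughly 4 to 1 in the first group and 1 to 1 in the second, so the odds ratio is about 4.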
Linear Algebra
matrices, identity, square, rectangular, symmetric
operations: transpose, inversion, decomposition
roots, positive definiteness, eigenvalues
Cholesky, principal components, singular value decomposition
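To make decomposition and positive definiteness concrete, here is a sketch of the Cholesky factorization in plain Python, using lists of lists rather than a numerical library. The example matrix is a standard textbook one; in practice a library routine would normally be used instead.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L * L^T == a.

    Assumes a is symmetric; raises ValueError if it is not positive definite.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)
for row in L:
    print(row)
```

The factorization succeeds exactly when the symmetric matrix is positive definite, so the same routine doubles as a simple test of positive definiteness.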
Applications
analytics
descriptive
predictive
prescriptive
learning and intelligence need big data (the 3 Vs: volume, velocity, variety)
dimensionality reduction
unsupervised learning
clustering
supervised
classification
measures of classification: TP, TN, FP, FN, accuracy, precision, sensitivity
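The measures of classification listed above follow directly from the four confusion-matrix counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision and sensitivity (recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all cases classified correctly
        "precision": tp / (tp + fp),     # share of predicted positives that are real
        "sensitivity": tp / (tp + fn),   # share of real positives that were found
    }


# Hypothetical counts for 100 test cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts the accuracy is 0.85, the precision is about 0.889 and the sensitivity is 0.8.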
semi-supervised, hybrid
network, hidden, feedback, self-correcting
deep learning, Boltzmann machine, Markov chain
Information Retrieval
Entropy
Gain
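Entropy and gain can be sketched in a few lines. The example assumes "Gain" means information gain, that is, the drop in Shannon entropy after splitting a labelled set into groups; the labels and the split are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy of labels minus the size-weighted entropy of the groups."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder


# Ten made-up labels, then a hypothetical split into two groups of five.
labels = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
gain = information_gain(labels, split)
print(entropy(labels), gain)
```

A perfectly mixed set of two classes has entropy of 1 bit; the split above makes each group purer and so yields a positive gain of roughly 0.28 bits.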
References (to be continued)
KhanAcademy.com
http://tutors4you.com/probabilitytutorial.htm
http://www.mathportal.org/linear-algebra/vectors/dot-product.php
http://www.stat.berkeley.edu/~brill/Stat153/tstests.pdf
http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
http://singhal.info/ieee2001.pdf Introduction to Information Retrieval
http://www.cs.columbia.edu/~gravano/Qual/Papers/singhal.pdf
http://times.cs.uiuc.edu/course/410/note/mle.pdf
Acknowledgements
To all those who have taught me everything I have learned in life, starting with my mother.