From statistics to data science BAE 815 (Fall 2017) Dr. Zifei Liu Zifeiliu@ksu.edu

From statistics to data science

BAE 815 (Fall 2017)

Dr. Zifei Liu

Zifeiliu@ksu.edu

The Data-Information-Knowledge-Wisdom Hierarchy

- Russell Ackoff

[Figure: the DIKW pyramid. Questions shown alongside the levels: What? How much? How many? How? Why? The data level is annotated "Individual facts (quantities, characters, or symbols)".]

1 exabyte = 1 billion GB = 10^18 bytes

How do we make decisions?

• Experience
• Data (experiments)
• Statistics (probability, uncertainty)
• Big data
• Data science

Questions that you can answer with data science

• How much? - or - How many?
– Regression algorithms
• What is it? Is this A or B?
– Classification algorithms
• Is this weird?
– Anomaly detection algorithms

Correlation vs. causation

[Diagram: relationships between A and B numbered (1) to (5); case (3) involves a third variable C and case (5) is coincidence.]

Causation is not observed but inferred

• Social drinking vs. earnings

• Energy consumption vs. economic growth

• Debt rate vs. performance of company

• Shoe size vs. reading ability

• Ice cream consumption vs. rate of drowning

• Obesity vs. diabetes (risk factor)

• Children who get tutored get worse grades than children who do not get tutored
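
A small simulation can make the third-variable case concrete. The sketch below is only illustrative: the variable names, coefficients, and noise levels are invented, with a hypothetical common cause (daily temperature) driving both ice cream consumption and drownings, and neither of those affecting the other.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical common cause C (daily temperature); all numbers are made up.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
# A and B each depend on C plus independent noise; neither depends on the other.
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temperature]

# Strong correlation between A and B, although there is no causal link.
r_ab = pearson_r(ice_cream, drownings)

# Hold C roughly constant (a narrow temperature band): the correlation
# between A and B largely disappears.
band = [(a, b) for t, a, b in zip(temperature, ice_cream, drownings)
        if 20 <= t < 22]
r_ab_fixed_c = pearson_r([a for a, _ in band], [b for _, b in band])

print(round(r_ab, 2), round(r_ab_fixed_c, 2))
```

The first number should be large and the second close to zero, which is the pattern expected when a common cause, rather than A or B, produces the association.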

Population vs. sample

Population (size N)

Sample (size n)

Statistic

Standard deviation: s

Standard error: $SE_{\bar{Y}} = \frac{s}{\sqrt{n}}$
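
As a minimal sketch of the relationship between the sample standard deviation s and the standard error of the mean (s divided by the square root of n), the following uses Python's standard library. The sample values are hypothetical.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n)."""
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return s / math.sqrt(len(sample))


# Hypothetical sample of n = 8 measurements drawn from a larger population.
sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.2, 4.6, 5.0]
print("mean =", round(statistics.mean(sample), 3))
print("s    =", round(statistics.stdev(sample), 3))
print("SE   =", round(standard_error(sample), 3))
```

Because n appears under a square root, quadrupling the sample size only halves the standard error for the same s.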

True situation | Our conclusion | Outcome | Control errors
No effect (negative) | Not significant | True negative |
No effect (negative) | Significant (reject H0) | False positive ("Type I error") | Confidence level, P value
Has an effect (positive) | Significant (reject H0) | True positive |
Has an effect (positive) | Not significant | False negative ("Type II error") | Statistical power, sample size

Null hypothesis (H0): A has no effect on B.
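
The two error types can be illustrated by simulation. This is a sketch under stated assumptions, not a prescribed procedure: it uses a two-sided z-test with a known population standard deviation of 1, an invented effect size of 0.5, and hypothetical helper names (`z_test_p_value`, `rejection_rate`).

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(sample, mu0, sigma):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(true_mean, n, alpha=0.05, trials=2000, seed=1):
    """Fraction of simulated experiments in which H0 (mean == 0) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / trials


# H0 is true: every rejection is a false positive (Type I error).
type_i_rate = rejection_rate(true_mean=0.0, n=20)
# H0 is false: every rejection is a true positive, so the rate estimates power.
power_small = rejection_rate(true_mean=0.5, n=10)
power_large = rejection_rate(true_mean=0.5, n=40)

print(type_i_rate, power_small, power_large)
```

The Type I rate should sit near the chosen significance level (0.05), while the power rises with sample size, so the Type II error rate falls.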

Avoid confounding variables

Confounding/nuisance variables: undesired sources of variation that affect the dependent variable.

[Diagram: independent variable, dependent variable, and other variables, labeled A to F.]

• If you can, fix the confounding variable (make it a constant).
• If you can't fix the confounding variable, use blocking.
• If you can neither fix nor block the confounding variable, use randomization.
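
Blocking and randomization can be combined in a randomized block assignment. The sketch below assumes a hypothetical experiment with eight plots in two fields, where the field is the nuisance variable. The function name and data are invented for illustration.

```python
import random
from collections import defaultdict


def randomize_within_blocks(units, block_of, treatments, seed=0):
    """Assign treatments at random, separately inside each block.

    units: list of unit identifiers
    block_of: dict mapping unit -> block label (the nuisance variable)
    treatments: list of treatment labels
    """
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for unit in units:
        blocks[block_of[unit]].append(unit)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)  # random order within the block
        for i, unit in enumerate(shuffled):
            # Cycle through treatments so each block is balanced.
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


# Hypothetical example: 8 plots, 4 in each of two fields.
units = [f"plot{i}" for i in range(8)]
block_of = {u: ("field_A" if i < 4 else "field_B") for i, u in enumerate(units)}
assignment = randomize_within_blocks(units, block_of, ["control", "treated"])

for unit in units:
    print(unit, block_of[unit], assignment[unit])
```

Each field receives the same number of control and treated plots, so a difference between fields cannot masquerade as a treatment effect. The shuffle inside each block guards against nuisance variables that were neither fixed nor blocked.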

Common probability distributions

Regression analysis

R²: coefficient of determination, 0 to 1

R: correlation coefficient, -1 to +1

• Linear regression
• Logistic regression
• Nonlinear regression
• Stepwise regression
- Forward
- Backward
• Ridge, LASSO & ElasticNet regression
- Handle multicollinearity among variables
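
To show how R and R² relate in the simplest case, here is a minimal sketch of ordinary least-squares fitting of a straight line in plain Python. The data points are hypothetical, and `fit_line` is an illustrative helper rather than a library function.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; also returns Pearson's R."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, r


# Hypothetical data, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r = fit_line(xs, ys)

# R^2 from its definition: 1 - (residual sum of squares / total sum of squares).
mean_y = sum(ys) / len(ys)
ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 3), round(intercept, 3), round(r, 4), round(r_squared, 4))
```

For a straight-line fit with an intercept, R² computed this way equals the square of the correlation coefficient R. That identity does not carry over unchanged to the other regression types listed above.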

Machine learning

• Learning:
- improve performance from experience.
• Machine learning:
- teach computers to make and improve predictions based on data; an approach to achieving artificial intelligence
- classification
- prediction (regression)
• Data mining:
- use algorithms to create knowledge from data.

Bayesian statistics for machine learning

Bayes' rule provides the tools to update the probability for a hypothesis as more evidence or information becomes available.
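
A short sketch of such an update, assuming a hypothesis H with a prior probability of 0.01 and a piece of evidence that is observed with probability 0.95 when H is true and 0.05 when it is false. All three numbers are invented for illustration.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Posterior P(H | E) from Bayes' rule.

    P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | not H) P(not H)]
    """
    numerator = p_evidence_given_h * prior
    evidence = numerator + p_evidence_given_not_h * (1 - prior)
    return numerator / evidence


belief = 0.01  # hypothetical prior probability of H
history = [belief]
for _ in range(2):
    # Each new, independent observation uses the previous posterior as its prior.
    belief = bayes_update(belief, p_evidence_given_h=0.95,
                          p_evidence_given_not_h=0.05)
    history.append(belief)

print([round(p, 3) for p in history])
```

Each posterior becomes the prior for the next observation, which is how the probability of a hypothesis is revised as more evidence arrives.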

Common data science algorithms

• Linear regression
• Decision tree
• Random forest
• Association rule mining
• K-Means clustering

Unsupervised = exploratory

Supervised = predictive

Decision tree

• The attribute with the largest standard deviation (std) reduction is chosen for the decision node.

• Stop when the std for the branch becomes smaller than a certain fraction (e.g., 5%) of the std for the full dataset, or when too few instances remain in the branch.

http://www.saedsayad.com/decision_tree_reg.htm

[Figure: example split of a 14-instance dataset (Std=9.32) into three branches: 4/14 with Std=3.49, 5/14 with Std=10.87, and 5/14 with Std=7.78.]
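
The standard deviation reduction for the split shown in the figure can be checked with a few lines of Python. This sketch assumes the reduction is the parent std minus the size-weighted average of the branch stds, and that the population standard deviation is used. The function names are illustrative.

```python
from statistics import pstdev


def std_reduction_from_stats(parent_std, branches):
    """Parent std minus the size-weighted average of the branch stds.

    branches: list of (count, std) pairs, one per branch of the candidate split.
    """
    total = sum(count for count, _ in branches)
    weighted_std = sum(count / total * std for count, std in branches)
    return parent_std - weighted_std


def std_reduction(target_values, branch_labels):
    """Same quantity computed from raw data (population std is an assumption)."""
    groups = {}
    for y, label in zip(target_values, branch_labels):
        groups.setdefault(label, []).append(y)
    branches = [(len(g), pstdev(g)) for g in groups.values()]
    return std_reduction_from_stats(pstdev(target_values), branches)


# Numbers from the figure: full dataset Std=9.32; branches 4/14, 5/14, 5/14.
sdr = std_reduction_from_stats(9.32, [(4, 3.49), (5, 10.87), (5, 7.78)])
print(round(sdr, 2))  # 1.66
```

Repeating the calculation for every candidate attribute and keeping the one with the largest reduction gives the decision node.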

Decision tree

• You can define a split point for either a categorical or a continuous variable.

• Split the dataset based on the homogeneity of the data.

[Figure: feature space with axes X1 and X2]

Classification & Regression Trees (CART)

(Ankit Sharma, 2014)

Random forest

• Averages multiple deep decision trees trained on different parts of the same training set, which overcomes the overfitting problem of an individual decision tree.

• Widely used machine learning algorithm for classification:

- Approx. 2/3 of the total training data are selected at random to grow each tree.

- At each node, a subset of predictor variables is selected at random and the best split among them is used to split the node.

- For each tree, the leftover data (approx. 1/3) are used to calculate the out-of-bag error rate.

- Each tree gives a classification. The forest chooses the classification with the most votes over all the trees in the forest.
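
Two of these steps, drawing the training rows for each tree and combining the trees' votes, can be sketched without any tree-fitting code. The sketch assumes the rows are drawn by bootstrap sampling (with replacement), a common way to obtain the roughly 2/3 versus 1/3 split. The function names and the example votes are hypothetical.

```python
import random
from collections import Counter


def bootstrap_split(n_rows, rng):
    """Draw n_rows indices with replacement; rows never drawn are 'out of bag'."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
n_rows = 3000

# Average out-of-bag fraction over 20 simulated trees.
oob_fractions = []
for _ in range(20):
    _, oob = bootstrap_split(n_rows, rng)
    oob_fractions.append(len(oob) / n_rows)
mean_oob_fraction = sum(oob_fractions) / len(oob_fractions)

print(round(mean_oob_fraction, 3))            # roughly one third of the rows
print(majority_vote(["A", "B", "A", "A", "B"]))  # A
```

The out-of-bag rows were never seen by the tree in question, so they act as a built-in validation set for that tree.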

Variable importance plot

Random forests can be used to rank the importance of variables in a regression or classification problem.

• Mean decrease accuracy: how much the model accuracy decreases when the information in that variable is removed (in practice, by randomly permuting its values)

• Mean decrease Gini: measure of variable importance based on the Gini impurity index used for the calculation of splits in trees

[Figure: variable importance plot for an example classifying income of adults]

Association rule mining

An association rule is a pattern stating that when X occurs, Y occurs with a certain probability (an if/then statement).

Initially used for market basket analysis, to find how items purchased by customers are related.

$\text{support} = \frac{(X \cup Y).\text{count}}{n}$

$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}}$

where n is the total number of transactions.

Goal: find all rules that satisfy the user-specified minimum support and minimum confidence.

Association rule mining (the Apriori Algorithm)

Transaction database (min support = 50%):

TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Candidate 1-itemsets:

itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

Frequent 1-itemsets:

itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

Candidate 2-itemsets:

itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

Frequent 2-itemsets:

itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

Candidate and frequent 3-itemsets:

itemset | sup
{2 3 5} | 2

Rules from {2 3 5}:

2, 3 → 5: confidence = 100%
3, 5 → 2: confidence = 100%
2, 5 → 3: confidence = 67%
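
The worked example (four transactions, minimum support 50%) can be reproduced with a compact sketch of the level-wise Apriori search. This is a teaching-sized version rather than an optimized implementation, and the function names are illustrative.

```python
from itertools import combinations

# Transaction database from the example: TID -> set of items.
transactions = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}
min_support = 0.5


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def apriori():
    """Return {frequent itemset: support count}, found level by level."""
    min_count = min_support * len(transactions)
    items = sorted(set().union(*transactions.values()))
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        counts = {c: support_count(c) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent


def rules_from(itemset, frequent):
    """Confidence of every rule antecedent -> (itemset - antecedent)."""
    rules = {}
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            confidence = frequent[itemset] / frequent[antecedent]
            rules[(antecedent, itemset - antecedent)] = confidence
    return rules


frequent = apriori()
rules = rules_from(frozenset({2, 3, 5}), frequent)
for (lhs, rhs), conf in sorted(rules.items(), key=lambda kv: -kv[1]):
    print(sorted(lhs), "->", sorted(rhs), f"confidence={conf:.0%}")
```

The prune step is what keeps the search tractable: {1 2 3} and {1 3 5} are never counted against the database because {1 2} and {1 5} are already known to be infrequent.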

K-Means clustering

The algorithm works iteratively to assign each data point to one of K groups based on feature similarity (e.g., a defined distance measure).

• Finds the centroids of the K clusters

• Produces labels for the training data
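
The iteration can be sketched in plain Python as follows. The sketch assumes squared Euclidean distance as the similarity measure and picks the initial centroids at random from the data. The six points are hypothetical, and `k_means` is an illustrative helper rather than a library call.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8),
          (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The result depends on the starting centroids, so in practice the algorithm is often run several times with different seeds and the best clustering is kept.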

Open-source language for data science

Become a data scientist?

Demand for deep analytical talent in the U.S. projected to be 50-60% greater than supply by 2018.

[Figure: job trends from indeed.com]