34
VERY BASIC OVERVIEW OF STATISTICS AND MACHINE LEARNING INTRODUCTION TO DATA SCIENCE ELI UPFAL

VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

VERY BASIC OVERVIEW OF STATISTICS ANDMACHINE LEARNING

INTRODUCTION TO DATA SCIENCE

ELI UPFAL

Page 2: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

MACHINE LEARNING – exciting!

Page 3: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

MACHINE LEARNING – exciting!

STATISTICS - boring

Page 4: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

MACHINE LEARNING – exciting!

STATISTICS - boring

ACTULLY – not that different

Page 5: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

EXTRACTING INFORMATION FROM DATA

Data Analysis

ModelPredictions

Page 6: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

“IT’S DIFFICULT TO MAKE PREDICTIONS, ESPECIALLY ABOUT THE FUTURE”

Example of ML predictions:

Observation: Millions of connections to web sites.

If you connect from IP 128.148.31.5, your computer type is …., your operating system is …., your mouse movements profile is …. , thenyour likely age is …. your likely disposable income is …. your likely gender is ...., your likelihood of a purchase is…

Works very well in practice!

Page 7: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

“IT’S DIFFICULT TO MAKE PREDICTIONS, ESPECIALLY ABOUT THE FUTURE”Example of ML predictions:

Observation: Millions of connections to web sites.

If you connect from IP 128.148.31.5, your computer type is …., your operating system is …., your mouse movements profile is …. , etc. then your likely age is …. your likely disposable income is …. your gender is ...., your likelihood of a purchase is…

Works very well in practice!

Observation: Among flu patients, those who take medicine have a longer recovery

If you take medicine for flu then you have longer recovery.

Obviously wrong!

Page 8: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

Just because a machine learning, data mining, or data analysis application outputs a result - it doesn’t mean that it’s right

Data analysis is often misleading

Machine learning without statistical analysis is pure nonsense

Page 9: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data
Page 10: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data
Page 11: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data
Page 12: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

How do we distinguish between facts and coincident?

Which Correlation is more convincing?

Page 13: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

DATA: LEARNING FROM A TRAINING SET (SAMPLE)

Page 14: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

THE DATA IS THR MODEL

Page 15: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

STATISTICS VS. MACHINE LEARNING

First there was statistics: Strict criteria for when an hypothesis (”discovery”) is statistically significant

Strong assumptions, elaborate computation

Then came Computer Science:Emphasize on efficient computation

Output best approximation, even if not certain

... And a lot of BIG data

With lucrative business applications.

Page 16: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

GOAL OF THIS PART OF THE COURSE

Basic machine learning tools for data analysis

Very basic concepts in probability and statistics

Understanding the power and pitfalls of data analysis

Page 17: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

MACHINE LEARNING PROBLEMS

17

classification or categorization clustering

regression dimensionality reduction

Supervised Learning Unsupervised Learning

Dis

cret

eC

ontin

uous

Page 18: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

MACHINE LEARNING PROBLEMS

18

classification or categorization clustering

regression dimensionality reduction

Supervised Learning Unsupervised Learning

Dis

cret

eC

ontin

uous

Page 19: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

EXAMPLE: TITANIC DATASET

19

Label Features

Can we predict survival from these features?

Page 20: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data
Page 21: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

21

Page 22: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

22

Page 23: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

23

Page 24: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data
Page 25: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

DECISION TREE

25

sex

age

pclass

femalemale

survive

not survive

3

1

not survive

survive

<=4

not survive

>4

Every path from the root is

a rule

2

Page 26: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

WHAT IS A CLASSIFIER

Apply a prediction function to a feature representation of animage/data-set to get the desired output:

f( ) = “apple”f( ) = “tomato”f( ) = “cow”

Slide credit: L. Lazebnik

Page 27: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

THE MACHINE LEARNING FRAMEWORK

y = f(x)

Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set

Testing: apply f to a never before seen test example x and output the predicted value y = f(x)

output prediction function

features

Slide credit: L. Lazebnik

Page 28: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

DECISION BOUNDARIES:DECISION TREES

x x

x x

x

xx

xo

oo

o

o

oo

x2

x1

Page 29: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

CLASSIFIER OVERVIEW

Page 30: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

RECAP: DECISION TREES

Representation• A set of rules: IF…THEN conditions

Evaluation• coverage: # of data points that satisfy conditions• accuracy = # of correct predictions / coverage

Optimization• Build decision tree that maximize accuracy

Page 31: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

ML PIPELINE (SUPERVISED)

31

Page 32: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

FEATURESFact Table

- Shop_ID- Customer_

ID- Date_ID- Product_ID- Amount- Volume- Profit- …

Product

- Product_ID- Type_ID- Brand_ID- Length- Height- Depth- Weight- …

Product_Type

- Type_ID- Name- Description- …

Brand

- Brand_ID- Name- …

Fact Table

- Shop_ID- Customer_

ID- Date_ID- Product_ID- Amount- Volume- Profit- Delivery

Time- …

Custermer State

ProductType

Product Weight

Volume (L*H*D)

Month Delivery Time

Page 33: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

IMAGE FEATURES

Raw pixels

Histograms

Geometric features – edge

Corner,…

Slide credit: L. Lazebnik

Page 34: VERY BASIC OVERVIEW OF STATISTICS AND MACHINE …Just because a machine learning, data mining, or data analysis application outputs a result -it doesn’t mean that it’s right Data

TEXT FEATURES

Hair

Spam

Not Spam

Bag of Words𝑉𝑖𝑎𝑔𝑟𝑎𝑆𝑜𝑓𝑡𝐻𝑒𝑟𝑏𝑒𝑙𝑃𝑖𝑙𝑙𝑠𝐴𝑟𝑒…

ℎ𝑒𝑟𝑏𝑒𝑙𝑝𝑖𝑙𝑙𝑠𝑝𝑖𝑙𝑙𝑠𝑓𝑜𝑟𝑓𝑜𝑟𝐻𝑎𝑖𝑟

𝐻𝑎𝑖𝑟𝑒𝑛𝑙𝑎𝑟𝑔𝑒𝑚𝑒𝑛𝑡𝑒𝑛𝑙𝑎𝑟𝑔𝑒𝑚𝑒𝑛𝑡𝑇𝑒𝑐ℎ𝑛𝑖𝑞𝑢𝑒𝑠

N-Grams