Introduction to Data Science
Innovation Center, Interdisciplinary Product Development, BMW USA
Interdisciplinary Product Development: CS 491, DES 430, IE 444, ME 444, MKTG 477
UIC Innovation Center, Fall 2017 and Spring 2018
Instructors: Charles Frisbie, Marco Susani, Michael Scott and Ugo Buy
Author: Ugo Buy


What is data science?

•  Discipline seeking to extract knowledge and insights from large amounts of raw data
  −  Examples: Predict income level from age; predict gender of a Twitter user from the colors chosen in tweets, etc.
•  Multidisciplinary in nature, mostly borrowing from:
  −  Statistics
  −  Computer Science (databases, machine learning, data mining, parallel computing)
  −  Data Visualization
•  AKA “Data Analytics”
•  Wide array of applications
  −  Medical sciences (healthcare)
  −  Finance (market predictions)
  −  Logistics, etc.


Drew Conway’s Venn diagram

•  Multidisciplinary convergence:
  −  Math and statistics
  −  Domain knowledge
  −  Computer science
•  Detailed descriptions make explicit the role of HCI and UX in data science
  −  HCI = Human-Computer Interaction
  −  UX = User Experience


Our learning objectives

Overarching pedagogical goal:
•  Learn how to extract knowledge from mobility and transportation datasets
  −  Public datasets: UIC library, Bureau of Transportation Statistics, Chicago Data Portal, etc.
  −  BMW datasets (hopefully)

Specific learning objectives:
•  Learn the basics of statistical learning
  −  Input variables (aka features or predictors) vs. responses (aka outcomes or output variables)
  −  Distinguish different prediction methods: regression and classification
  −  Regression = predicted variable is continuous (e.g., predict vehicle value based on family income, etc.)
  −  Classification = predicted variable is discrete (e.g., fraudulent vs. legitimate transaction, male vs. female user)
•  Learn how to visualize analysis results (Professor Susani)
  −  Box plots, scatter plots, histograms, etc.


Resources

Statistical learning: An Introduction to Statistical Learning (ISLR)
•  PDF available from http://www-bcf.usc.edu/~gareth/ISL/

Computer Science: Various languages and frameworks that support statistical analysis and data processing, e.g.,
•  R (language for statistical computing) – https://www.r-project.org/
•  Hadoop (framework for distributed data processing) – http://hadoop.apache.org/


Public and UIC datasets

1.  SimplyAnalytics database (UIC Library)
  −  EASI → Census Data → Employment
  −  EASI → Census Data → Vehicles
2.  Chicago Data Portal (public)
  −  https://data.cityofchicago.org/
  −  Transportation data
  −  Similar sites for NYC, LA, SFO, etc.
  −  Counties sometimes have similar sites
3.  National Transit Database (public)
  −  https://www.transit.dot.gov/ntd
  −  Asset Inventory Module (aka vehicles)
4.  Reference USA database (public)
  −  http://resource.referenceusa.com/
  −  Use advanced search
  −  Location and number of gas stations, car rental companies, etc.
5.  Bureau of Transportation Statistics (BTS)
  −  https://www.bts.gov/
  −  Intermodal transportation database
  −  Data on commercial aviation
  −  Data on transportation economics


What we do with datasets of interest

•  We extract information by means of statistical analysis
•  Paradigm
1.  Formulate a hypothesis (i.e., ask a question)
  −  Example: Is there a correlation between urban traffic density and air pollution?
2.  Apply statistical learning methods to the dataset
  −  Compute correlation indices between input and output variables, e.g., using regression analysis
3.  Analyze statistical data to validate or refute the initial hypothesis
  −  Null hypothesis: No significant correlation between input and output variables (variables are independent of each other)
  −  Alternative hypothesis: Variables are in fact correlated (e.g., when input is high, output is likely to be low)


Correlation ≠ Causality

•  Ultimate goal of correlation analysis: Establish causal relationships between different variables
  −  If two variables are correlated, there could be a causal relationship between the variables…
  −  ... or not
•  Analysis of beach communities shows high correlation between ice cream sales and shark attacks
  −  But nobody is suggesting cutting ice cream sales as a way of preventing shark attacks
  −  Ice cream sales and shark attacks are correlated but not causally related

Source: https://m.xkcd.com/552/
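
The ice cream example can be illustrated with a small simulation. This is only a sketch under a loud assumption: the data below are made up, and the hidden common cause (daily temperature driving both ice cream sales and shark attacks) is a hypothetical explanation that is not stated on the slide. All variable names and coefficients are invented for illustration.

```r
# Simulated (made-up) data: temperature is an assumed common cause
set.seed(10)
temperature <- runif(200, 10, 35)
ice_cream   <- 50 + 4 * temperature + rnorm(200, sd = 10)
sharks      <- 0.2 * temperature + rnorm(200, sd = 1)

# The two outcomes are strongly correlated with each other
raw_cor <- cor(ice_cream, sharks)

# Remove the effect of temperature from both, then correlate what is left
partial_cor <- cor(residuals(lm(ice_cream ~ temperature)),
                   residuals(lm(sharks ~ temperature)))

raw_cor      # large
partial_cor  # close to zero
```

Once the assumed common cause is accounted for, the correlation between the two variables largely disappears, which is one way a correlation can arise without a causal link.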


Basic statistics definitions

•  Average (aka mean value): Given a set of n values, their average μ is the sum of the values divided by n
  −  Assume dataset = (12, 18, 6, 20, 24), then average μ = (12+18+6+20+24)/5 = 16
•  Median: Given a set of n values, the median M is the “value in the middle” once the values are sorted
  −  Dataset above → M = 18
  −  Often more useful than the average, because the average is sometimes affected by outliers
•  Variance: Average of the squared differences of the values from the mean, denoted by σ²
  −  Indication of “how spread out” values are around the average
  −  Sets (5, 10, 10, 15) and (9, 10, 10, 11) have the same μ = 10, but their variances are different (12.5 vs. 0.5)
•  Standard deviation: The square root of the variance, denoted by σ
  −  How much you should expect a random value to differ from the mean
  −  σ ≈ 3.536 and σ ≈ 0.707 for the two sets above
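
The definitions above can be checked in R. This is a minimal sketch: it uses the values that appear in the sum of the average example (12, 18, 6, 20, 24), and `pop_var` is a helper name introduced here for the "average of squared differences" definition. Note that R's built-in var() and sd() divide by n − 1 rather than n, so they give slightly larger values than the ones on the slide.

```r
# Dataset from the average/median example
x <- c(12, 18, 6, 20, 24)
mean(x)    # 16
median(x)  # 18

# Variance as defined on the slide: average squared difference from the mean
pop_var <- function(v) mean((v - mean(v))^2)

a <- c(5, 10, 10, 15)
b <- c(9, 10, 10, 11)
pop_var(a)        # 12.5
pop_var(b)        # 0.5
sqrt(pop_var(a))  # about 3.536
sqrt(pop_var(b))  # about 0.707

# Built-in functions use the sample formulas (divide by n - 1)
var(a)  # about 16.67
sd(a)   # about 4.08
```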


How do statistics help us?

•  Plotting wage data (response variable) with respect to age (input variable) or year (input variable)
•  Blue lines represent averages for each age and year value: they help make sense of the data!

Source: ISLR, page 2


The key goal: Express output as a function of input + some error

•  Given an input variable X, estimate response variable Y as a function of X plus some error ε: $Y = f(X) + \varepsilon$
•  See how f may help understand the relation between input and output variables
•  Population = 30 people with different incomes and education

Source: ISLR, page 16


The inference problem

•  Given a response variable Y and a set of input variables Xi
  −  Which input variables will affect the response?
  −  What is the relationship between the response and each input variable?
  −  Can the relationship be modeled as a linear function or is it more complex?
•  We will consider linear relationships first
•  Example: different advertising markets

Source: ISLR, page 16


Simple linear regression

•  Statistical model assuming that a single input variable is linearly related to the response variable
•  Basic assumption: The relation between input and output can be drawn as a line
  −  Actual relation drawn as a line: $Y \approx \beta_0 + \beta_1 X$
  −  Could be true or false, but a good starting point for analyzing CAT datasets
  −  Linear prediction from n observations: $\hat{y}_i = \beta_0 + \beta_1 x_i$ for $i = 1, \dots, n$
  −  Goal: Try to get predicted values as close as possible to actual values


Drawing the line

•  What is the line that best fits our observations?
  −  Must come up with predicted intercept and slope values β0 and β1
•  Least squares method: Minimize the sum of the squared errors between observed and predicted values
  −  Residual (error of one observation) is the difference between observed and predicted value: $e_i = y_i - \hat{y}_i$, where $\hat{y}_i$ is the predicted value
  −  Minimize RSS = Residual Sum of Squares when choosing β0 and β1: $\mathrm{RSS} = e_1^2 + e_2^2 + \dots + e_n^2$
  −  Good news: You’ll never have to do the calculation of β0 and β1 yourself
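
Although R does the calculation for you, it can help to see it once. The sketch below uses the standard closed-form least squares estimates for one input variable and compares them with lm(). The data values are hypothetical and chosen only for illustration; the names `b0`, `b1`, and `rss` are introduced here.

```r
# Hypothetical toy data (not from the course datasets)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Residuals and residual sum of squares for the fitted line
y_hat <- b0 + b1 * x
rss   <- sum((y - y_hat)^2)

# lm() minimizes the same RSS, so it should return the same two numbers
fit <- lm(y ~ x)
coef(fit)
c(b0, b1)
```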


The numbers for TV ad problem

•  Advertising dataset (from http://www-bcf.usc.edu/~gareth/ISL/data.html)
•  Predicted slope β1 = 0.0475
  −  Sales predicted to increase by 47.5 units of product for every $1,000 spent on TV advertising
•  Predicted intercept β0 = 7.03
  −  Sales without TV advertising predicted to be 7,030 units


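How good of a prediction?

•  Must validate linear model assumption, but how?
1. Residual Standard Error (RSE): Square root of the ratio of RSS to n − 2, where n is the number of observations: $\mathrm{RSE} = \sqrt{\mathrm{RSS}/(n-2)}$
•  RSE is an absolute measure of the lack of fit of the linear prediction (= 3.26 for TV ad data; prediction off by 3,260 units on average)
2. R² statistic: Normalized measure of fit (values between 0 and 1): Proportion of variability of Y that is explained by X: $R^2 = (\mathrm{TSS} - \mathrm{RSS})/\mathrm{TSS} = 1 - \mathrm{RSS}/\mathrm{TSS}$, where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares
Values close to 1 indicate high correlation; values close to 0 indicate low correlation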
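
Both statistics can be computed by hand from a fitted model. The sketch below assumes the ISLR definitions, RSE = sqrt(RSS/(n − 2)) and R² = 1 − RSS/TSS, and checks them against what summary() reports. The data are hypothetical toy values, not the Advertising dataset.

```r
# Hypothetical toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)

n   <- length(y)
rss <- sum(residuals(fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)    # total sum of squares

rse <- sqrt(rss / (n - 2))     # residual standard error
r2  <- 1 - rss / tss           # R-squared

# The same values as reported by summary()
summary(fit)$sigma
summary(fit)$r.squared
```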


Analyzing public datasets

•  Decide whether certain features may affect each other (e.g., urban pollution vs. population density)
•  Select features of interest (X and Y)
•  Regress one feature on the other, using R or another statistical analysis package
•  Check the null hypothesis (X and Y are not correlated)
  −  If null hypothesis is true, slope β1 will be zero or close to zero
  −  How close to zero?
  −  t-statistic: Slope β1 divided by its standard error, i.e., the distance of β1 from zero in normalized units
  −  p-value: Probability of observing a t-statistic at least this far from zero if the null hypothesis is true; reject the null hypothesis for p-value less than 5%
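
In R, the slope, its t-statistic, and its p-value can be read from the coefficient table of summary(). The following sketch uses simulated data with a known slope of 0.5; the data and the variable names are assumptions made for illustration only.

```r
# Simulated data: the true slope is 0.5, so the null hypothesis is false
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)

fit   <- lm(y ~ x)
coefs <- summary(fit)$coefficients   # one row per coefficient

slope   <- coefs["x", "Estimate"]
t_value <- coefs["x", "t value"]     # slope divided by its standard error
p_value <- coefs["x", "Pr(>|t|)"]

# Reject the null hypothesis (slope = 0) at the 5% level?
reject_null <- p_value < 0.05
reject_null
```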


The values for the TV ad dataset

•  Source: ISLR, Pages 68 and 69


The language R

•  Programming language for statistical computing and graphics
•  Named after the initial letter of its founders’ first names, Ross Ihaka and Robert Gentleman
•  Relatively easy syntax
•  Lots of built-in analysis methods (both for regression and classification)
•  Basic language has command-line interface; various GUI-based systems exist (e.g., Rattle, RStudio, etc.)
  −  GUI tools usually include command-line window
•  Target platform: standalone computer (vs. Hadoop)
•  Freely available on MS Windows, Linux, and Mac OS X platforms (GNU GPL terms)
  −  Quite extensible → Packages
•  Software, documentation and reference materials available at https://cran.r-project.org/


R: Basic commands

•  Most commands execute built-in and user-defined functions
•  Syntax: function_name(arg1, arg2, …)
  −  Example: sqrt is a 1-argument function returning the argument’s square root
  −  sqrt(9) → 3
•  Values returned by functions can be saved in variables
  −  x = sqrt(9)
  −  Now x equals 3
•  Function c() concatenates args into a vector of values, e.g.,
  −  c(10, 20, 30, 40) → 10 20 30 40
•  Functions length(), mean(), median(), var(), sd() take a vector of values and return its length, mean, median, sample variance, and sample standard deviation
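
The commands on this slide can be tried directly at the R prompt. A short sketch follows; it uses `<-` for assignment, which is the more common R idiom, although `=` as on the slide also works.

```r
x <- sqrt(9)              # x is now 3
v <- c(10, 20, 30, 40)    # vector of four values

length(v)   # 4
mean(v)     # 25
median(v)   # 25
var(v)      # sample variance (divides by n - 1)
sd(v)       # square root of var(v)
```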


R: Matrix commands

•  Matrix: A table of numbers (2-dimensional matrix)
  −  R representation of CAT spreadsheets
•  Create matrix with function:
matrix(elements, row_number, column_number)
•  Typically assign matrix to a variable to “remember” it
•  Matrix element access by values or sets of values for row and column
  −  Use name of matrix + row index and column index in square brackets, e.g.,
  −  y[3,2] returns second element in the third row of y
  −  Ranges possible for row and column index
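
A small sketch of matrix creation and indexing. The values 1 to 12 are arbitrary; note that matrix() fills the table column by column unless told otherwise.

```r
# 4 rows, 3 columns, filled column by column with 1..12
y <- matrix(1:12, 4, 3)

y[3, 2]        # second element in the third row: 7
y[1:2, ]       # rows 1 to 2, all columns
y[, c(1, 3)]   # all rows, columns 1 and 3
dim(y)         # 4 3
```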


R: Read data from spreadsheets

•  Function read.csv() loads spreadsheet into R
  −  Input: Comma-Separated Values (csv) spreadsheet
  −  Output: A data frame (a 2-dimensional table with named columns)
•  Function dim() returns dimensions
•  Function names() returns column names
•  Function cor() returns correlation index (in simple linear regression, its square equals R²)
•  Use dollar sign $ to denote column by symbolic name
  −  Syntax: data_frame_name$column_name
•  Alternatively,
  −  Use attach() function (makes the columns of a data frame accessible by name)
  −  Use numeric indices
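
To make these functions concrete, the sketch below writes a tiny CSV file to a temporary location and reads it back. The file contents and the column names `TV` and `Sales` are made up for illustration; with a real dataset you would pass the path of your own CSV file to read.csv().

```r
# Create a small hypothetical CSV so the example is self-contained
path <- tempfile(fileext = ".csv")
writeLines(c("TV,Sales", "100,12", "200,17", "300,21", "400,27"), path)

ads <- read.csv(path)   # returns a data frame

dim(ads)      # 4 2
names(ads)    # "TV" "Sales"
ads$Sales     # column selected by name
ads[, 2]      # same column selected by numeric index

r <- cor(ads$TV, ads$Sales)   # correlation between the two columns
r
```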


R: Graphic display tools

Function plot() opens window with scatter plot of 2 features

Function hist() shows histogram of 1 feature
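
A minimal sketch of both functions on simulated data. The `age` and `wage` values are randomly generated for illustration and are not the ISLR wage data; the axis labels and title are optional arguments.

```r
# Simulated data for illustration only
set.seed(42)
age  <- round(runif(100, 18, 65))
wage <- 20 + 0.8 * age + rnorm(100, sd = 10)

# Scatter plot of two features: x-axis variable first, y-axis variable second
plot(age, wage, xlab = "Age", ylab = "Wage")

# Histogram of one feature; the bin counts are also returned as a value
h <- hist(wage, main = "Distribution of wage", xlab = "Wage")
```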


R: Statistical learning tools

•  Function lm() computes linear model
  −  Funny syntax uses tilde character
var = lm(response_var~input1+input2)
•  Function summary(var) returns summary data
•  Function abline(var) draws the fitted regression line on the current plot (use after plot())
  −  Beware of switching response and predictor order between lm() and plot(): lm() takes response ~ predictor, while plot() takes the predictor (x-axis) first
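
The three functions are typically used together. The sketch below runs on simulated data; the names `tv` and `sales` and the coefficients are assumptions loosely modeled on the TV ad example, not the actual Advertising dataset.

```r
# Simulated data for illustration only
set.seed(7)
tv    <- runif(50, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(50, sd = 3)

fit <- lm(sales ~ tv)   # response on the left of ~, predictor on the right
summary(fit)            # coefficients, t-statistics, p-values, RSE, R-squared

plot(tv, sales)         # predictor first (x-axis), response second (y-axis)
abline(fit)             # adds the fitted regression line to the plot
```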


R: Statistical outputs


R: Some of your friends… use wisely

•  Help: Type function name preceded by question mark to get function documentation (e.g., ?lm, ?read.csv, etc.)
•  Function write.csv() saves an object to a file – Syntax: write.csv(object.name, "file.name")
•  Function subset() allows you to select rows and columns based on conditions on values stored, e.g.,
  −  selected.data = subset(original.data, RunTime >= 10 | RunTime < 5, select=c(RunTime, …))
  −  See http://www.statmethods.net/management/subset.html
•  Function merge() allows you to perform database JOIN operations on multiple spreadsheets
•  All the functions shown in the previous slides
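
The sketch below chains subset(), merge(), and write.csv() on two small made-up tables. The data frames `runs` and `labels` and their columns are hypothetical; only the `RunTime` condition mirrors the example above.

```r
# Two hypothetical tables sharing an Id column
runs   <- data.frame(Id = 1:5, RunTime = c(3, 7, 12, 4, 15))
labels <- data.frame(Id = c(1, 3, 5), Label = c("a", "b", "c"))

# Keep rows with RunTime >= 10 or RunTime < 5, and only the listed columns
selected <- subset(runs, RunTime >= 10 | RunTime < 5, select = c(Id, RunTime))

# JOIN the two tables on the shared Id column (only matching rows are kept)
joined <- merge(selected, labels, by = "Id")

# Save the result to a CSV file (a temporary file here)
out <- tempfile(fileext = ".csv")
write.csv(joined, out, row.names = FALSE)
```

By default merge() keeps only rows whose key appears in both tables; see ?merge for the arguments that change this behavior.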


References

•  ISLR: http://www-bcf.usc.edu/~gareth/ISL/

•  R Language System: https://www.r-project.org/

•  Hadoop framework: http://hadoop.apache.org/

•  Advertising dataset: http://www-bcf.usc.edu/~gareth/ISL/data.html

•  Nice R GUI #1: https://rattle.togaware.com (Rattle – runs on Windows or Linux)

•  Nice R GUI #2: https://www.rstudio.com