
Introduction to R for Data Science :: Session 4

Page 1: Introduction to R for Data Science :: Session 4

Introduction to R for Data Science

Lecturers

dipl. ing Branko Kovač

Data Analyst at CUBE / Data Science Mentor at Springboard

Data Science zajednica Srbije

[email protected]

dr Goran S. Milovanović

Data Scientist at DiploFoundation

Data Science zajednica Srbije

[email protected]

[email protected]

Page 2: Introduction to R for Data Science :: Session 4

Control Flow in R

• for, while, repeat

• if, else

• switch

Intro to R for Data Science

Session 4: Control Flow

# Introduction to R for Data Science

# SESSION 4 :: 19 May, 2016

# Starting with a simple 'if'
num <- 2 # some value to test with
if (num > 0) print("num is positive") # if the condition num > 0 holds, then print() is executed

# Sometimes 'if' has its 'else'
if (num > 0) { # test to see if it's positive
  print("num is positive") # print in case of a positive number
} else {
  print("num is not positive") # it's zero or negative if not positive
}
# Careful: place your else right after the end ('}') of the conditional block
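The two-branch test above lumps zero together with the negative numbers. A minimal sketch of an `else if` chain that separates the three cases follows; `sign_label` is a hypothetical helper name, not part of the session script.

```r
# Hypothetical helper (not from the slides): label a single number by its sign.
# if () expects a single TRUE/FALSE, so the function handles one value at a time.
sign_label <- function(num) {
  if (num > 0) {
    "positive"
  } else if (num < 0) {
    "negative"
  } else {
    "zero"
  }
}

# Apply the scalar helper to several values, one at a time
labels <- vapply(c(2, -3, 0), sign_label, character(1))
print(labels)
```

Note that every `else` sits on the same line as the closing `}` of the preceding block, as the slide advises.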

Page 3: Introduction to R for Data Science :: Session 4

Vectorized: ifelse

• for, while, repeat

• if, else, ifelse

• switch

# R is vectorized, so there's a vectorized if-else
simple_vect <- c(1, 3, 12, NA, 2, NA, 4) # just another numeric vector with NAs
ifelse(is.na(simple_vect), "nothing here", "some number") # "nothing here" if it's an NA, "some number" otherwise
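One detail worth seeing in action: when the test itself evaluates to NA, `ifelse()` returns NA at that position. The sketch below reuses `simple_vect` from the slide; the threshold of 2 and the labels are arbitrary choices made for illustration.

```r
simple_vect <- c(1, 3, 12, NA, 2, NA, 4)

# The comparison is NA wherever simple_vect is NA, and so is the result
size <- ifelse(simple_vect > 2, "big", "small")
print(size)

# Nesting ifelse() lets us handle the NA positions explicitly
size_or_missing <- ifelse(is.na(simple_vect), "missing",
                          ifelse(simple_vect > 2, "big", "small"))
print(size_or_missing)
```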

Page 4: Introduction to R for Data Science :: Session 4

For loops: slow and slower

# A for loop always works the same way
for (i in simple_vect) print(i)

# Be aware that loops can be slow if you grow an object inside them
vec <- numeric()
system.time(
  for (i in seq_len(50000 - 1)) {
    some_calc <- sqrt(i / 10)
    # this is what makes it slow:
    vec <- c(vec, some_calc)
  })

# This solution is slightly faster
iter <- 50000
# this makes it faster:
vec <- numeric(length = iter)
system.time(
  for (i in seq_len(iter - 1)) {
    some_calc <- sqrt(i / 10)
    vec[i] <- some_calc # ...not this!
  })

Page 5: Introduction to R for Data Science :: Session 4

For loops: slow and slower

# This solution is even faster
iter <- 50000
vec <- numeric(length = iter) # not because of this...
system.time(
  for (i in seq_len(iter - 1)) {
    vec[i] <- sqrt(i / 10) # ...but because of this!
  })
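To compare the approaches side by side, the sketch below wraps a growing loop, a preallocated loop, and a fully vectorized call in three functions and checks that they return the same values. The function names and the smaller `n` are assumptions made for this illustration, and the timings will differ from machine to machine.

```r
n <- 10000 # smaller than the slides' 50000 so the sketch runs quickly

# Grows the result by copying it on every iteration (the slow pattern)
grow <- function(n) {
  vec <- numeric()
  for (i in seq_len(n)) vec <- c(vec, sqrt(i / 10))
  vec
}

# Allocates the full result once, then fills it in place
prealloc <- function(n) {
  vec <- numeric(length = n)
  for (i in seq_len(n)) vec[i] <- sqrt(i / 10)
  vec
}

# No explicit loop at all: sqrt() and "/" work on whole vectors
vectorized <- function(n) sqrt(seq_len(n) / 10)

print(system.time(a <- grow(n)))
print(system.time(b <- prealloc(n)))
print(system.time(v <- vectorized(n)))

# All three strategies compute the same numbers
stopifnot(isTRUE(all.equal(a, b)), isTRUE(all.equal(b, v)))
```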

Page 6: Introduction to R for Data Science :: Session 4

For loops vs. vectorized functions

# Another example of how loops can be slow
# (loop vs vectorized functions)
iter <- 50000
vec <- numeric(length = iter) # preallocate, as in the previous example
system.time(for (i in 1:iter) {
  vec[i] <- rnorm(n = 1, mean = 0, sd = 1) # approach from the previous example
})
system.time(y <- rnorm(iter, 0, 1)) # but this is much, much faster

Page 7: Introduction to R for Data Science :: Session 4

while, repeat…

# R also knows about the while loop
r <- 1 # initializing some variable
while (r < 5) { # while r < 5
  print(r) # print r
  r <- r + 1 # increase r by 1
}

# Nope, we didn't forget the 'repeat' loop
i <- 1
repeat { # there is no condition!
  print(i)
  i <- i + 1
  if (i == 10) break # ...so we have to break it if we
                     # don't want an infinite loop
}
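Besides `break`, R loops also understand `next`, which skips the rest of the current iteration. A small sketch, with made-up variable names, that sums the odd numbers from 1 to 10 inside a `repeat` loop:

```r
i <- 0
odd_total <- 0
repeat {
  i <- i + 1
  if (i > 10) break     # leave the loop once we are past 10
  if (i %% 2 == 0) next # skip even numbers, jump to the next iteration
  odd_total <- odd_total + i
}
print(odd_total) # 1 + 3 + 5 + 7 + 9 = 25
```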

Page 8: Introduction to R for Data Science :: Session 4

switch

switch(2, "data", "science", "serbia") # choose one option based on value

# More on switch:
switchIndicator <- "A"
# switchIndicator <- "switchIndicator"
# switchIndicator <- "AvAvAv"
# play with these three conditions

# one of the rare situations where you do not need to enclose strings in ' ' or " "
switch(switchIndicator,
       A = {print(switchIndicator)},
       switchIndicator = {unlist(strsplit(switchIndicator, "h"))},
       AvAvAv = {print(nchar(switchIndicator))}
)
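When the first argument is a character string, `switch()` matches it against the names of the choices; an empty choice falls through to the next one, and a single unnamed choice at the end acts as the default. A minimal sketch, where `describe_token` and its return strings are hypothetical:

```r
# Hypothetical helper (not from the slides)
describe_token <- function(token) {
  switch(token,
         data = ,                # empty: falls through to the next choice
         science = "known word",
         serbia = "a country",
         "unknown")              # unnamed choice: default for character input
}

print(describe_token("data"))
print(describe_token("serbia"))
print(describe_token("something else"))
```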

Page 9: Introduction to R for Data Science :: Session 4

switch()

type = 2
cc <- c("A", "B", "C")
switch(type,
       c1 = {print(cc[1])},
       c2 = {print(cc[2])},
       c3 = {print(cc[3])},
       {print("Beyond C...")} # default choice
);
# However…

Page 10: Introduction to R for Data Science :: Session 4

switch()

# be careful with switch: with a numeric first argument the choices are picked
# by position, so there is no true default choice:
type = 4
cc <- c("A", "B", "C")
switch(type,
       print(cc[1]),
       print(cc[2]),
       print(cc[3]),
       {print("Beyond C...")} # reached only because type is 4; for type = 5
                              # switch() silently returns NULL. An unnamed default
                              # works only when type is a character string!
)
# switch is faster than if… else… (!)
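The behaviour of `switch()` with a numeric selector can be checked directly: the selector is used as a position, and a position past the last choice yields NULL rather than a default. The helpers `pick` and `pick_safe` below are hypothetical names used only for this sketch.

```r
cc <- c("A", "B", "C")

# Numeric selector: choices are taken by position
pick <- function(type) {
  switch(type, cc[1], cc[2], cc[3], "Beyond C...")
}

print(pick(2))          # second choice
print(pick(4))          # fourth choice, selected by position
print(is.null(pick(5))) # nothing at position 5, so the result is NULL

# One way to get a real default for numeric input: test the range yourself
pick_safe <- function(type) {
  if (type >= 1 && type <= length(cc)) cc[type] else "Beyond C..."
}
print(pick_safe(5))
```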

Page 11: Introduction to R for Data Science :: Session 4

Vectorization

### vectorization in R
dataSet <- USArrests;
# dataSet$Murder, dataSet$Assault, dataSet$Rape: columns of dataSet
# in behavioral sciences (psychology or biomedical sciences, for example) we would call them:
# variables (or factors, even more often)
# in data science and machine learning, we usually call them: FEATURES
# in psychology and behavioral sciences, the usage of the term "feature" is usually constrained
# to theories of categorization and concept learning

# Task: classify the US states according to some global indicator of violent crime
# Two categories (simplification): more dangerous (T) and less dangerous (F)
# We have three features: Murder, Rape, Assault, all per 100,000 inhabitants
# The idea is to combine the three available features.
# Let's assume that we arbitrarily assign the following preference order over the features:
# Murder > Rape > Assault
# in terms of the severity of the consequences of the associated criminal acts

Page 12: Introduction to R for Data Science :: Session 4

Vectorization

# Let's first isolate the features from the data.frame
featureMatrix <- as.matrix(dataSet[, c(1,4,2)]);
# Let's WEIGHT the features in accordance with the imposed preference order:
weigthsVector <- c(3,2,1); # mind the order of the columns in featureMatrix
# Essentially, we want our global indicator to be a linear combination of all three selected
# features, where each feature is weighted by the corresponding element of the weigthsVector:
featureMatrix <- cbind(featureMatrix,numeric(length(featureMatrix[,1])));
for (i in 1:length(featureMatrix[,1])) {
  featureMatrix[i,4] <- sum(weigthsVector*featureMatrix[i,1:3]);
  # don't forget: this "*" multiplication in R is vectorized and operates element-wise
  # we have a 1x3 weigthsVector and a 1x3 featureMatrix[i,1:3], Ok
  # sum() then produces the desired linear combination
}

Page 13: Introduction to R for Data Science :: Session 4

Vectorization

# Classification; in the simplest case, let's simply take a look at
# the distribution of our global indicator:
hist(featureMatrix[,4],20);
# it's multimodal and not too symmetric; go for median
criterion <- median(featureMatrix[,4]);
# And classify:
dataSet$Dangerous <- ifelse(featureMatrix[,4]>=criterion,T,F);
# Ok. You will never do this before you have a model that has actually *learned* the
# most adequate feature weights. This is an exercise only.

# ***Important***: have you seen the for loop above? Well...
# Never do that.
dataSet$Dangerous <- NULL;

Page 14: Introduction to R for Data Science :: Session 4

Vectorization

# In Data Science, you will be working with huge amounts of quantitative data.
# For loops are slow. But in vector programming languages like R...
# matrix computations are seriously fast.
# What you ***want to do*** is the following:

# Let's first isolate the features from the data.frame
featureMatrix <- as.matrix(dataSet[, c(1,4,2)]);
# Let's WEIGHT the features in accordance with the imposed preference order:
weigthsVector <- c(3,2,1); # mind the order of the columns in featureMatrix
# Feature weighting:
wF <- weigthsVector %*% t(featureMatrix);
# In R, t() is for: transpose
# In R, %*% is matrix multiplication

Page 15: Introduction to R for Data Science :: Session 4

Vectorization

# oh yes: R knows about row and column vectors - and you want to put this one
# as a COLUMN in your dataSet data.frame, while wF is currently a ROW vector, look:
wF
length(wF)
wF <- t(wF)
# and classify:
dataSet$Dangerous <- ifelse(wF>=median(wF),T,F);
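As a sanity check, the loop-based and the matrix-based weighting can be compared directly. The sketch below is one possible arrangement, not the session's own script: it selects the columns by name, multiplies the matrix by the weights from the right (so no transpose is needed), and drops the result to a plain vector with `as.vector()`. The names `weights`, `loopScore`, and `matScore` are assumptions.

```r
dataSet <- USArrests
featureMatrix <- as.matrix(dataSet[, c("Murder", "Rape", "Assault")])
weights <- c(3, 2, 1) # same arbitrary preference order as in the slides

# Row-by-row version, as in the for loop
loopScore <- numeric(nrow(featureMatrix))
for (i in seq_len(nrow(featureMatrix))) {
  loopScore[i] <- sum(weights * featureMatrix[i, ])
}

# Matrix version: (50 x 3) %*% (3 x 1) gives a 50 x 1 column
matScore <- as.vector(featureMatrix %*% weights)

# Both routes give the same global indicator
stopifnot(isTRUE(all.equal(loopScore, matScore)))

# The comparison is already logical, so ifelse(..., T, F) is optional here
dataSet$Dangerous <- matScore >= median(matScore)
print(table(dataSet$Dangerous))
```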

Page 16: Introduction to R for Data Science :: Session 4