India software developers conference 2013 Bangalore

Data Science 101: Using R Language to get Big Insights

Satnam Singh,

Senior Chief Engineer,

Samsung Research India – Bangalore

[Twitter - @satnam74s]

India Software Developers Conference, Bangalore

March 16, 2013


Motivation: Using Data to get Business Insights

[Figure: Databases & Clusters → Insights?]

Ref. [kaggle.com]

Data Science Programming Languages

Why R?
• Popular, free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language

Ref [http://www.r-project.org]

R Language Basics

Simple Operations

> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233

Vector Operations

> y <- c(1,2,3,4)
> y
[1] 1 2 3 4

Function Calls
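The slide names vector operations and function calls without showing them in action. The following is a minimal sketch of both, reusing the vector y from above; the helper name square is hypothetical and chosen only for illustration.

```r
# Vectorized arithmetic: the operation applies to every element, no loop needed
y <- c(1, 2, 3, 4)
doubled <- y * 2                   # element-wise multiplication
shifted <- y + c(10, 20, 30, 40)   # element-wise addition of two vectors

# Calls to built-in functions that take a whole vector
total   <- sum(y)
average <- mean(y)

# A user-defined function (hypothetical name) applied to the whole vector
square  <- function(v) v^2
squared <- square(y)

print(doubled)
print(shifted)
print(c(total = total, average = average))
print(squared)
```

Because arithmetic in R is vectorized, the same expression works unchanged on a single number or on a vector of any length.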


R Language: Data Structures Examples

• Data frame

> MyFamilyage <- c(5,6,40,38)
> MyFamilyName <- c("Sat","Veera","Minu","Dummy")
> MyFamilyweight <- c(72,70,12,40)
> MyFamily <- data.frame(MyFamilyName,MyFamilyage,MyFamilyweight)

• Matrix

> MyMatrix <- as.matrix(MyFamilyage)
> Mydataframe <- as.data.frame(MyMatrix)

• List

> MyList <- as.list(Mydataframe)
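As a rough illustration of how these three structures behave once created, the sketch below rebuilds them and inspects each one. The age threshold of 18 is an arbitrary value picked for the example, not something taken from the slides.

```r
# Rebuild the example objects from the slide
MyFamilyName   <- c("Sat", "Veera", "Minu", "Dummy")
MyFamilyage    <- c(5, 6, 40, 38)
MyFamilyweight <- c(72, 70, 12, 40)
MyFamily <- data.frame(MyFamilyName, MyFamilyage, MyFamilyweight)

# Data frame: columns are accessed by name, rows can be filtered by a condition
str(MyFamily)
adults <- MyFamily[MyFamily$MyFamilyage >= 18, ]   # threshold is illustrative
print(adults)

# Matrix: a plain vector becomes a one-column matrix
MyMatrix <- as.matrix(MyFamilyage)
print(dim(MyMatrix))

# List: converting a data frame yields one list element per column
Mydataframe <- as.data.frame(MyMatrix)
MyList <- as.list(Mydataframe)
print(names(MyList))
```

A data frame can hold columns of different types, whereas a matrix holds a single type; that is why the mixed name/age/weight data goes into a data frame.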


Case Study: Activity Recognition

• Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.

Example of Accelerometer Data

Smartphone’s Accelerometer Sensor

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar


Data Analysis - Steps

Time Series Data: 200 samples (10 sec)

Feature Extraction: 43 Features
• Mean for each acc. axis (3)
• Std. dev. for each acc. axis (3)
• Avg. abs. diff. from mean for each acc. axis (3)
• Avg. resultant acc. (1)
• Histogram (30)

Classifiers
• CART: Decision Tree
• RF: Random Forest

Classify the Activity

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
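The feature-extraction step can be sketched roughly as follows for a single 200-sample window. This is not the cited authors' code: the function name extract_features, the simulated data, and the choice of 10 equal-width histogram bins per axis are all assumptions made for illustration. The slide quotes 43 features in total; the groups itemized on it add up to 40, and this sketch covers only those itemized groups.

```r
# Hypothetical helper: summary features for one window of 3-axis accelerometer data.
# acc is a numeric matrix with one row per sample and columns x, y, z.
extract_features <- function(acc) {
  avg       <- colMeans(acc)                          # mean per axis (3)
  std_dev   <- apply(acc, 2, sd)                      # standard deviation per axis (3)
  abs_dev   <- colMeans(abs(sweep(acc, 2, avg)))      # avg. absolute difference from mean (3)
  resultant <- mean(sqrt(rowSums(acc^2)))             # avg. resultant acceleration (1)
  # Histogram: fraction of samples in each of 10 equal-width bins per axis (30).
  # Assumes each axis has a non-zero range within the window.
  bins <- apply(acc, 2, function(v) {
    rng <- range(v)
    idx <- pmin(floor((v - rng[1]) / (rng[2] - rng[1]) * 10) + 1, 10)
    tabulate(idx, nbins = 10) / length(v)
  })
  c(avg = avg, sd = std_dev, absdev = abs_dev,
    resultant = resultant, hist = as.vector(bins))
}

# Simulated window standing in for real sensor data: 200 samples, 3 axes
set.seed(42)
window <- matrix(rnorm(200 * 3), ncol = 3,
                 dimnames = list(NULL, c("x", "y", "z")))
features <- extract_features(window)
print(length(features))
print(round(features[1:10], 3))
```

Each window of raw samples is reduced to one fixed-length feature vector, which is what the classifiers are trained on.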

Data Visualization – Activity (Class Variable)

[Ref] Rattle R Data Mining Tool

require(gplots, quietly=TRUE)      # provides barplot2()
require(colorspace, quietly=TRUE)  # provides rainbow_hcl()

ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))

ord <- order(ds[1,], decreasing=TRUE)

Bar Plot

bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
               ylim=c(0, 2497), col=rainbow_hcl(7))

Dot Plot

dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
         xlab="Frequency", ylab="class", pch=c(1:6, 19))

Data Visualization Example – Variable Yavg

ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))

bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7),
              xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)

require(doBy, quietly=TRUE)

points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)

hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
           col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)


Correlation Plot

• Easy to interpret
• Blue: Positive correlation
• Red: Negative correlation

require(ellipse, quietly=TRUE)

crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")

crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]

print(crs$cor)

plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
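The colour expression in the plotcorr call maps each correlation in [-1, 1] to one of 11 palette entries: 5 * r + 6 gives 1 for r = -1 (red), 6 for r = 0 (white) and 11 for r = +1 (blue). The sketch below walks through that mapping on simulated data using base R only; the data frame dat and its columns are made up for the example.

```r
# Simulated data: b is positively correlated with a, c is negatively correlated
set.seed(1)
x   <- rnorm(100)
dat <- data.frame(a = x,
                  b = x + rnorm(100, sd = 0.1),
                  c = -x + rnorm(100, sd = 0.1))

# Pearson correlation matrix, reordered by the first row as in the slide
cor_mat <- cor(dat, use = "pairwise", method = "pearson")
ord     <- order(cor_mat[1, ])
cor_mat <- cor_mat[ord, ord]
print(round(cor_mat, 3))

# 11-step palette running from red through white to blue
palette11 <- colorRampPalette(c("red", "white", "blue"))(11)

# Palette index for the extreme and neutral correlations
idx_extremes <- 5 * c(-1, 0, 1) + 6
print(idx_extremes)

# Index for every cell; R truncates fractional indices when subsetting
cell_colours <- palette11[5 * cor_mat + 6]
print(matrix(cell_colours, nrow = nrow(cor_mat), dimnames = dimnames(cor_mat)))
```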

Data Science R Packages

             Functions     Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
             kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
             rpart         rpart         Recursive partitioning and regression trees
             ksvm          kernlab       Support Vector Machine
             apriori       arules        Rule based classification
Ensemble     ada           ada           Stochastic boosting
             randomForest  randomForest  Random Forests classification and regression
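To give a feel for the two clustering functions from the stats library, here is a small sketch on R's built-in iris data. The dataset, the choice of three clusters, and the "average" linkage are assumptions for the example; none of them come from the talk.

```r
# Built-in example data stands in for a real feature table
data(iris)
features <- scale(iris[, 1:4])   # standardize the four numeric columns

# K-means clustering with 3 centers; nstart repeats with random starts
set.seed(123)
km <- kmeans(features, centers = 3, nstart = 10)

# Hierarchical clustering on Euclidean distances, cut into 3 groups
hc        <- hclust(dist(features), method = "average")
hc_groups <- cutree(hc, k = 3)

# Compare each clustering with the known species labels
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```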

Decision Tree - Visualization


• Decision Tree Model Results:

n= 3792

 1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
   2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
     4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
     5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
   3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
     6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)

Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG

Root node error: 2364/3792 = 0.62342

Decision Tree

rpart(formula = class ~ ., data = smartphone_data, method = "class", parms = list(split = "information"), control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
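Since smartphone_data is not included with the slides, the same rpart call can be tried on R's built-in iris data as a stand-in (an assumption made purely so the sketch runs). In the printed tree each line reads: node number, split, n, loss, predicted class, class probabilities; an asterisk marks a terminal node.

```r
library(rpart)

# Same settings as the call above, with iris standing in for smartphone_data
fit <- rpart(formula = Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

# Text form of the tree
print(fit)

# Predicted class for each training row, and the resulting training accuracy
pred      <- predict(fit, iris, type = "class")
train_acc <- mean(pred == iris$Species)
print(train_acc)
```

Accuracy on the training rows is optimistic; a held-out test set gives a fairer estimate.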

Random Forest: Ensemble of Trees


[Figure: Tree 1, Tree 2, …, Tree n combined (Σ) into the Random Forest]

• Random Forest Model Results:

Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
           Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs        204       7       0        1       64      97  0.45308311
Jogging             6    1117       0        0        8       7  0.01845343
Sitting             0       0     209        5        1       0  0.02790698
Standing            4       0       0      177        4       0  0.04324324
Upstairs           48      31       1        0      276      97  0.39072848
Walking            20       1       1        1       15    1390  0.02661064

Random Forest Package in R

randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6, importance = TRUE, replace = FALSE, na.action = na.roughfix)
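The class.error column and the OOB error rate reported in the model results follow directly from the confusion matrix. The sketch below re-enters the counts from the slide (rows taken as the actual class, columns as the predicted class) and recomputes both using base R only.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Counts copied from the confusion matrix in the model results
conf <- matrix(c(204,    7,   0,   1,  64,   97,
                   6, 1117,   0,   0,   8,    7,
                   0,    0, 209,   5,   1,    0,
                   4,    0,   0, 177,   4,    0,
                  48,   31,   1,   0, 276,   97,
                  20,    1,   1,   1,  15, 1390),
               nrow = 6, byrow = TRUE,
               dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: share of all observations off the diagonal
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

Downstairs and Upstairs have by far the highest error rates, and the matrix shows they are mostly confused with each other and with Walking.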

Summary

• Fusing data science with domain knowledge enables big insights from data

• The R language provides a platform to rapidly build prototypes and test ideas

• Getting insights from data is the outcome of intense teamwork among various stakeholders

References

• R Project: http://www.r-project.org

• Activity Recognition Dataset: “The Impact of Personalization on Smartphone-Based Activity Recognition”, Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05

• “Activity and Gait Recognition with Time-Delay Embeddings”, Jordan Frank, AAAI Conference on Artificial Intelligence, 2010

• R wiki: http://rwiki.sciviews.org/doku.php

• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php

• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/

• Rattle, R Data Mining Tool: http://rattle.togaware.com/

• Sensor Platforms: http://www.sensorplatforms.com/context-aware/

• Movea: http://www.movea.com/

• Alohar: https://www.alohar.com
