Data Science 101: Using R Language to get Big Insights Satnam Singh, Senior Chief Engineer, Samsung Research India – Bangalore [ Twitter - @satnam74s] India Software Developers Conference, Bangalore March 16, 2013


Page 1: India software developers conference 2013 Bangalore

Data Science 101: Using R Language to get Big Insights

Satnam Singh,

Senior Chief Engineer,

Samsung Research India – Bangalore

[ Twitter - @satnam74s]

India Software Developers Conference, Bangalore

March 16, 2013

Page 2: India software developers conference 2013 Bangalore


Motivation: Using Data to get Business Insights

[Figure: three "Databases & Clusters" sources, each leading to the question "Insights?"]

Page 3: India software developers conference 2013 Bangalore

Ref. [kaggle.com]

Data Science Programming Languages

Why R?
• Popular, free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language

Ref [http://www.r-project.org]

Page 4: India software developers conference 2013 Bangalore

R Language Basics

Simple Operations

> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233

Vector Operations

> y <- c(1,2,3,4)
> y
[1] 1 2 3 4

Function Calls
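A minimal sketch of the three kinds of statements named on this slide. The variable names beyond `y` and `z` are illustrative and do not come from the slides.

```r
# Simple operations: assignment and arithmetic on scalars
y <- 21
z <- 233
total <- y + z

# Vector operations: arithmetic is applied element-wise (vectorization)
v <- c(1, 2, 3, 4)
doubled <- v * 2
squares <- v^2

# Function calls: built-in functions take whole vectors as arguments
v_mean <- mean(v)
v_sum  <- sum(v)

print(total)
print(doubled)
print(squares)
print(v_mean)
print(v_sum)
```

No loop is needed for `v * 2` or `v^2`; this is the vectorization listed under "Why R?".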

Page 5: India software developers conference 2013 Bangalore


R Language: Data Structures Examples

• Data frame

> MyFamilyage <- c(5,6,40,38)
> MyFamilyName <- c("Sat","Veera","Minu","Dummy")
> MyFamilyweight <- c(72,70,12,40)
> MyFamily <- data.frame(MyFamilyName,MyFamilyage,MyFamilyweight)

• Matrix

> MyMatrix <- as.matrix(MyFamilyage)
> Mydataframe <- as.data.frame(MyMatrix)

• List

> MyList <- as.list(Mydataframe)
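The following sketch shows how the three structures on this slide can be built and inspected. The values are copied from the slide; the accessor variables (`ages`, `adults`, `dims`, `first_name`) and the age threshold of 18 are illustrative assumptions.

```r
# Build the data frame from the slide (values copied from the slide)
MyFamilyName   <- c("Sat", "Veera", "Minu", "Dummy")
MyFamilyage    <- c(5, 6, 40, 38)
MyFamilyweight <- c(72, 70, 12, 40)
MyFamily <- data.frame(MyFamilyName, MyFamilyage, MyFamilyweight,
                       stringsAsFactors = FALSE)

# Data frame: select a column by name, select rows by a logical condition
ages   <- MyFamily$MyFamilyage
adults <- MyFamily[MyFamily$MyFamilyage >= 18, ]

# Matrix: a numeric vector becomes a one-column matrix
MyMatrix <- as.matrix(MyFamilyage)
dims <- dim(MyMatrix)

# List: each data frame column becomes one list element
MyList <- as.list(MyFamily)
first_name <- MyList$MyFamilyName[1]

print(adults)
print(dims)
print(first_name)
```

A data frame can hold columns of different types, while a matrix holds a single type; that is why the slide converts only the numeric age vector to a matrix.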

Page 6: India software developers conference 2013 Bangalore


Case Study: Activity Recognition

• Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.

Example of Accelerometer data

Smartphone’s Accelerometer Sensor

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar

Page 7: India software developers conference 2013 Bangalore


Data Analysis - Steps

Time Series Data: 200 samples (10 sec)

Feature Extraction: 43 Features
• Mean for each acc. axis (3)
• Std. dev. for each acc. axis (3)
• Avg. abs. diff. from mean for each acc. axis (3)
• Avg. resultant acc. (1)
• Histogram (30)

Classifiers
• CART: Decision Tree
• RF: Random Forest

Classify the Activity

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
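The summary features listed on this slide can be sketched in base R as follows. This is an illustration under stated assumptions, not the authors' extraction code: the window is synthetic random data, `extract_features` is a hypothetical helper, the average resultant acceleration is taken here as the mean magnitude of the three-axis vector, and the 30 histogram features are omitted because the slides do not give the binning.

```r
# Synthetic 10-second window: 200 samples per axis (an assumption standing in
# for real accelerometer readings)
set.seed(1)
window <- data.frame(x = rnorm(200, 0, 2),
                     y = rnorm(200, 9.8, 3),
                     z = rnorm(200, 0, 1))

# Hypothetical helper: summary features for one window
extract_features <- function(w) {
  axes   <- toupper(names(w))
  avg    <- sapply(w, mean)                                # mean per axis (3)
  stddev <- sapply(w, sd)                                  # std. dev. per axis (3)
  absdev <- sapply(w, function(a) mean(abs(a - mean(a))))  # avg. abs. diff. from mean (3)
  result <- mean(sqrt(w$x^2 + w$y^2 + w$z^2))              # avg. resultant acceleration (1)
  c(setNames(avg,    paste0(axes, "AVG")),
    setNames(stddev, paste0(axes, "STANDDEV")),
    setNames(absdev, paste0(axes, "ABSOLDEV")),
    RESULTANT = result)
}

features <- extract_features(window)
print(round(features, 3))
```

The feature names follow the pattern of the variable names that appear later in the slides (`YAVG`, `YSTANDDEV`, `YABSOLDEV`, `RESULTANT`). One such feature vector per window, together with the activity label, forms one row of the classifier's training data.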

Page 8: India software developers conference 2013 Bangalore

Data Visualization – Activity (Class Variable)

[Ref] Rattle R Data Mining Tool

Bar Plot

require(gplots, quietly=TRUE)      # provides barplot2()
require(colorspace, quietly=TRUE)  # provides rainbow_hcl()

ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))

ord <- order(ds[1,], decreasing=TRUE)

bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
               ylim=c(0, 2497), col=rainbow_hcl(7))

Dot Plot

dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
         xlab="Frequency", ylab="class", pch=c(1:6, 19))

Page 9: India software developers conference 2013 Bangalore

Data Visualization Example – Variable Yavg

require(colorspace, quietly=TRUE)  # provides rainbow_hcl()

ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))

bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class",
              ylab="YAVG", varwidth=TRUE, notch=TRUE)

require(doBy, quietly=TRUE)

points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)

hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
           col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)

[Ref] Rattle R Data Mining Tool
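The per-class summaries behind the box plot can also be computed with base R alone. The sketch below is illustrative: the data frame `toy` and its values are synthetic stand-ins for the `YAVG` column and the activity labels, and only three of the classes are simulated.

```r
# Synthetic stand-in for the YAVG column and activity labels (assumed values)
set.seed(42)
activity <- factor(rep(c("Jogging", "Sitting", "Walking"), each = 50))
yavg <- c(rnorm(50, 5, 3), rnorm(50, 2, 0.5), rnorm(50, 9, 2))
toy <- data.frame(dat = yavg, grp = activity)

# Per-class means: a base R counterpart of summaryBy(dat ~ grp, FUN = mean)
class_means <- tapply(toy$dat, toy$grp, mean, na.rm = TRUE)

# Box plot statistics without drawing: one column of five values per class
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
bp <- boxplot(dat ~ grp, data = toy, plot = FALSE)

print(round(class_means, 2))
print(bp$names)
print(round(bp$stats, 2))
```

Classes whose boxes barely overlap on a variable are easy to separate with a single threshold on that variable, which is what a decision tree split does.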

Page 10: India software developers conference 2013 Bangalore

Correlation Plot

• Easy to interpret

Blue: Positive correlation

Red: Negative correlation

[Ref] Rattle R Data Mining Tool

require(ellipse, quietly=TRUE)

crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")

crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]

print(crs$cor)

plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
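The colour argument in the `plotcorr` call relies on a small piece of arithmetic: `5*r + 6` maps a correlation `r` in [-1, 1] onto an index in [1, 11] of an 11-step red-white-blue palette. The sketch below works through that mapping for a few hand-picked correlation values; the names `palette11`, `r`, `idx` and `cols` are illustrative.

```r
# 11-step palette from red (negative) through white to blue (positive)
palette11 <- colorRampPalette(c("red", "white", "blue"))(11)

# Correlation values chosen to illustrate the mapping
r <- c(-1, -0.5, 0, 0.5, 1)

# 5 * r + 6 maps [-1, 1] onto [1, 11]; R truncates fractional indices
idx  <- 5 * r + 6
cols <- palette11[idx]

print(data.frame(r = r, index = trunc(idx), colour = cols))
```

A correlation of -1 lands on pure red, 0 on white, and +1 on pure blue, which matches the legend on the slide.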

Page 11: India software developers conference 2013 Bangalore

Data Science R Packages

Category     Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
             kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
             rpart         rpart         Recursive partitioning and regression trees
             ksvm          kernlab       Support Vector Machine
             apriori       arules        Rule-based classification
Ensemble     ada           ada           Stochastic boosting
             randomForest  randomForest  Random Forests classification and regression
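As a brief illustration of the two clustering functions from the `stats` package, the sketch below runs `kmeans` and `hclust` on R's built-in `iris` data. Using `iris` in place of the activity features, and the choice of 3 clusters, are assumptions made so the example is self-contained.

```r
# Built-in iris measurements stand in for the activity features (an assumption)
x <- scale(iris[, 1:4])

# kmeans (stats): partition the rows into 3 clusters
set.seed(7)
km <- kmeans(x, centers = 3, nstart = 10)

# hclust (stats): agglomerative clustering on Euclidean distances, cut into 3 groups
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Cross-tabulate each clustering against the known species
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```

Both functions ship with base R, so no extra installation is needed; the classifier and ensemble packages other than `glm` must be installed separately where they are not bundled.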

Page 12: India software developers conference 2013 Bangalore

Decision Tree - Visualization

[Ref] Rattle R Data Mining Tool

Page 13: India software developers conference 2013 Bangalore

Decision Tree

rpart(formula = class ~ ., data = smartphone_data, method = "class",
      parms = list(split = "information"),
      control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

• Decision Tree Model Results:

n= 3792

 1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
   2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
     4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
     5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
   3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
     6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)

Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG

Root node error: 2364/3792 = 0.62342
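To try the same call shape without the smartphone dataset, the sketch below fits a classification tree to R's built-in `iris` data. Substituting `iris` for `smartphone_data` is an assumption made for a self-contained example; the `parms` and `control` arguments are the ones shown on the slide.

```r
library(rpart)

# Same call shape as on the slide, with iris standing in for smartphone_data
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

# Node listing in the same layout as the model results on the slide
print(fit)

# Resubstitution accuracy (optimistic: measured on the training data)
pred <- predict(fit, iris, type = "class")
accuracy <- mean(pred == iris$Species)
print(accuracy)

# Root node error: share of rows outside the majority class at the root
root_error <- 1 - max(table(iris$Species)) / nrow(iris)
print(root_error)
```

Each line of the printed tree reads: node number, split condition, number of observations, number misclassified, predicted class, and class probabilities in parentheses; a trailing `*` marks a terminal node. The root node error on the slide follows the same rule as `root_error` here: 2364 of the 3792 observations are not in the majority class, Walking.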

Page 14: India software developers conference 2013 Bangalore

Random Forest: Ensemble of Trees

[Ref] Rattle R Data Mining Tool

[Figure: Tree 1, Tree 2, ..., Tree n combined (Σ) into the Random Forest]
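The idea of combining many trees can be sketched with plain bagging: each tree is fitted to a bootstrap sample and the ensemble predicts by majority vote. This is a simplified illustration, not the random forest algorithm itself, since a random forest also samples a random subset of variables at every split. The `iris` data, the tree count `n_trees` and all variable names are assumptions.

```r
library(rpart)

set.seed(2013)
n_trees <- 25   # illustrative ensemble size

# Fit each tree on a bootstrap sample of the rows
trees <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[rows, ], method = "class")
})

# Collect every tree's vote for every row: a 150 x n_trees character matrix
votes <- sapply(trees, function(tree) {
  as.character(predict(tree, iris, type = "class"))
})

# Majority vote across trees gives the ensemble prediction
ensemble <- apply(votes, 1, function(v) names(which.max(table(v))))
accuracy <- mean(ensemble == iris$Species)
print(accuracy)
```

Averaging over trees trained on different resamples reduces the variance of a single tree, which is the motivation for the Σ in the diagram.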

Page 15: India software developers conference 2013 Bangalore

Random Forest Package in R

randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
             importance = TRUE, replace = FALSE, na.action = na.roughfix)

• Random Forest Model Results:

Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:

            Downstairs  Jogging  Sitting  Standing  Upstairs  Walking  class.error
Downstairs         204        7        0         1        64       97   0.45308311
Jogging              6     1117        0         0         8        7   0.01845343
Sitting              0        0      209         5         1        0   0.02790698
Standing             4        0        0       177         4        0   0.04324324
Upstairs            48       31        1         0       276       97   0.39072848
Walking             20        1        1         1        15     1390   0.02661064
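The error figures in the random forest output can be recomputed from the confusion matrix counts alone. The sketch below enters the counts from the slide by hand (rows are actual classes, columns are predicted classes) and derives the per-class error and the overall out-of-bag error; the variable names are illustrative.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Confusion matrix from the slide: rows = actual class, columns = predicted class
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE, dimnames = list(classes, classes))

# Per-class error: misclassified rows divided by the row total
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall OOB error: all off-diagonal counts divided by all observations
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

The matrix shows where the model struggles: Downstairs and Upstairs are mostly confused with each other and with Walking, while Jogging, Sitting, Standing and Walking each have an error below 5%.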

Page 16: India software developers conference 2013 Bangalore

Summary

• Fusion of data science and domain knowledge enables big insights from data

• The R language provides a platform to rapidly build prototypes and test ideas

• Getting insights from data is the outcome of intense team effort among various stakeholders

Page 17: India software developers conference 2013 Bangalore

References

• R Project: http://www.r-project.org

• Activity Recognition Dataset: “The Impact of Personalization on Smartphone-Based Activity Recognition”, Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05

• “Activity and Gait Recognition with Time-Delay Embeddings”, Jordan Frank, AAAI Conference on Artificial Intelligence, 2010

• R wiki: http://rwiki.sciviews.org/doku.php

• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php

• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/

• Rattle – R Data Mining Tool: http://rattle.togaware.com/

• Sensor Platforms: http://www.sensorplatforms.com/context-aware/

• Movea: http://www.movea.com/

• Alohar: https://www.alohar.com