India software developers conference 2013 Bangalore
Data Science 101: Using R Language to get Big Insights
Satnam Singh,
Senior Chief Engineer,
Samsung Research India – Bangalore
[ Twitter - @satnam74s]
India Software Developers Conference, Bangalore
March 16, 2013
Motivation: Using Data to get Business Insights
[Figure: Databases & Clusters → Insights?]
Ref. [kaggle.com]
Data Science Programming Languages
Why R?
• Popular, Free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language
Ref [http://www.r-project.org]
R Language Basics
Simple Operations
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
Vector Operations and Function Calls
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
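As a small illustration of the vectorization and function calls mentioned above, the following sketch reuses the vector `y` from the slide. The other variable names are made up for this example.

```r
# Element-wise arithmetic applies to every entry without an explicit loop.
y <- c(1, 2, 3, 4)
doubled <- y * 2                    # 2 4 6 8
shifted <- y + c(10, 20, 30, 40)    # 11 22 33 44

# Function calls operate on the whole vector at once.
total   <- sum(y)                   # 10
average <- mean(y)                  # 2.5

# Logical indexing keeps only the entries that satisfy a condition.
big <- y[y > 2]                     # 3 4

print(doubled)
print(big)
```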
R Language: Data Structures Examples
• Data frame
> MyFamilyage <- c(5,6,40,38)
> MyFamilyName <- c("Sat","Veera","Minu","Dummy")
> MyFamilyweight <- c(72,70,12,40)
> MyFamily <- data.frame(MyFamilyName,MyFamilyage,MyFamilyweight)
• Matrix
> MyMatrix <- as.matrix(MyFamilyage)
> Mydataframe <- as.data.frame(MyMatrix)
• List
> MyList <- as.list(Mydataframe)
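The three structures can be built and inspected with base R alone. The sketch below uses the family vectors from the slide; the filtering step and the variable `adults` are additions for illustration only.

```r
MyFamilyName   <- c("Sat", "Veera", "Minu", "Dummy")
MyFamilyage    <- c(5, 6, 40, 38)
MyFamilyweight <- c(72, 70, 12, 40)

# Data frame: columns of different types, addressed by name.
MyFamily <- data.frame(MyFamilyName, MyFamilyage, MyFamilyweight)
adults   <- MyFamily[MyFamily$MyFamilyage > 18, ]   # rows where age exceeds 18

# Matrix: a single data type with explicit dimensions (here 4 rows, 1 column).
MyMatrix <- as.matrix(MyFamilyage)
print(dim(MyMatrix))

# List: each data frame column becomes one named list element.
MyList <- as.list(MyFamily)
print(names(MyList))
```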
Case Study: Activity Recognition
• Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
Example of Accelerometer data
Smartphone’s Accelerometer Sensor
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
Data Analysis - Steps
Time Series Data: 200 samples (10 sec)
Feature Extraction: 43 Features
• Mean for each acc. axis (3)
• Std. dev. for each acc. axis (3)
• Avg. abs. diff. from mean for each acc. axis (3)
• Avg. resultant acc. (1)
• Histogram (30)
Classifiers
• CART: Decision Tree
• RF: Random Forest
Classify the Activity
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
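The feature groups listed above can be sketched in base R for a single 200-sample window. This is an illustrative sketch, not the cited authors' code: the function name `extract_features`, the column names `x`, `y`, `z`, the equal-width binning over each axis's range, and the simulated data are all assumptions. It produces the 3 + 3 + 3 + 1 + 30 = 40 values named on the slide and does not attempt to reproduce the full 43-feature set.

```r
# Hypothetical helper: summary features for one window of accelerometer data.
# `window` is assumed to be a data frame with numeric columns x, y and z.
extract_features <- function(window, bins = 10) {
  per_axis <- lapply(c("x", "y", "z"), function(a) {
    v <- window[[a]]
    # Equal-width bins between the smallest and largest value on this axis
    # (a constant axis would need special handling, omitted here).
    breaks <- seq(min(v), max(v), length.out = bins + 1)
    binned <- table(cut(v, breaks = breaks, include.lowest = TRUE))
    out <- c(mean(v),                      # mean
             sd(v),                        # standard deviation
             mean(abs(v - mean(v))),       # average absolute difference from mean
             as.numeric(binned) / length(v))  # fraction of samples per bin
    names(out) <- paste0(a, c("_avg", "_sd", "_absdev",
                              paste0("_bin", seq_len(bins))))
    out
  })
  # Average resultant acceleration across the three axes.
  resultant <- mean(sqrt(window$x^2 + window$y^2 + window$z^2))
  c(unlist(per_axis), resultant = resultant)
}

# Simulated stand-in for 10 seconds of readings (200 samples).
set.seed(1)
window <- data.frame(x = rnorm(200), y = rnorm(200, mean = 9.8), z = rnorm(200))
features <- extract_features(window)
print(length(features))   # 40
```

Each window yields one row of features, and the rows with their activity labels form the table that the classifiers are trained on.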
Data Visualization – Activity (Class Variable)
[Ref] Rattle R Data Mining Tool
require(gplots, quietly=TRUE)      # provides barplot2()
require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
# Class frequencies: overall first, then within each activity subset
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
# Order the classes by overall frequency, most frequent first
ord <- order(ds[1,], decreasing=TRUE)
# Bar plot of the class frequencies
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class", ylim=c(0, 2497), col=rainbow_hcl(7))
# Dot plot of the same frequencies
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="", xlab="Frequency", ylab="class", pch=c(1:6, 19))
Bar Plot
Dot Plot
Data Visualization Example – Variable Yavg
# YAVG values overall and for each activity class
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
# Notched box plot of YAVG by class; box widths reflect group sizes
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
# Overlay the group means as star markers
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
# Histogram of YAVG across all classes
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency", col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
[Ref] Rattle R Data Mining Tool
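Correlation Plot
• Easy to interpret
Blue: Positive correlation
Red: Negative correlation
[Ref] Rattle R Data Mining Tool
require(ellipse, quietly=TRUE)
# Pairwise Pearson correlations between the numeric variables
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
# Reorder rows and columns by correlation with the first variable
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
# Map correlations in [-1, 1] onto an 11-step red-white-blue palette
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])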
Data Science R Packages
Category     Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
Cluster      kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
Classifiers  rpart         rpart         Recursive partitioning and regression trees
Classifiers  ksvm          kernlab       Support Vector Machine
Classifiers  apriori       arules        Rule-based classification
Ensemble     ada           ada           Stochastic boosting
Ensemble     randomForest  randomForest  Random Forests classification and regression
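To give a feel for the two clustering functions from the `stats` package, here is a minimal sketch on made-up data. The two-group toy points and all variable names are assumptions for illustration and have nothing to do with the activity dataset.

```r
set.seed(42)
# Hypothetical toy data: two well-separated groups of 20 points in 2-D.
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# K-means clustering with 2 centers and several random restarts.
km <- kmeans(pts, centers = 2, nstart = 10)

# Hierarchical clustering on Euclidean distances, cut into 2 groups.
hc <- hclust(dist(pts), method = "complete")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two groupings to see whether they agree.
print(table(kmeans = km$cluster, hclust = hc_groups))
```

On data this cleanly separated both methods should recover the same two groups, although the cluster labels themselves may be numbered differently.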
Decision Tree - Visualization
[Ref] Rattle R Data Mining Tool
• Decision Tree Model Results:
n= 3792
1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
Root node error: 2364/3792 = 0.62342
Decision Tree
rpart(formula = class ~ ., data = smartphone_data, method = "class", parms = list(split = "information"), control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
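The same call pattern can be tried end to end on a dataset that ships with R. In this sketch the built-in `iris` data stands in for `smartphone_data`, which is not available here, and `Species` plays the role of the `class` column. The 100/50 train/test split is an assumption added for illustration. The `rpart` package is normally installed with R as a recommended package.

```r
library(rpart)

# Stand-in data: iris replaces the smartphone dataset (assumption).
set.seed(7)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Same options as the call above: information-gain splits, no surrogates.
fit <- rpart(Species ~ ., data = train, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
print(fit)   # text view of the fitted tree, in the format shown above

# Predict the class of held-out rows and measure the share classified correctly.
pred <- predict(fit, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
print(accuracy)
```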
Random Forest: Ensemble of Trees
[Ref] Rattle R Data Mining Tool
[Figure: Tree 1, Tree 2, …, Tree n → Σ → Random Forest]
• Random Forest Model Results:
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
           Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs        204       7       0        1       64      97  0.45308311
Jogging             6    1117       0        0        8       7  0.01845343
Sitting             0       0     209        5        1       0  0.02790698
Standing            4       0       0      177        4       0  0.04324324
Upstairs           48      31       1        0      276      97  0.39072848
Walking            20       1       1        1       15    1390  0.02661064
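The `class.error` column and the overall OOB error rate follow directly from the counts in the confusion matrix. The sketch below re-enters the reported counts by hand and recomputes both; the variable names are chosen for this example.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Counts copied from the confusion matrix above (rows: actual, columns: predicted).
confusion <- matrix(c(
  204,    7,   0,   1,  64,   97,
    6, 1117,   0,   0,   8,    7,
    0,    0, 209,   5,   1,    0,
    4,    0,   0, 177,   4,    0,
   48,   31,   1,   0, 276,   97,
   20,    1,   1,   1,  15, 1390
), nrow = 6, byrow = TRUE, dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(confusion) / rowSums(confusion)
print(round(class_error, 4))

# Overall error: share of all observations that fall off the diagonal.
oob_error <- 1 - sum(diag(confusion)) / sum(confusion)
print(round(100 * oob_error, 2))   # 11.05
```

The per-class values show where the errors concentrate: Downstairs and Upstairs are confused with each other and with Walking, while the other four classes are recognized with an error below 5%.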
Random Forest Package in R
randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6, importance = TRUE, replace = FALSE, na.action = na.roughfix)
Summary
• Fusing data science with domain knowledge enables big insights from data
• The R language provides a platform for rapidly building prototypes and testing ideas
• Data insights are the outcome of intense team effort among various stakeholders
References
• R Project: http://www.r-project.org
• Activity Recognition Dataset: “The Impact of Personalization on Smartphone-Based Activity Recognition,” Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
• “Activity and Gait Recognition with Time-Delay Embeddings,” Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
• R wiki: http://rwiki.sciviews.org/doku.php
• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
• Rattle – R Data Mining Tool: http://rattle.togaware.com/
• Sensor Platforms: http://www.sensorplatforms.com/context-aware/
• Movea: http://www.movea.com/
• Alohar: https://www.alohar.com