Revolution Confidential
Introduction to R for
Data Mining
2013 Webinar Series
Joseph B. Rickert
February 14, 2013
First Polling Question
What is your favorite data mining software tool?
1. R
2. SAS
3. MapReduce
4. Weka
5. Other
My goal for today’s webinar is to convince you that:
R is a serious platform for data mining
Revolution R Enterprise is the platform for serious data mining
Seriously, it is not difficult to learn enough R to do some serious data mining
A word about Data Mining
We assume that you know a little bit about data mining and this is
your context for learning R
Data Mining
Applications
Credit Scoring
Fraud Detection
Ad Optimization
Targeted Marketing
Gene Detection
Recommendation systems
Social Networks
Actions
Acquire Data
Prepare
Classify
Predict
Visualize
Optimize
Interpret
Algorithms
CART
Random Forests
SVM
KMeans
Hierarchical clustering
Ensemble Techniques
R is:
The way to do statistical computing
A full blown programming language
The home of nearly every data mining algorithm known to data science
A vibrant world-wide community
R was written in the early 1990s by Robert Gentleman and Ross Ihaka.
Since 1997, a core group of about 20 developers has guided the evolution of the language.
R is organized into libraries of functions called packages
CRAN
  R download: Base and Recommended packages
  User contributed packages
[Figure: R Package Growth. 4,332 packages as of 2/13/13]
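The split between Base, Recommended, and user contributed packages can be inspected from an R session. The following is a minimal sketch using only functions from the utils package that ships with R; the variable names are illustrative, and the commented-out package name is just an example of a user contributed package.

```r
# Packages that come with every R download, grouped by priority.
# installed.packages() returns a matrix with one row per installed package.
base_pkgs <- rownames(installed.packages(priority = "base"))
rec_pkgs  <- rownames(installed.packages(priority = "recommended"))

print(base_pkgs)  # includes "stats", "utils", "graphics", ...
print(rec_pkgs)   # may be empty on a minimal R installation

# User contributed packages are installed from CRAN on demand
# (needs a network connection, so it is left commented out here):
# install.packages("randomForest")

# Attach a package to make its functions available by name.
library(stats)
```

A package only needs to be installed once, but it has to be attached with library() in every new session that uses it.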
Finding Your Way Around
The R world of Machine Learning, Data Mining, and Visualization
Finding Packages
  Task Views
  crantastic.org
Blogs
  Revolutions
  R-Bloggers
  Quick-R
  Inside-R
Getting Help
Finding R People
  User Groups worldwide
  Twitter: #rstats
Learning R?
Levels of R Skill
R developer: Write production grade code
R contributor: Write an R package
R programmer: Write code and algorithms
R user: Use R functions
R aware: Use a GUI
Hours of use: 10 to 10,000 (the Malcolm Gladwell “Outlier” Scale)
Basic Machine Learning Functions
Category     Function       Library        Description
Cluster      hclust         stats          Hierarchical cluster analysis
             kmeans         stats          K-means clustering
Classifiers  glm            stats          Logistic regression
             rpart          rpart          Recursive partitioning and regression trees
             ksvm           kernlab        Support Vector Machine
             apriori        arules         Rule based classification
Ensemble     ada            ada            Stochastic boosting
             randomForest   randomForest   Random Forests classification and regression
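A few of the functions in this table can be tried without installing anything. The sketch below is illustrative only: it uses the three functions from the stats package (hclust, kmeans, glm) on the built-in iris data set, and the choice of data, the number of clusters, and the predictors are assumptions made for the example, not part of the webinar scripts.

```r
# hclust, kmeans and glm all live in the stats package, which ships with R.
set.seed(1)
x <- scale(iris[, 1:4])  # standardize the four numeric measurements

# Hierarchical cluster analysis, cut into 3 groups
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# K-means clustering with 3 centers and several random restarts
km <- kmeans(x, centers = 3, nstart = 10)

# Logistic regression: is the flower of species virginica?
iris2 <- transform(iris, is_virginica = as.integer(Species == "virginica"))
logit <- glm(is_virginica ~ Sepal.Length + Petal.Width,
             data = iris2, family = binomial)

# Compare the k-means clusters with the known species labels
print(table(kmeans = km$cluster, species = iris$Species))
print(summary(logit)$coefficients)
```

The other functions in the table (rpart, ksvm, apriori, ada, randomForest) follow the same pattern once their packages are installed and attached.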
Noteworthy Data Mining Packages
Package  Comment
caret    Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
rattle   A very intuitive GUI for data mining that produces useful R code
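To give a feel for the caret workflow, here is a minimal sketch that assumes the caret and rpart packages are installed; it is not taken from the webinar scripts. It fits a classification tree to the built-in iris data with 5-fold cross-validation. The data set, the model type, and the resampling settings are illustrative choices.

```r
# Assumption: caret and rpart are installed. The guard skips the example otherwise.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(caret)
  set.seed(42)

  # 5-fold cross-validation to estimate out-of-sample accuracy
  ctrl <- trainControl(method = "cv", number = 5)

  # train() wraps many model types behind one interface;
  # method = "rpart" selects a recursive partitioning tree
  fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

  print(fit)                       # resampled accuracy per tuning value
  print(head(predict(fit, iris)))  # class predictions from the final model
}
```

Changing the method argument (for example to "rf", which requires the randomForest package) swaps the underlying model while the rest of the code stays the same.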
TIME TO RUN SOME CODE
Doing a lot with a little R
Scripts
1. GETTING STARTED.R
2. ROLL with RATTLE.R
3. IN THE TREES.R
4. INTRO to CARET.R
5. BIG DATA with RevoScaleR.R
6. WORDCLOUD.R
The R Scripts are available at: https://gist.github.com/joseph-rickert/4742529
Second Polling Question
What are your favorite data mining techniques?
1. Clustering techniques such as K-means
2. Single model classifiers such as decision trees or SVMs
3. Ensemble classifiers such as Random Forests or boosting models
4. Text mining techniques
5. Other
Third Polling Question (insert after running script IN THE TREES)
What kind of data do you analyze?
1. Financial data
2. Customer data (e.g. for recommendations)
3. Website data (e.g. for ads)
4. Health Care data
5. Other
Too Big for Open Source R
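# Pull the XDF data in mdata into an in-memory data frame
mortDF <- rxXdfToDataFrame(mdata, maxRowsByCols = 300000000)
# Fit a logistic regression on all columns with open source R's glm()
model <- glm(default ~ ., data = mortDF, family = "binomial")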
RevoScaleR brings the power of Big Data to R
Distributed Statistical Algorithms: Parallel External Memory Algorithms that are distributed among available compute resources (cores & computers) independent of platform
Communications Framework: Abstracted layer for providing communication between compute nodes in a cluster (MPI, MapReduce, In-Database)
Data Source API: API for integrating external data sources (files, databases, HDFS) that provides optimized reading of rows and columns in blocks
R Language Interface: Familiar, high-productivity programming paradigm for R users
RevoScaleR PEMAs: Parallel External Memory Algorithms
[Figure: An XDF file divided into blocks (Block 1, Block 2, ..., Block i, Block i + 1, Block i + 2). Read blocks and compute intermediate results in parallel, iterating as necessary (1st pass, 2nd pass, 3rd pass). The per-block results (Block 1 results, ..., Block i results, Block i + 1 results, Block i + 2 results) are combined up to the results from the last block.]
R based algorithms
Work on blocks of data
Inherently parallel and distributed
Do not require all data to be in memory at one time
Can deal with distributed and streaming data
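The block-wise idea can be sketched in base R. This is not RevoScaleR code and not how its algorithms are implemented; it is a minimal illustration of an external memory algorithm for linear regression. Each block contributes the intermediate results X'X and X'y, and the blocks are summed at the end. The simulated data frame stands in for a file on disk, and all names (dat, block_size, block_stats) are hypothetical.

```r
set.seed(123)
n <- 10000
# Stand-in for a large data set on disk; in practice each block would be read from a file
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

block_size <- 1000
starts <- seq(1, n, by = block_size)

# Intermediate results for one block: X'X and X'y.
# Only one block needs to be in memory at a time.
block_stats <- lapply(starts, function(s) {
  block <- dat[s:min(s + block_size - 1, n), ]
  X <- cbind(Intercept = 1, x1 = block$x1, x2 = block$x2)
  list(xtx = crossprod(X), xty = crossprod(X, block$y))
})

# Combine the per-block results by simple addition
xtx <- Reduce(`+`, lapply(block_stats, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(block_stats, `[[`, "xty"))

# Solve the normal equations for the regression coefficients
beta <- drop(solve(xtx, xty))
print(beta)

# Same coefficients as the in-memory fit, up to numerical tolerance
print(all.equal(unname(beta), unname(coef(lm(y ~ x1 + x2, data = dat)))))
```

Because the blocks are processed independently, the lapply() call could be swapped for a parallel map such as parallel::mclapply(). Linear regression needs a single pass; iterative models such as logistic regression repeat the pass over the blocks until convergence.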
Continuing to Learn R
Resources
  RevoJoe: How to Learn R
  More R Documentation
  The R Journal
  Books
  Reference Card
  and more
Classes
  Coursera
  Revolution Analytics
Examples
  Thomson Nguyen on the Heritage Health Prize
  Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
  Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
  Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)