tokyoR11
Introduction to the caret Package: Dedicated to Machine Learning Beginners
@dichika
Today's Agenda
• Self-introduction
• Why caret now
• The caret recipe

Self-introduction
• @dichika (batade)
• Works with medical expense and health checkup data
• About a year and a bit of R (roughly as long as this R study group)
• Butter Danish
Now, on to the main topic.

Why caret now?

When you use machine learning…
For example, SVM:

1.2 Proposed Procedure
Many beginners use the following procedure now:
• Transform data to the format of an SVM package
• Randomly try a few kernels and parameters
• Test
We propose that beginners try the following procedure first:
• Transform data to the format of an SVM package
• Conduct simple scaling on the data
• Consider the RBF kernel K(x, y) = e^(−γ‖x−y‖²)
• Use cross-validation to find the best parameter C and γ
• Use the best parameter C and γ to train the whole training set⁵
• Test
We discuss this procedure in detail in the following sections.
2 Data Preprocessing
2.1 Categorical Feature
SVM requires that each data instance is represented as a vector of real numbers.
Hence, if there are categorical attributes, we first have to convert them into numeric
data. We recommend using m numbers to represent an m-category attribute. Only
one of the m numbers is one, and others are zero. For example, a three-category
attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).
Our experience indicates that if the number of values in an attribute is not too large,
this coding might be more stable than using a single number.
⁵ The best parameter might be affected by the size of the data set, but in practice the one obtained from cross-validation is already suitable for the whole training set.
— From "A Practical Guide to Support Vector Classification"
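The m-of-m coding the guide describes can be reproduced in a few lines of base R with model.matrix (a minimal sketch; the colors data is made up for illustration):

```r
# One-hot encode a 3-category attribute: one column per category,
# exactly one 1 per row, as recommended in the guide above
colors <- factor(c("red", "green", "blue", "green"))
onehot <- model.matrix(~ colors - 1)  # "- 1" drops the intercept, keeping all m columns
```

caret itself also provides dummyVars() for the same job on whole data frames.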
Beginners tend to pick parameters and settings arbitrarily.
For good results, preprocessing and tuning are needed.

Conclusion: I want the easy way.
• Machine learning looks useful, but tuning is a pain
• There are too many machine learning models to begin with, and it is hard to know which one to use
• I want to use machine learning more casually
Enter caret.

What is caret?
• Short for classification and regression training
• Lets you evaluate more than 100 machine learning (and other) models under a single scheme
• Written by someone at Pfizer
Speaking of Pfizer…

Viagra.

If that works, we can trust them.

With that settled, on to the content.
Motivation behind caret
• There are too many machine learning packages, so build an environment that can test them all at once
• Support parallel processing
The caret recipe
1. Preprocessing
2. Model building and evaluation
3. Model comparison
The caret recipe
1. Preprocessing
2. Model building and evaluation
3. Model comparison
Preprocessing
• Remove variables that carry almost no information
• For highly correlated variable pairs, drop one of the two
• Split the data into a model-building set and a validation set
Preprocessing

nearZeroVar(data)
findCorrelation(data, cutoff = 0.8)     # correlation coefficient cutoff value
createDataPartition(class, p = 0.75)    # proportion of data used for model building
Usage example:

# Load data on drug resistance of cancer (leukemia) cells
data(mdrr)
# Loading it gives mdrrDescr and mdrrClass

# Remove low-information variables (variance near zero)
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]

# Remove strongly correlated variables
descrCor <- cor(filteredDescr)
highlyCorDescr <- findCorrelation(descrCor, cutoff = 0.75)
filteredDescr <- filteredDescr[, -highlyCorDescr]  # drop from the already-filtered set

# Split the data into model-building and test sets (here 50% goes to training)
inTrain <- createDataPartition(mdrrClass, p = 0.5, list = FALSE)

trainDescr <- filteredDescr[inTrain, ]   # for model building
testDescr  <- filteredDescr[-inTrain, ]  # for testing

trainMDRR <- mdrrClass[inTrain]   # for model building
testMDRR  <- mdrrClass[-inTrain]  # for testing
The caret recipe
1. Preprocessing
2. Model building and evaluation
3. Model comparison
Model building and evaluation
• Build the various models
• Speed things up with parallel processing
Various models via train

train(data, class,
      method = "rf",                       # specify the model
      preProcess = c("center", "scale"),
      trControl = trainControl)
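What preProcess = c("center", "scale") does can be illustrated in base R (a sketch with made-up numbers, not caret's internal code):

```r
# Center (subtract the mean) and scale (divide by the standard deviation),
# which is what preProcess = c("center", "scale") applies to each predictor
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)
# After the transformation, z has mean 0 and standard deviation 1
```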
A sample of the available models (Table 1: Models used in train):

Model                                    | method value | Package               | Tuning parameters
"Dual-use" models:
Generalized linear model                 | glm          | stats                 | None
                                         | glmStepAIC   | MASS                  | None
Generalized additive model               | gam          | mgcv                  | select, method
                                         | gamLoess     | gam                   | span, degree
                                         | gamSpline    | gam                   | df
Recursive partitioning                   | rpart        | rpart                 | maxdepth
                                         | ctree        | party                 | mincriterion
                                         | ctree2       | party                 | maxdepth
Boosted trees                            | gbm          | gbm                   | n.trees, shrinkage, interaction.depth
                                         | blackboost   | mboost                | maxdepth, mstop
                                         | ada          | ada                   | maxdepth, iter, nu
Other boosted models                     | glmboost     | mboost                | mstop
                                         | gamboost     | mboost                | mstop
Random forests                           | rf           | randomForest          | mtry
                                         | parRF        | randomForest, foreach | mtry
                                         | cforest      | party                 | mtry
                                         | Boruta       | Boruta                | mtry
Bagging                                  | treebag      | ipred                 | None
                                         | bag          | caret                 | vars
                                         | logicBag     | logicFS               | ntrees, nleaves
Other trees                              | nodeHarvest  | nodeHarvest           | maxinter, mode
                                         | partDSA      | partDSA               | cut.off.growth, MPD
Multivariate adaptive regression splines | earth, mars  | earth                 | degree, nprune
                                         | gcvEarth     | earth                 | degree
Bagged MARS                              | bagEarth     | caret, earth          | degree, nprune
Logic regression                         | logreg       | LogicReg              | ntrees, treesize
Elastic net (glm)                        | glmnet       | glmnet                | alpha, lambda
Various models via train

train(data, class,
      method = "rf",
      preProcess = c("center", "scale"),   # normalize the data
      trControl = trainControl)            # customize training
Customizing training

trainControl(method = "LGOCV",
             p = 0.75,
             number = 30)

# LGOCV: use 75% of the data for training, repeated 30 times
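What LGOCV ("leave-group-out cross-validation") amounts to here is 30 repeated random train/test splits. A simplified, unstratified base-R sketch (caret's createDataPartition additionally stratifies by class, which is why the sample sizes on the results slides come out as 199):

```r
# 30 random 75/25 splits of 264 rows, mimicking
# trainControl(method = "LGOCV", p = 0.75, number = 30)
set.seed(1)
n <- 264
splits <- replicate(30, sample(n, size = floor(0.75 * n)), simplify = FALSE)
# Each element of splits is one set of training-row indices
```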
並列処理でスピードアップ
•trainControlの中で指定する•以下2つのパッケージを使う場合が紹介されている
•Rmpi•nws(こっちがお手軽)
Usage example:

# First set up the training scheme
fitControl <- trainControl(method = "LGOCV", p = 0.75, number = 30)

# Train with a support vector machine
svmFit <- train(trainDescr, trainMDRR, method = "svmRadial",
                preProcess = c("center", "scale"), trControl = fitControl)

# Train with a random forest
rfFit <- train(trainDescr, trainMDRR, method = "rf",
               preProcess = c("center", "scale"), trControl = fitControl)
Results and evaluation (SVM):

264 samples, 95 predictors
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits (30 reps, 75%)
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...
Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.569     0      0            0
  0.5   0.569     0      0            0
  1     0.569     0      0            0

Tuning parameter 'sigma' was held constant at a value of 1.07e-06.
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were C = 0.25 and sigma = 1.07e-06.
Results and evaluation (RF):

264 samples, 95 predictors
Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits (30 reps, 75%)
Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...
Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.77      0.517  0.037        0.0828
  48    0.772     0.528  0.0482       0.0995
  95    0.771     0.526  0.0506       0.105

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 48.
The caret recipe
1. Preprocessing
2. Model building and evaluation
3. Model comparison
Comparing different models

extractPrediction(list(models),   # the fitted models (a single model or a list)
                  testX = testdata,
                  testY = testclass)

confusionMatrix(data$pred, data$obs)
Confusion matrix:

             Actually +   Actually −
Predicted +  correct
Predicted −               correct

The proportion of correct predictions is the accuracy.
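The accuracy calculation above, worked through in base R using the random-forest counts from the results slide:

```r
# 2x2 confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(138, 11, 37, 78), nrow = 2,
             dimnames = list(predicted = c("Active", "Inactive"),
                             actual    = c("Active", "Inactive")))
# Accuracy = correct predictions (the diagonal) / all predictions
accuracy <- sum(diag(cm)) / sum(cm)  # (138 + 78) / 264 = 0.8182
```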
Usage example:

# Apply the models to the test data
allPred <- extractPrediction(list(svmFit, rfFit),
                             testX = testDescr, testY = testMDRR)

# Keep only the rows whose "dataType" column says "Test"
testPred <- subset(allPred, dataType == "Test")

# Compare with confusionMatrix
tp_svm <- subset(testPred, model == "svmRadial")
tp_rf  <- subset(testPred, model == "rf")

confusionMatrix(tp_svm$pred, tp_svm$obs)
confusionMatrix(tp_rf$pred, tp_rf$obs)
Usage example:

> confusionMatrix(tp_svm$pred, tp_svm$obs)
Confusion Matrix and Statistics

          Reference
Prediction Active Inactive
  Active      149      115
  Inactive      0        0

               Accuracy : 0.5644
                 95% CI : (0.5022, 0.6251)
    No Information Rate : 0.5644
    P-Value [Acc > NIR] : 0.5258
                  Kappa : 0
 Mcnemar's Test P-Value : <2e-16
            Sensitivity : 1.0000
            Specificity : 0.0000
         Pos Pred Value : 0.5644
         Neg Pred Value : NaN
             Prevalence : 0.5644
         Detection Rate : 0.5644
   Detection Prevalence : 1.0000
       'Positive' Class : Active
> confusionMatrix(tp_rf$pred, tp_rf$obs)
Confusion Matrix and Statistics

          Reference
Prediction Active Inactive
  Active      138       37
  Inactive     11       78

               Accuracy : 0.8182
                 95% CI : (0.7663, 0.8628)
    No Information Rate : 0.5644
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.6204
 Mcnemar's Test P-Value : 0.0003080
            Sensitivity : 0.9262
            Specificity : 0.6783
         Pos Pred Value : 0.7886
         Neg Pred Value : 0.8764
             Prevalence : 0.5644
         Detection Rate : 0.5227
   Detection Prevalence : 0.6629
       'Positive' Class : Active
The functions introduced today are only a small part of caret.

Bonus
Once machine learning has caught your interest, studying PRML and the like and implementing things yourself is fine, but entering contests and getting a feel for practical problems such as parameter tuning and computation speed is also worthwhile.
Benefits of contests
• Prize money
• You are forced, like it or not, to face your actual skill level
• Prize money
So I tried entering a contest (using caret).

Contest portals:
• TunedIT
• kaggle

The task: recommending R packages.
Result: 24th place out of 48 teams.

Middling.

That aside, to wrap up:
• caret's documentation is excellent, so go read it
• Beginners should just enter a contest