Introduction to the caret Package ~For Machine Learning Beginners~ @dichika


Page 1: Tokyo r11caret

Introduction to the caret Package ~For Machine Learning Beginners~

@dichika

Page 2: Tokyo r11caret

Today's agenda

Page 3: Tokyo r11caret

• Self-introduction
• Why caret now?
• The caret recipe

Today's agenda

Page 4: Tokyo r11caret

Self-introduction

Page 5: Tokyo r11caret

Self-introduction

• @dichika ("batade")
• Works with medical-expense and health-checkup data
• About one year (+α) of R experience (roughly as old as this R study group)
• Butter Danish

Page 6: Tokyo r11caret

Straight to the main topic

Page 7: Tokyo r11caret

Why caret now?

Page 8: Tokyo r11caret

When you use machine learning…

Page 9: Tokyo r11caret

For example: SVM

1.2 Proposed Procedure

Many beginners use the following procedure now:

• Transform data to the format of an SVM package

• Randomly try a few kernels and parameters

• Test

We propose that beginners try the following procedure first:

• Transform data to the format of an SVM package

• Conduct simple scaling on the data

• Consider the RBF kernel K(x, y) = e^(−γ‖x−y‖²)

• Use cross-validation to find the best parameter C and γ

• Use the best parameter C and γ to train the whole training set [5]

• Test

We discuss this procedure in detail in the following sections.

2 Data Preprocessing

2.1 Categorical Feature

SVM requires that each data instance is represented as a vector of real numbers.

Hence, if there are categorical attributes, we first have to convert them into numeric

data. We recommend using m numbers to represent an m-category attribute. Only

one of the m numbers is one, and others are zero. For example, a three-category

attribute such as {red, green, blue} can be represented as (0,0,1), (0,1,0), and (1,0,0).

Our experience indicates that if the number of values in an attribute is not too large,

this coding might be more stable than using a single number.

[5] The best parameter might be affected by the size of the data set, but in practice the one obtained from cross-validation is already suitable for the whole training set.


From "A Practical Guide to Support Vector Classification"

Beginners tend to pick parameters and the like arbitrarily
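The m-number coding the excerpt recommends for categorical features can be reproduced in R with `model.matrix` (a small sketch, not from the slides; the variable names here are made up):

```r
# One indicator column per level of a 3-category attribute; in each row
# exactly one entry is 1 and the rest are 0, as the excerpt describes.
colors <- factor(c("red", "green", "blue", "green"))
onehot <- model.matrix(~ colors - 1)  # "- 1" drops the intercept column
print(onehot)
```

Each level becomes its own 0/1 column, so a three-category attribute maps to vectors like (0,0,1), (0,1,0), (1,0,0).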

Page 10: Tokyo r11caret

(The same excerpt from "A Practical Guide to Support Vector Classification" is shown again here.)

[Good results] require preprocessing and tuning

Page 11: Tokyo r11caret

Conclusion: I want an easy life

• Machine learning looks useful, but tuning is a hassle
• There are so many machine learning models that it's hard to know which one to use

• I want to use machine learning more casually

Page 12: Tokyo r11caret

Enter caret

Page 13: Tokyo r11caret

What is caret?

• Short for "classification and regression training"

• Lets you evaluate 100+ machine learning (and other) models within a single scheme

• Created by someone at Pfizer

Page 14: Tokyo r11caret
Page 15: Tokyo r11caret

Speaking of Pfizer…

Page 16: Tokyo r11caret

Viagra

Page 17: Tokyo r11caret

If that's fine, let's trust them

Page 18: Tokyo r11caret

With that settled, on to the content

Page 19: Tokyo r11caret

Motivation behind caret

• There are too many machine learning packages; we want a single environment for running experiments across all of them

• We want parallel processing

Page 20: Tokyo r11caret

The caret recipe

1. Preprocessing

2. Model building / evaluation

3. Model comparison

Page 21: Tokyo r11caret

The caret recipe

1. Preprocessing

2. Model building / evaluation

3. Model comparison

Page 22: Tokyo r11caret

Preprocessing

• Remove variables with near-zero information
• Of each highly correlated variable pair, drop one
• Split the data into a model-building set and a validation set

Page 23: Tokyo r11caret

Preprocessing

nearZeroVar(data)

findCorrelation(data, cutoff = 0.8)    # cutoff: correlation-coefficient threshold

createDataPartition(class, p = 0.75)   # p: fraction of data used for model building

Page 24: Tokyo r11caret

Usage example

# Load data on drug resistance of (leukemia) cancer cells
data(mdrr)
# Loading provides mdrrDescr and mdrrClass

# Remove low-information variables (variance close to 0)
nzv <- nearZeroVar(mdrrDescr)
filteredDescr <- mdrrDescr[, -nzv]

# Remove strongly correlated variables
descrCor <- cor(filteredDescr)
highlyCorDescr <- findCorrelation(descrCor, cutoff = 0.75)
filteredDescr <- filteredDescr[, -highlyCorDescr]  # index filteredDescr, not mdrrDescr

# Split the data into test and model-building sets (here 50% goes to training)
inTrain <- createDataPartition(mdrrClass, p = 0.5, list = FALSE)

trainDescr <- filteredDescr[inTrain, ]   # for model building
testDescr  <- filteredDescr[-inTrain, ]  # for testing

trainMDRR <- mdrrClass[inTrain]    # for model building
testMDRR  <- mdrrClass[-inTrain]   # for testing

Page 25: Tokyo r11caret

caretレシピ

1.前処理

2.モデル作成/評価

3.モデル比較

Page 26: Tokyo r11caret

Model building / evaluation

• Fit various models
• Speed things up with parallel processing

Page 27: Tokyo r11caret

Many models through train

train(data, class,
      method = "rf",                        # specify the model
      preProcess = c("center", "scale"),
      trControl = trainControl)

Page 28: Tokyo r11caret

A sample of the available models (Table 1: Models used in train)

Model                                     method Value   Package                  Tuning Parameters

"Dual-Use Models"
Generalized linear model                  glm            stats                    None
                                          glmStepAIC     MASS                     None
Generalized additive model                gam            mgcv                     select, method
                                          gamLoess       gam                      span, degree
                                          gamSpline      gam                      df
Recursive Partitioning                    rpart          rpart                    maxdepth
                                          ctree          party                    mincriterion
                                          ctree2         party                    maxdepth
Boosted Trees                             gbm            gbm                      n.trees, shrinkage,
                                                                                  interaction.depth
                                          blackboost     mboost                   maxdepth, mstop
                                          ada            ada                      maxdepth, iter, nu
Other Boosted Models                      glmboost       mboost                   mstop
                                          gamboost       mboost                   mstop
Random Forests                            rf             randomForest             mtry
                                          parRF          randomForest, foreach    mtry
                                          cforest        party                    mtry
                                          Boruta         Boruta                   mtry
Bagging                                   treebag        ipred                    None
                                          bag            caret                    vars
                                          logicBag       logicFS                  ntrees, nleaves
Other Trees                               nodeHarvest    nodeHarvest              maxinter, mode
                                          partDSA        partDSA                  cut.off.growth, MPD
Multivariate Adaptive Regression Splines  earth, mars    earth                    degree, nprune
                                          gcvEarth       earth                    degree
Bagged MARS                               bagEarth       caret, earth             degree, nprune
Logic Regression                          logreg         LogicReg                 ntrees, treesize
Elastic Net (glm)                         glmnet         glmnet                   alpha, lambda

Page 29: Tokyo r11caret

Many models through train

train(data, class,
      method = "rf",
      preProcess = c("center", "scale"),    # normalize the data
      trControl = trainControl)             # customize training

Page 30: Tokyo r11caret

Customizing training

trainControl(method = "LGOCV",
             p = 0.75,
             number = 30)

LGOCV: use 75% of the data for training, repeated 30 times

Page 31: Tokyo r11caret

Speed-up through parallel processing

• Specified inside trainControl
• The documentation covers two packages:
• Rmpi
• nws (the easier of the two)
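As an aside: on more recent caret versions, parallelism instead runs through whatever foreach backend is registered, e.g. via the doParallel package. A minimal sketch under that assumption (the doParallel route is not what these slides describe):

```r
library(doParallel)        # foreach backend on top of the parallel package

cl <- makeCluster(2)       # start two local worker processes
registerDoParallel(cl)     # register them as the foreach backend
n_workers <- getDoParWorkers()   # confirm the registration took effect

# Any train() call made while the backend is registered can farm its
# resampling iterations out to the workers.

stopCluster(cl)            # release the workers when finished
```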

Page 32: Tokyo r11caret

Usage example

# Configure training first
fitControl <- trainControl(method = "LGOCV", p = 0.75, number = 30)

# Train with a support vector machine
svmFit <- train(trainDescr, trainMDRR, method = "svmRadial",
                preProcess = c("center", "scale"), trControl = fitControl)

# Train with a random forest
rfFit <- train(trainDescr, trainMDRR, method = "rf",
               preProcess = c("center", "scale"), trControl = fitControl)

Page 33: Tokyo r11caret

Results and evaluation (SVM)

264 samples, 95 predictors

Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits (30 reps, 75%)

Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...

Resampling results across tuning parameters:

  C     Accuracy  Kappa  Accuracy SD  Kappa SD
  0.25  0.569     0      0            0
  0.5   0.569     0      0            0
  1     0.569     0      0            0

Tuning parameter 'sigma' was held constant at a value of 1.07e-06
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were C = 0.25 and sigma = 1.07e-06.

Page 34: Tokyo r11caret

Results and evaluation (RF)

264 samples, 95 predictors

Pre-processing: centered, scaled
Resampling: Repeated Train/Test Splits (30 reps, 75%)

Summary of sample sizes: 199, 199, 199, 199, 199, 199, ...

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.77      0.517  0.037        0.0828
  48    0.772     0.528  0.0482       0.0995
  95    0.771     0.526  0.0506       0.105

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 48.

Page 35: Tokyo r11caret

The caret recipe

1. Preprocessing

2. Model building / evaluation

3. Model comparison

Page 36: Tokyo r11caret

Comparing different models

extractPrediction(list(models),    # the fitted models (a single model or a list)
                  testX = testdata,
                  testY = testclass)

confusionMatrix(data$pred, data$obs)

Page 37: Tokyo r11caret

Confusion matrix

                 Actually +    Actually −
Predicted +      hit
Predicted −                    hit

The proportion of hits is the accuracy.
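As a sanity check on that definition, the accuracy can be computed by hand from the SVM confusion matrix shown on the results slide (149 and 115 in the "Active" row, zeros elsewhere):

```r
# Toy confusion matrix: rows = predicted class, columns = actual class
cm <- matrix(c(149, 0,     # actually Active:   149 predicted Active, 0 Inactive
               115, 0),    # actually Inactive: 115 predicted Active, 0 Inactive
             nrow = 2,
             dimnames = list(pred = c("Active", "Inactive"),
                             actual = c("Active", "Inactive")))
accuracy <- sum(diag(cm)) / sum(cm)   # hits on the diagonal / all cases
print(round(accuracy, 4))             # 0.5644, matching the SVM output slide
```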

Page 38: Tokyo r11caret

Usage example

# Apply the fitted models to the test data
allPred <- extractPrediction(list(svmFit, rfFit),
                             testX = testDescr, testY = testMDRR)

# Keep only the rows whose "dataType" column says "Test"
testPred <- subset(allPred, dataType == "Test")

# Compare with confusionMatrix
tp_svm <- subset(testPred, model == "svmRadial")
tp_rf  <- subset(testPred, model == "rf")

confusionMatrix(tp_svm$pred, tp_svm$obs)
confusionMatrix(tp_rf$pred, tp_rf$obs)

Page 39: Tokyo r11caret

Usage example

> confusionMatrix(tp_svm$pred, tp_svm$obs)
Confusion Matrix and Statistics

          Reference
Prediction Active Inactive
  Active      149      115
  Inactive      0        0

               Accuracy : 0.5644
                 95% CI : (0.5022, 0.6251)
    No Information Rate : 0.5644
    P-Value [Acc > NIR] : 0.5258
                  Kappa : 0
 Mcnemar's Test P-Value : <2e-16
            Sensitivity : 1.0000
            Specificity : 0.0000
         Pos Pred Value : 0.5644
         Neg Pred Value : NaN
             Prevalence : 0.5644
         Detection Rate : 0.5644
   Detection Prevalence : 1.0000
       'Positive' Class : Active

> confusionMatrix(tp_rf$pred, tp_rf$obs)
Confusion Matrix and Statistics

          Reference
Prediction Active Inactive
  Active      138       37
  Inactive     11       78

               Accuracy : 0.8182
                 95% CI : (0.7663, 0.8628)
    No Information Rate : 0.5644
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.6204
 Mcnemar's Test P-Value : 0.0003080
            Sensitivity : 0.9262
            Specificity : 0.6783
         Pos Pred Value : 0.7886
         Neg Pred Value : 0.8764
             Prevalence : 0.5644
         Detection Rate : 0.5227
   Detection Prevalence : 0.6629
       'Positive' Class : Active

Page 40: Tokyo r11caret

The functions introduced today are only a small part of caret

Page 41: Tokyo r11caret

Bonus

Page 42: Tokyo r11caret

If machine learning has caught your interest, studying PRML and the like and implementing algorithms yourself is fine, but…

Page 43: Tokyo r11caret

entering a contest and getting to grips with practical issues such as parameter tuning and computation speed is also a good way to learn.

Page 44: Tokyo r11caret

What contests give you

• Prize money
• You are forced, like it or not, to confront your actual skill level

• Prize money

Page 45: Tokyo r11caret

So, as a trial, I entered a contest (using caret)

Page 46: Tokyo r11caret

Contest portals

• TunedIT
• kaggle

Page 47: Tokyo r11caret

An R package recommendation contest

Page 48: Tokyo r11caret

Result: 24th place (out of 48 teams)

Page 49: Tokyo r11caret

Middling.

Page 50: Tokyo r11caret

That aside, the takeaways

• caret's documentation is excellent, so go read it

• Beginners: just enter a contest