Machine Learning Prague
MatrixNet
Michael Levin, Chief Data Scientist
Yandex Data Factory
› Created in 2014
› Machine Learning for other industries
› Computing resources
› Machine Learning infrastructure
› Data scientists
What is MatrixNet?
› Gradient Boosting over Decision Trees
› Classification, Regression, Ranking
› Strong results with default parameters
› Easy to use
› Highly optimized
› Training can be local or parallelized on a cluster
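MatrixNet itself is proprietary, but the boosting loop it is built on can be sketched in a few lines. This is a minimal illustration with depth-1 stumps and an MSE objective, not the real implementation: each new tree fits the residuals of the current ensemble.

```python
# Minimal gradient boosting over decision trees (depth-1 stumps, MSE).
# Illustrative sketch only -- MatrixNet's actual trees and objectives differ.

def fit_stump(x, residuals):
    """Find the threshold split on x that best fits the residuals (MSE)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=50, lr=0.1):
    """Each tree fits the residuals of the ensemble built so far."""
    pred = [0.0] * len(x)
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(x, [yi - pi for yi, pi in zip(y, pred)])
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]
model = boost(x, y)
print(model(2), model(5))  # predictions near the two group levels
```

The same loop generalizes to deeper trees and other differentiable losses; only the residual computation and the tree-fitting step change.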
MatrixNet applications at Yandex
› Web search ranking
› Ads click prediction
› External projects of YDF
› Recommendations
› Bot detection
› Resolving homonymy
› User segmentation
› …
Some tricks & features
› Oblivious trees
Regular vs oblivious trees
[Diagram: a regular decision tree uses different splits in different nodes (e.g. F1>3 at the root, then F2>3 on one branch and F1>6 on the other), while an oblivious tree applies the same split across an entire level (F1>3 at one level, F2>3 at the next).]
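Because every level of an oblivious tree shares one (feature, threshold) split, a depth-d tree reduces to 2^d leaf values indexed by a bitmask of split outcomes. A sketch of the lookup, with made-up splits and leaf values matching the diagram above:

```python
# Evaluating an oblivious tree: one shared split per level, so prediction
# is a bitmask lookup -- fast and cache-friendly. Values are illustrative.

def oblivious_predict(features, splits, leaves):
    """splits: list of (feature_index, threshold), one per level.
    leaves: list of 2 ** len(splits) leaf values."""
    index = 0
    for feature_index, threshold in splits:
        # each level contributes one bit, same test for every sample
        index = (index << 1) | (features[feature_index] > threshold)
    return leaves[index]

splits = [(0, 3.0), (1, 3.0)]   # level 0: F1 > 3, level 1: F2 > 3
leaves = [0.1, 0.4, 0.6, 0.9]   # indexed by bits 00, 01, 10, 11

print(oblivious_predict([5.0, 1.0], splits, leaves))  # F1>3, F2<=3 -> leaf 10
```

The constrained shape costs some expressiveness per tree, but the ensemble compensates, and evaluation becomes a handful of comparisons plus one array read.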
Some tricks & features
› Oblivious trees
› Leaf regularization
› Gradually increase model complexity
› Different objectives: MSE, Log-loss, combinations and non-standard
› Feature binarization
› Estimates feature importance and correlation
Ranking
› Train based on judged (query, document) pairs
› Labels: Excellent, Good, Moderate, Bad, Stupid
› Features: query, document, query-document, url, host, link, …
› Multiclassification, objective = cross-entropy
› Regression: Excellent = 1, Good = 0.8, Stupid = 0, objective = MSE
› Ranking: objective = nDCG
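The grade-to-relevance mapping and the nDCG objective can be sketched as follows. The slide only gives values for Excellent, Good and Stupid; the Moderate and Bad values below are my own interpolation.

```python
# Sketch: graded labels mapped to relevance values, and the nDCG metric.
# Moderate/Bad values are assumed, not from the talk.
import math

GRADE = {"Excellent": 1.0, "Good": 0.8, "Moderate": 0.5, "Bad": 0.2, "Stupid": 0.0}

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(ranked_grades):
    """DCG of the given order, normalized by the DCG of the ideal order."""
    rels = [GRADE[g] for g in ranked_grades]
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg(["Excellent", "Good", "Stupid"]))  # ideal order -> 1.0
print(ndcg(["Stupid", "Good", "Excellent"]))  # inverted order -> below 1.0
```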
Ranking
› nDCG is non-smooth, so there is no gradient for gradient boosting
› Approximate with a smooth ranking objective
› Alternative: pairwise approach
› P(r(dij) < r(dik)) = σ(M(dij) – M(dik))
› Maximize likelihood of data given predictions
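The pairwise objective above can be sketched directly: the probability that document j outranks document k is modeled as the sigmoid of the score difference, and training maximizes the log-likelihood of the observed preference pairs. The scores below are made-up model outputs.

```python
# Sketch of the pairwise ranking objective: P(j beats k) = sigma(M_j - M_k),
# maximize log-likelihood over labeled preference pairs. Scores are invented.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_log_likelihood(scores, preferred_pairs):
    """preferred_pairs: (j, k) meaning document j should rank above k."""
    return sum(math.log(sigmoid(scores[j] - scores[k]))
               for j, k in preferred_pairs)

scores = {"d1": 2.0, "d2": 0.5, "d3": -1.0}
pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]

print(pairwise_log_likelihood(scores, pairs))
```

A model whose scores agree with the preferences gets a log-likelihood close to 0; a model that reverses them is penalized heavily, which is the gradient signal boosting uses.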
Ads click prediction
› Search ads
› User enters query or clicks a link; advertiser enters keywords
› Match query and keywords, then show the best ads
› Which are the best?
› Relevant ads which maximize revenue
› Expected money = P(click) * Bid
› Goodness = P(click) * Bid * Relevance
Ads click prediction
› Need to estimate probability of click
› Solution: use log-loss
› P(click) = σ(M(ad))
› Maximizes likelihood of data given the predictions
› But if ranking doesn’t change, no point in better predictions
› Don’t waste model resources on approximating probabilities
› Use a combination of classification and ranking objectives
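The calibration step above can be sketched as follows: the raw model score is squashed into a click probability via the sigmoid, and log-loss is the objective that makes those probabilities match observed clicks. The scores and click labels below are illustrative, not real data.

```python
# Sketch: P(click) = sigma(M(ad)) and the log-loss it optimizes.
# Scores and labels are invented for illustration.
import math

def p_click(score):
    return 1.0 / (1.0 + math.exp(-score))

def log_loss(scores, clicks):
    """clicks: 1 if the ad was clicked, else 0."""
    total = 0.0
    for s, c in zip(scores, clicks):
        p = p_click(s)
        total += -(c * math.log(p) + (1 - c) * math.log(1 - p))
    return total / len(scores)

scores = [1.5, -0.5, -2.0]
clicks = [1, 0, 0]
print([round(p_click(s), 2) for s in scores])
print(log_loss(scores, clicks))
```

Flipping the signs of the scores (predicting clicks where there were none) sharply increases the loss, which is exactly what pushes the model toward calibrated probabilities.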
Churn prediction in telecom
› Which telecom users are going to switch after a week?
› A week is needed to prepare a churn prevention campaign
› Compared with the telecom’s in-house model
› Metric: Lift-10%
› Won by 18.7% on CV, 11.5% on test data (churn rate grew 2x)
› Got most of this delta with the first application of MatrixNet
MatrixNet limitations
› Multiple category features
› Sparse features
› Can’t “divide” discretized features
› Continuous dependency
› “Golden feature”
Conclusions
› MatrixNet is GBDT with bells and whistles
› Handles numeric and categorical features
› Almost no tuning is needed
› Training is parallelized and optimized
› Often superior to other available models
› Some careful feature preparation is needed because of limitations