Machine Learning Prague
MatrixNet
Michael Levin, Chief Data Scientist
Yandex Data Factory
› Created in 2014
› Machine Learning for other industries
› Computing resources
› Machine Learning infrastructure
› Data scientists
What is MatrixNet?
› Gradient Boosting over Decision Trees
› Classification, Regression, Ranking
› Strong results with default parameters
› Easy to use
› Highly optimized
› Training can be local or parallelized on a cluster
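MatrixNet itself is proprietary, but the boosting loop it is built on can be sketched in a few lines. This is a minimal illustration with depth-1 stumps and an MSE objective, not the real implementation: each new tree fits the residuals of the current ensemble.

```python
# Minimal gradient boosting over decision trees (depth-1 stumps, MSE).
# Illustrative sketch only -- MatrixNet's actual trees and objectives differ.

def fit_stump(x, residuals):
    """Find the threshold split on x that best fits the residuals (MSE)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=50, lr=0.1):
    """Each tree fits the residuals of the ensemble built so far."""
    pred = [0.0] * len(x)
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(x, [yi - pi for yi, pi in zip(y, pred)])
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 3.0, 3.1, 2.9]
model = boost(x, y)
print(model(2), model(5))  # predictions near the two group levels
```

The same loop generalizes to deeper trees and other differentiable losses; only the residual computation and the tree-fitting step change.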
MatrixNet applications at Yandex
› Web search ranking
› Ads click prediction
› External projects of YDF
› Recommendations
› Bot detection
› Resolving homonymy
› User segmentation
› …
Some tricks & features
› Oblivious trees
Regular vs oblivious trees
[Diagram: a regular decision tree uses different splits in different nodes (e.g. F1>3 at the root, then F2>3 on one branch and F1>6 on the other), while an oblivious tree applies the same split across an entire level (F1>3 at one level, F2>3 at the next).]
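Because every level of an oblivious tree shares one (feature, threshold) split, a depth-d tree reduces to 2^d leaf values indexed by a bitmask of split outcomes. A sketch of the lookup, with made-up splits and leaf values matching the diagram above:

```python
# Evaluating an oblivious tree: one shared split per level, so prediction
# is a bitmask lookup -- fast and cache-friendly. Values are illustrative.

def oblivious_predict(features, splits, leaves):
    """splits: list of (feature_index, threshold), one per level.
    leaves: list of 2 ** len(splits) leaf values."""
    index = 0
    for feature_index, threshold in splits:
        # each level contributes one bit, same test for every sample
        index = (index << 1) | (features[feature_index] > threshold)
    return leaves[index]

splits = [(0, 3.0), (1, 3.0)]   # level 0: F1 > 3, level 1: F2 > 3
leaves = [0.1, 0.4, 0.6, 0.9]   # indexed by bits 00, 01, 10, 11

print(oblivious_predict([5.0, 1.0], splits, leaves))  # F1>3, F2<=3 -> leaf 10
```

The constrained shape costs some expressiveness per tree, but the ensemble compensates, and evaluation becomes a handful of comparisons plus one array read.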
Some tricks & features
› Oblivious trees
› Leaf regularization
› Gradually increase model complexity
› Different objectives: MSE, Log-loss, combinations and non-standard
› Feature binarization
› Estimates feature importance and correlation
Ranking
› Train based on judged (query, document) pairs
› Labels: Excellent, Good, Moderate, Bad, Stupid
› Features: query, document, query-document, url, host, link, …
› Multiclassification, objective = cross-entropy
› Regression: Excellent = 1, Good = 0.8, Stupid = 0, objective = MSE
› Ranking: objective = nDCG
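The grade-to-relevance mapping and the nDCG objective can be sketched as follows. The slide only gives values for Excellent, Good and Stupid; the Moderate and Bad values below are my own interpolation.

```python
# Sketch: graded labels mapped to relevance values, and the nDCG metric.
# Moderate/Bad values are assumed, not from the talk.
import math

GRADE = {"Excellent": 1.0, "Good": 0.8, "Moderate": 0.5, "Bad": 0.2, "Stupid": 0.0}

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance values."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(ranked_grades):
    """DCG of the given order, normalized by the DCG of the ideal order."""
    rels = [GRADE[g] for g in ranked_grades]
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

print(ndcg(["Excellent", "Good", "Stupid"]))  # ideal order -> 1.0
print(ndcg(["Stupid", "Good", "Excellent"]))  # inverted order -> below 1.0
```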
Ranking
› nDCG is non-smooth, so there is no gradient for gradient boosting
› Approximate with a smooth ranking objective
› Alternative: pairwise approach
› P(r(dij) < r(dik)) = σ(M(dij) – M(dik))
› Maximize likelihood of data given predictions
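The pairwise objective above can be sketched directly: the probability that document j outranks document k is modeled as the sigmoid of the score difference, and training maximizes the log-likelihood of the observed preference pairs. The scores below are made-up model outputs.

```python
# Sketch of the pairwise ranking objective: P(j beats k) = sigma(M_j - M_k),
# maximize log-likelihood over labeled preference pairs. Scores are invented.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_log_likelihood(scores, preferred_pairs):
    """preferred_pairs: (j, k) meaning document j should rank above k."""
    return sum(math.log(sigmoid(scores[j] - scores[k]))
               for j, k in preferred_pairs)

scores = {"d1": 2.0, "d2": 0.5, "d3": -1.0}
pairs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3")]

print(pairwise_log_likelihood(scores, pairs))
```

A model whose scores agree with the preferences gets a log-likelihood close to 0; a model that reverses them is penalized heavily, which is the gradient signal boosting uses.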
Ads click prediction
› Search ads
› User enters query or clicks a link; advertiser enters keywords
› Match query and keywords, then show the best ads
› Which are the best?
› Relevant ads which maximize revenue
› Expected money = P(click) * Bid
› Goodness = P(click) * Bid * Relevance
Ads click prediction
› Need to estimate probability of click
› Solution: use log-loss
› P(click) = σ(M(ad))
› Maximizes likelihood of data given the predictions
› But if ranking doesn’t change, no point in better predictions
› Don’t waste model resources on approximating probabilities
› Use a combination of classification and ranking objectives
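The calibration step above can be sketched as follows: the raw model score is squashed into a click probability via the sigmoid, and log-loss is the objective that makes those probabilities match observed clicks. The scores and click labels below are illustrative, not real data.

```python
# Sketch: P(click) = sigma(M(ad)) and the log-loss it optimizes.
# Scores and labels are invented for illustration.
import math

def p_click(score):
    return 1.0 / (1.0 + math.exp(-score))

def log_loss(scores, clicks):
    """clicks: 1 if the ad was clicked, else 0."""
    total = 0.0
    for s, c in zip(scores, clicks):
        p = p_click(s)
        total += -(c * math.log(p) + (1 - c) * math.log(1 - p))
    return total / len(scores)

scores = [1.5, -0.5, -2.0]
clicks = [1, 0, 0]
print([round(p_click(s), 2) for s in scores])
print(log_loss(scores, clicks))
```

Flipping the signs of the scores (predicting clicks where there were none) sharply increases the loss, which is exactly what pushes the model toward calibrated probabilities.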
Churn prediction in telecom
› Which telecom users are going to switch after a week?
› A week is needed to prepare a churn prevention campaign
› Compared with the telecom’s in-house model
› Metric: Lift-10%
› Won by 18.7% on CV, 11.5% on test data (churn rate grew 2x)
› Got most of this delta with the first application of MatrixNet
MatrixNet limitations
› Multiple category features
› Sparse features
› Can’t “divide” discretized features
› Continuous dependency
› “Golden feature”
Conclusions
› MatrixNet is GBDT with bells and whistles
› Handles numeric and categorical features
› Almost no tuning is needed
› Training is parallelized and optimized
› Often superior to other available models
› Some careful feature preparation is needed because of limitations