CIKM CUP 2016: Track 1
Cross-Device Linking
Alexey Grigorev
Berlin Machine Learning, 2016.12.05
About Me
● Software Developer
● BI Masters @ TU Berlin
● Data Scientist
CIKM Cup 2016: Cross-Device Linking
[Diagram labels: user, advertisements, ad providers]
Goal: Restore the Graph
● Training data: the links are known
● New, unseen devices: no links
Data
● 240k train “users” (devices), 100k test users
● 500k train device-device pairs, 215k test pairs
● Denormalized: 2.5 GB of click logs + 1 GB of URLs & titles
● 67m clicks in total, 197 clicks per user on average
How to Approach?
● Machine Learning?
● First, optimize Recall
  ○ IR, unsupervised
  ○ Select “candidate” device-device pairs
  ○ Build a design matrix
● Then, optimize Precision
  ○ ML, supervised
  ○ Push the true pairs up the list
● Select the top K pairs such that F1 is maximal
Optimizing Recall
● Recall: the fraction of all true device pairs that we discover
● Information Retrieval problem!
● For each device, we need to find the most similar ones
● Device == Document with
  ○ Tokens from all visited URLs + tokens from all titles
  ○ All put together into one single document
● Then use standard IR methods like TF-IDF
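The "device == document" idea can be sketched with nothing but the standard library. This is an illustrative sketch, not the author's code: the function names, the toy token lists and the smoothed IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps device_id -> list of tokens (URL tokens + title tokens).

    Returns device_id -> {token: weight}, L2-normalized so that a dot
    product between two vectors is their cosine similarity.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each token once per device

    vectors = {}
    for device, tokens in docs.items():
        term_freq = Counter(tokens)
        # Smoothed IDF (an assumption; other IDF variants work as well).
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in term_freq.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[device] = {t: w / norm for t, w in vec.items()} if norm else {}
    return vectors


def cosine(u, v):
    """Dot product of two L2-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def top_candidates(vectors, device, k):
    """Return the k devices most similar to `device` (brute force)."""
    scored = [
        (cosine(vectors[device], vec), other)
        for other, vec in vectors.items()
        if other != device
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [other for _, other in scored[:k]]


# Hypothetical toy data: one "document" per device.
docs = {
    "phone": ["news", "sport", "football", "news"],
    "laptop": ["news", "football", "scores"],
    "tablet": ["cooking", "recipes", "pasta"],
}
vectors = tfidf_vectors(docs)
candidates = top_candidates(vectors, "phone", k=1)
```

The brute-force comparison is quadratic in the number of devices, so at the scale of this dataset a search engine index is the practical choice.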
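Optimizing Recall
[Diagram: IR step. For each device, an ES MLT (Elasticsearch More Like This) query orders the other devices from most similar to least similar and keeps the top 70 candidates; candidate pairs are labeled 1 or 0.]
● (240k + 100k) devices * 70 candidates = 24m candidate pairs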
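A candidate-selection request of this kind could look roughly as follows. The sketch only builds the request body for an Elasticsearch `more_like_this` query; the index name `devices`, the field names `urls` and `titles`, and the term-selection settings are hypothetical and not taken from the slides.

```python
def build_mlt_query(device_id, index="devices", fields=("urls", "titles"),
                    n_candidates=70):
    """Build a search body asking for the documents most similar to one device.

    `index` and `fields` are placeholder names for this example.
    """
    return {
        "size": n_candidates,  # keep the top 70 candidates per device
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # "like" may reference a document that is already indexed
                "like": [{"_index": index, "_id": device_id}],
                "min_term_freq": 1,      # illustrative values
                "max_query_terms": 25,
            }
        },
    }


body = build_mlt_query("device-42")

# Size of the resulting candidate set, as on the slide:
n_devices = 240_000 + 100_000
n_candidate_pairs = n_devices * 70  # 23.8m, i.e. about 24m pairs
```

The body would then be sent to the search endpoint of the index once per device.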
Optimizing Precision
● Now we have high Recall, but low Precision
  ○ Recall: the fraction of all positive pairs that we discover
  ○ Precision: the fraction of positive pairs within our results
● Use supervised Machine Learning to improve it
Next steps:
● Create features for each device pair
● Train a ranking ML model
● Take the top most reliable predictions
Features: Profiling
● Create a profile for each user
● Profile from sessions (30 minutes inactivity cut):
  ○ Session duration
  ○ Number of visits per session
  ○ Number of sessions with only one visit
  ○ Duration of breaks between and within sessions
  ○ Number of consecutive requests with a ≤ 1 ms delay
  ○ Starts and ends of sessions
  ○ Number of unique domains per session
  ○ Similarity of domains/URLs/titles within each session
  ○ For all features: min, mean, max and std
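The session cut and a few of the listed aggregates can be sketched as below. It assumes click timestamps in seconds and uses hypothetical function and feature names; only a subset of the profile features is shown.

```python
from statistics import mean, pstdev

SESSION_GAP_SECONDS = 30 * 60  # 30 minutes of inactivity starts a new session


def split_sessions(timestamps, gap=SESSION_GAP_SECONDS):
    """Group click timestamps (in seconds) into sessions."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity cut: open a new session
    return sessions


def summarize(values):
    """min / mean / max / std, as applied to every profile feature."""
    return {
        "min": min(values),
        "mean": mean(values),
        "max": max(values),
        "std": pstdev(values),
    }


def build_profile(timestamps):
    """Compute a small, illustrative subset of the session features."""
    sessions = split_sessions(timestamps)
    durations = [s[-1] - s[0] for s in sessions]
    visits = [len(s) for s in sessions]
    return {
        "n_sessions": len(sessions),
        "n_single_visit_sessions": sum(1 for s in sessions if len(s) == 1),
        "duration": summarize(durations),
        "visits": summarize(visits),
    }


# Hypothetical clicks of one device, in seconds since the first click.
clicks = [0, 60, 120, 4000, 4100, 10000]
profile = build_profile(clicks)
```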
Features: Device-Device Similarities
● |profile1.feature - profile2.feature|
● TF-IDF similarity of
  ○ Domains
  ○ Titles
  ○ URLs
● LSA similarity of the same
● 54 features in total
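The profile-difference part is simple enough to show in a few lines. The sketch assumes each profile is a flat dict of numeric features; the feature names are made up for the example, and the TF-IDF and LSA similarities are not covered here.

```python
def profile_difference_features(profile1, profile2):
    """|profile1.feature - profile2.feature| for every shared feature."""
    shared = sorted(profile1.keys() & profile2.keys())
    return {f"absdiff_{name}": abs(profile1[name] - profile2[name])
            for name in shared}


# Hypothetical flat profiles of two devices.
phone = {"session_duration_mean": 310.0, "visits_per_session_mean": 4.0}
laptop = {"session_duration_mean": 250.0, "visits_per_session_mean": 6.5}

pair_features = profile_difference_features(phone, laptop)
```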
Optimizing Precision: Ranking Model
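[Diagram: the ranking model scores every candidate pair (labels 1/0, example scores 0.90, 0.87, 0.7, 0.3, 0.2); pairs are ordered by score and the top K are kept.]
● Train: 240k * 70 = 17m pairs
● Test: 100k * 70 = 7m pairs
● Top K: 100k-200k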
Features: Importance
[Chart: XGB feature importance, measured as the number of times a feature is used in a split; feature groups: pair features, profile difference features.]
Cross-Validation
● FOLD 1 vs FOLD 2: information leak!
Cross-Validation
● Split the graph into non-overlapping regions
● For each region separately
  ○ Build an ES index (i.e. apply a filter)
  ○ Build a model
● Evaluation (AUC + F1 score):
  ○ Apply the fold 1 model to the fold 2 data
  ○ And vice versa
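One way to get non-overlapping regions is to keep every connected component of the training graph inside a single fold, so that no known pair crosses the fold boundary. The slides do not say how the split was actually done, so the sketch below is an assumption, and all names in it are hypothetical.

```python
from collections import defaultdict


def connected_components(pairs):
    """Return the connected components of an undirected device graph."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                      # iterative depth-first search
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def assign_folds(pairs, n_folds=2):
    """Greedily put whole components into the currently smallest fold."""
    fold_of, sizes = {}, [0] * n_folds
    for component in sorted(connected_components(pairs), key=len, reverse=True):
        fold = sizes.index(min(sizes))
        for device in component:
            fold_of[device] = fold
        sizes[fold] += len(component)
    return fold_of


# Hypothetical known device-device links.
train_pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")]
fold_of = assign_folds(train_pairs)
```

With such an assignment, the index and the model for one fold are built only from devices of that fold, and the model is then scored on the other fold.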
Evaluation
Public/private test split: 50/50
● During the competition: evaluation on the 1st half of the data
● After the competition: the 2nd half
● P = 0.5 of the real P
[Plot on the test data: normal F1 vs. the “real” evaluation function]
Choosing K
● Order the pairs by probability
● For each K, calculate P, R and F1
● Select the best K such that F1 is max
● 8th position
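The K selection can be sketched as a single pass over the ranked list. This assumes a validation set where the true pairs are known and where every pair is stored in one canonical order; the function name and the toy scores are illustrative.

```python
def best_k(scored_pairs, true_pairs):
    """Return (K, F1) for the cut-off K that maximizes F1.

    scored_pairs: list of ((device_a, device_b), probability)
    true_pairs:   non-empty set of (device_a, device_b) known to be linked
    """
    if not true_pairs:
        raise ValueError("true_pairs must not be empty")

    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    best, hits = (0, 0.0), 0
    for k, (pair, _score) in enumerate(ranked, start=1):
        if pair in true_pairs:
            hits += 1
        precision = hits / k
        recall = hits / len(true_pairs)
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        if f1 > best[1]:
            best = (k, f1)
    return best


# Hypothetical scored candidates and ground truth.
scored = [
    (("a", "b"), 0.90),
    (("c", "d"), 0.87),
    (("e", "f"), 0.70),
    (("g", "h"), 0.30),
    (("i", "j"), 0.20),
]
truth = {("a", "b"), ("c", "d"), ("g", "h")}
k, f1 = best_k(scored, truth)
```

On the test set the true pairs are unknown, so the K found on validation data has to be transferred, for example as a fraction of the candidate list.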
Post-Competition
What did others do?
● Using several candidate selection methods
● Stacking with rank features (by D. Dremov)
● Markov Clustering (by I. Bendyna)
Rank Features
source: http://gh.mltrainings.ru/presentations/Dremov_CIKMCup2016_DCA.pdf slide 9
● Relative position of a node within a group
● Motivation: “local” within-group effect instead of global
● df_train.groupby('user_1')[feature].rank()
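A small pandas sketch shows what the `groupby(...).rank()` call produces. The data frame, the feature name `tfidf_sim` and the descending rank order are assumptions for the example; the one-liner on the slide uses the default ascending order.

```python
import pandas as pd

# Hypothetical candidate pairs: user_1 is the query device,
# user_2 a candidate, tfidf_sim one of the pair features.
df_train = pd.DataFrame({
    "user_1": ["a", "a", "a", "b", "b"],
    "user_2": ["x", "y", "z", "x", "w"],
    "tfidf_sim": [0.90, 0.20, 0.70, 0.30, 0.87],
})


def add_rank_feature(df, feature, group_col="user_1"):
    """Add the rank of `feature` within each group (1 = highest value)."""
    out = df.copy()
    out[f"{feature}_rank"] = out.groupby(group_col)[feature].rank(ascending=False)
    return out


ranked = add_rank_feature(df_train, "tfidf_sim")
```

A similarity of 0.87 is only second best globally here, but it is the best candidate for device `b`, which is the "local" effect the rank feature captures.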
Stacking (post competition)
[Diagram labels: all features, XGBoost, ET, best features, XGBoost, rank features]
● 8th → 5th position
Markov Clustering
source: http://gh.mltrainings.ru/presentations/Bendyna_CIKMCup2016_DCA.pdf
● take a connected component
● add loops
● put into a Markov Matrix M
  ○ also called “Stochastic Matrix”
  ○ values in cols sum up to 1
● calculate M ** n
  ○ ~ n Random Walk steps
● for each element M.v = M.v ** p
  ○ makes weak links weaker
● re-normalize and repeat
Animation: http://micans.org/mcl/ani/mcl-animation.html
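The steps above can be sketched in NumPy as follows. This is a minimal illustration of the loop, not the implementation from the cited presentation; the parameter defaults and the way clusters are read off the final matrix are assumptions. Note that in NumPy `**` is element-wise, so the "M ** n" step needs a true matrix power.

```python
import numpy as np


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    return m / m.sum(axis=0, keepdims=True)


def markov_clustering(adjacency, expansion=2, inflation=2.0,
                      n_iter=50, tol=1e-6):
    """Run expansion/inflation rounds on one connected component."""
    m = np.asarray(adjacency, dtype=float)
    m = m + np.eye(m.shape[0])           # add loops
    m = normalize_columns(m)             # Markov matrix M
    for _ in range(n_iter):
        previous = m
        m = np.linalg.matrix_power(m, expansion)  # M ** n: n random walk steps
        m = normalize_columns(m ** inflation)     # element-wise power, re-normalize
        if np.allclose(m, previous, atol=tol):
            break
    return m


def extract_clusters(m, threshold=1e-6):
    """Each row with non-zero entries lists the members of one cluster."""
    clusters = set()
    for row in m:
        members = tuple(int(i) for i in np.flatnonzero(row > threshold))
        if members:
            clusters.add(members)
    return sorted(clusters)


# Hypothetical graph: two triangles joined by a single bridge edge (2-3).
bridge_graph = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
result = markov_clustering(bridge_graph)
clusters = extract_clusters(result)
```

A larger `inflation` value weakens weak links faster and tends to give smaller clusters.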
Links & Further Info
● Competition website: http://cikmcup.org/
● Competition platform: https://competitions.codalab.org/competitions/11171
● My solution: https://github.com/alexeygrigorev/cikm-cup-2016-cross-device
● Reports: http://cikmcup.org/workshop.html
Self-promotion:
● http://alexeygrigorev.com/
● [email protected]
Thank you. Questions?