These are the slides of the presentation I gave at the Realtime Conf EU on 23rd April 2013. The full abstract of the talk can be found here: http://lanyrd.com/2013/realtime-conf-europe/scdtyf/
DiscoRank: Optimizing Discoverability on SoundCloud
Amélie Anglade
• Developer at SoundCloud
• SoundCloud is the world’s largest social sound platform
• Academic background in Music Information Retrieval (MIR)
• Design, prototype and implement Machine Learning algorithms for music discovery
DISCOVERABILITY ?
PAGERANK
• The web is a graph:
  • nodes = web pages
  • edges = hyperlinks
• The (Page)rank of a node depends on the link structure of the graph
WEB AND PAGERANK
RANDOM SURFER
[Figure: graph with nodes A, B, C, D; three links each labeled with probability 1/3]
Nodes visited more often:
  • Nodes with many links
  • Coming from frequently visited nodes
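The random surfer idea can be tried out with a short simulation. This is a minimal sketch: the four-node graph below is a hypothetical example (not necessarily the one drawn on the slide), and the function name `random_surf` is made up for illustration.

```python
import random
from collections import Counter

# Hypothetical graph: A links to B, C and D, so a surfer at A
# follows each outgoing link with probability 1/3.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["A"],
    "C": ["A", "D"],
    "D": ["A"],
}

def random_surf(links, start, steps, seed=0):
    """Follow `steps` random links and count how often each node is visited."""
    rng = random.Random(seed)
    visits = Counter()
    node = start
    for _ in range(steps):
        node = rng.choice(links[node])  # pick one outgoing link uniformly
        visits[node] += 1
    return visits

visits = random_surf(LINKS, start="A", steps=100_000)
for node, count in visits.most_common():
    print(node, round(count / 100_000, 3))
```

In this example every other node links back to A, so A ends up being the most visited node.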
Adjacency matrix A
COMPUTING THE PAGERANK
[Figure: example graph with nodes A, B, C, D, E]
Transition probability matrix M
Probability distribution of surfer’s position
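The three objects named on this slide can be sketched in a few lines. The five-node graph below is a hypothetical example, and the helper names `transition_matrix` and `step` are assumptions made for illustration.

```python
# Hypothetical 5-node graph; row i marks the nodes that node i links to.
NODES = ["A", "B", "C", "D", "E"]
A = [
    [0, 1, 1, 0, 0],  # A -> B, C
    [0, 0, 1, 0, 0],  # B -> C
    [1, 0, 0, 1, 1],  # C -> A, D, E
    [0, 0, 0, 0, 1],  # D -> E
    [1, 0, 0, 0, 0],  # E -> A
]

def transition_matrix(adj):
    """Divide each row by its out-degree so that every row sums to 1."""
    rows = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("dead-end node: needs teleport handling")
        rows.append([v / out_degree for v in row])
    return rows

def step(x, M):
    """One click of the surfer: x_next[j] = sum_i x[i] * M[i][j]."""
    n = len(x)
    return [sum(x[i] * M[i][j] for i in range(n)) for j in range(n)]

M = transition_matrix(A)
x0 = [1.0, 0.0, 0.0, 0.0, 0.0]  # the surfer starts at A
x1 = step(x0, M)                # position after one click
x2 = step(x1, M)                # position after two clicks
print(dict(zip(NODES, x1)))
print(dict(zip(NODES, x2)))
```

Each call to `step` moves the probability distribution of the surfer's position forward by one click.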
TELEPORT
[Figure: example graph with nodes A, B, C, D, E]
If the graph has N nodes, the probability of teleporting to any given node (including the current one) = 1/N
[Figure: graph with nodes A, B, C, D, E; teleport edges each labeled 1/N; at a regular node, edges labeled α and 1-α]
At regular node: invoke teleport operation with probability α and standard random walk with probability (1 - α)
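Teleporting can be folded into the transition matrix. The sketch below is a minimal illustration: the value α = 0.15, the function name `teleport_matrix`, and the rule that a node without outgoing links always teleports are assumptions (the last one is a common textbook convention), not details confirmed by the talk.

```python
def teleport_matrix(adj, alpha=0.15):
    """Row-stochastic matrix mixing the random walk with uniform teleports."""
    n = len(adj)
    G = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumed convention: a dead end always teleports, uniformly.
            G.append([1.0 / n] * n)
        else:
            # Teleport with probability alpha, follow a link with 1 - alpha.
            G.append([alpha / n + (1 - alpha) * v / out_degree for v in row])
    return G

# Hypothetical 3-node graph in which node 2 has no outgoing links.
ADJ = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
G = teleport_matrix(ADJ)
for row in G:
    print([round(p, 3) for p in row])
```

Every row still sums to 1, so the result remains a valid transition probability matrix.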
Probability distribution of the surfer at any time is a vector.
COMPUTING THE PAGERANK
That vector converges to a steady state: the PageRank vector.
PAGERANK EQUATION
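A common textbook way to write the steady state, consistent with the teleport description above, is x = x·G with G = (1 - α)·M + (α/N)·J, where J is the all-ones matrix. This notation is an assumption and may differ from the one used in the talk. The sketch below finds x by repeatedly applying G (power iteration); the function name, α = 0.15, the tolerance, and the example graph are all hypothetical.

```python
def pagerank(adj, alpha=0.15, tol=1e-10, max_iter=1000):
    """Power iteration: apply G until the distribution stops changing."""
    n = len(adj)
    # Build G row by row; dead ends are assumed to teleport uniformly.
    G = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            G.append([1.0 / n] * n)
        else:
            G.append([alpha / n + (1 - alpha) * v / out_degree for v in row])
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        x_next = [sum(x[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(x_next, x)) < tol:
            return x_next
        x = x_next
    return x

# Hypothetical graph: nodes 1, 2 and 3 all link to node 0; node 0 links to node 1.
HUB = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
ranks = pagerank(HUB)
print([round(r, 4) for r in ranks])
```

Node 0 receives links from every other node and ends up with the highest rank, followed by node 1, which is linked from node 0.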
SOUNDCLOUD DISCORANK
DISCORANK
[Figure: graph with nodes A to E of types User, Track, and Playlist, connected by edges labeled favorite, follow, and featured in]
• Search across People, Sounds, Sets, Groups
• One unique rank vector that contains all entities
• Weight the links based on the type of event:
  • User favorites Track
  • Track is featured in Playlist
...
• New big (but sparse) adjacency matrix:
UNIVERSAL SEARCH
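A weighted adjacency matrix over mixed entity types can be pictured as a dictionary of dictionaries that only stores existing edges. This is a sketch under stated assumptions: the weight values, the event type names, and the `(type, id)` keys are hypothetical, since the talk does not give the real ones.

```python
from collections import defaultdict

# Hypothetical weights per event type (not the values used at SoundCloud).
EVENT_WEIGHTS = {"follow": 1.0, "favorite": 2.0, "featured_in": 3.0}

def build_sparse_adjacency(events, weights=EVENT_WEIGHTS):
    """Map each source entity to {target entity: accumulated edge weight}."""
    adj = defaultdict(lambda: defaultdict(float))
    for source, event_type, target in events:
        adj[source][target] += weights[event_type]
    return adj

# Entities of all types share one graph, keyed by (type, id).
events = [
    (("user", 1), "follow", ("user", 2)),
    (("user", 1), "favorite", ("track", 10)),
    (("user", 2), "favorite", ("track", 10)),
    (("track", 10), "featured_in", ("playlist", 7)),
]
adj = build_sparse_adjacency(events)
for source, targets in adj.items():
    print(source, dict(targets))
```

Only the four existing edges are stored, while a dense matrix over the same four entities would need sixteen cells.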
• How do we identify content that is trending?
• The more recent a listen, favorite, etc. (event) the higher the weight
• Multiply each event (=edge) by a time decay:
• New adjacency matrix:
BACK TO EXPLORE
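The decay function is not spelled out here, so the sketch below assumes an exponential decay with a hypothetical half-life of seven days. The function name `decayed_weight` and its parameters are illustrative only.

```python
SECONDS_PER_DAY = 86400.0

def decayed_weight(base_weight, event_time, now, half_life_days=7.0):
    """Halve the weight of an event for every `half_life_days` of age."""
    age_days = (now - event_time) / SECONDS_PER_DAY
    return base_weight * 0.5 ** (age_days / half_life_days)

now = 30 * SECONDS_PER_DAY                 # hypothetical "current" timestamp
fresh = decayed_weight(2.0, now, now)      # event happening right now
week_old = decayed_weight(2.0, now - 7 * SECONDS_PER_DAY, now)
month_old = decayed_weight(2.0, 0.0, now)  # event from 30 days ago
print(fresh, week_old, round(month_old, 4))
```

With this choice a favorite from a week ago counts half as much as one from today, so recently active content rises in the ranking.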
PERFORMANCE OPTIMIZATION
• Millions of entities (= nodes) and events (= edges)
• First DiscoRank: several hours of computation
• Trimmed down to a few minutes using:
  • Sparse matrix
  • Optimized storage of the graph in memory
  • Versioned copies of the DiscoRank
• So technically we could compute the DiscoRank in real time
A VERY LARGE GRAPH
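The benefit of a sparse representation shows up in the rank update itself: one pass only has to touch existing edges instead of all N×N cells. The following is a sketch, not the production code; the function name, α = 0.15, and the uniform handling of dead ends are assumptions.

```python
def sparse_rank_step(rank, out_links, alpha=0.15):
    """One rank update whose cost is proportional to the number of edges."""
    n = len(rank)
    new_rank = [0.0] * n
    dangling_mass = 0.0
    for node, targets in enumerate(out_links):
        if not targets:
            dangling_mass += rank[node]  # dead end: mass is spread uniformly
            continue
        share = (1 - alpha) * rank[node] / len(targets)
        for target in targets:
            new_rank[target] += share
    # Teleport mass from regular nodes plus all mass from dead ends.
    base = (alpha * (sum(rank) - dangling_mass) + dangling_mass) / n
    return [r + base for r in new_rank]

# Hypothetical graph as adjacency lists: node index -> list of target indices.
OUT_LINKS = [[1, 2], [2], [0], []]
rank = [0.25] * 4
for _ in range(50):
    rank = sparse_rank_step(rank, OUT_LINKS)
print([round(r, 4) for r in rank])
```

The adjacency lists hold four edges in total, and each update visits exactly those four edges plus one constant-time correction per node.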
• Re-mapping entity ids
• Memory optimization so the graph fits in memory:
  • All edge details are stored in memory in a byte[]
  • Buffer the byte[] into an opaque byte block pool
  • No objects
  • Sort the buffered byte[] in place
• On disk and when computing the DiscoRank:
  • Delta-encoded ordered adjacency lists:
    • One “from” node, several “to” nodes
    • Delta encode the “to” node ids
USING SPARSITY
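Delta encoding an ordered adjacency list stores each “to” id as the difference from the previous one, which keeps the numbers small. The sketch below pairs it with a variable-length byte encoding to show the size effect. The varint step is an assumption added for illustration (the talk only mentions delta encoding), and all function names and ids are hypothetical.

```python
def delta_encode(sorted_ids):
    """Replace each id by its gap to the previous id (the first one is kept)."""
    deltas, previous = [], 0
    for node_id in sorted_ids:
        deltas.append(node_id - previous)
        previous = node_id
    return deltas

def delta_decode(deltas):
    """Rebuild the original ids with a running sum."""
    ids, current = [], 0
    for delta in deltas:
        current += delta
        ids.append(current)
    return ids

def varint(n):
    """Encode a non-negative int in 7-bit groups (assumed byte format)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

to_ids = [1000004, 1000007, 1000042]  # ordered "to" nodes of one "from" node
deltas = delta_encode(to_ids)
raw_bytes = b"".join(varint(i) for i in to_ids)
delta_bytes = b"".join(varint(d) for d in deltas)
print(deltas, len(raw_bytes), len(delta_bytes))
```

Because the list is sorted, the gaps are never negative and decoding is a simple running sum.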
• We keep versioned copies of:
  • the DiscoRank vector of results
  • the DiscoRank graph
• We rebuild the entire DiscoRank graph from scratch once a week
• In between:
  • we create additional graph segments with new entities and events
  • and use the results of the previous DiscoRank run as a prior for the DiscoRank computation
• Side effect:
  • Also allows for experimentation
VERSIONED DISCORANK
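Using the previous results as a prior amounts to warm-starting the iteration. The sketch below assumes a plain power iteration and a simple way of giving new entities an initial share; the function names, α = 0.15, the tolerance, and the example graphs are hypothetical.

```python
def pagerank(out_links, alpha=0.15, prior=None, tol=1e-10, max_iter=1000):
    """Return (rank vector, iterations used), optionally starting from a prior."""
    n = len(out_links)
    rank = list(prior) if prior is not None else [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        new_rank = [0.0] * n
        dangling = 0.0
        for node, targets in enumerate(out_links):
            if not targets:
                dangling += rank[node]
                continue
            share = (1 - alpha) * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        base = (alpha * (1.0 - dangling) + dangling) / n
        new_rank = [r + base for r in new_rank]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            return rank, iteration
    return rank, max_iter

def extend_prior(prior, new_size):
    """Give new entities a uniform share, then renormalize to sum to 1."""
    padded = list(prior) + [1.0 / new_size] * (new_size - len(prior))
    total = sum(padded)
    return [p / total for p in padded]

old_graph = [[1, 2], [2], [0], [0]]
new_graph = [[1, 2], [2], [0, 4], [0], [2]]  # one new entity, two new events

old_rank, _ = pagerank(old_graph)
cold_rank, cold_iters = pagerank(new_graph)
warm_rank, warm_iters = pagerank(new_graph, prior=extend_prior(old_rank, 5))
print("cold start:", cold_iters, "iterations; warm start:", warm_iters)
```

Both runs reach the same ranks. When the graph changes little between runs, the warm start typically needs fewer iterations, and a run on an unchanged graph stops after a single iteration.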
• MySQL batch jobs
• DiscoRank results stored in HDFS
• At the end of every DiscoRank run we re-load it in ElasticSearch:
  • For each item we combine its Lucene score with its DiscoRank
INTEGRATION IN OUR INFRASTRUCTURE
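How the two scores are combined is not specified here. Purely as an illustration, the sketch below multiplies the text relevance score by a boost derived from the normalized rank. The formula, the `boost` parameter, the function name, and the sample items are all hypothetical and do not use any ElasticSearch API.

```python
def combined_score(text_score, discorank, max_rank, boost=1.0):
    """Hypothetical combination: scale the text score by a rank-based boost."""
    normalized = discorank / max_rank if max_rank > 0 else 0.0
    return text_score * (1.0 + boost * normalized)

# Hypothetical search hits: (item, text relevance score, DiscoRank).
hits = [
    ("track-a", 2.0, 0.0001),
    ("track-b", 1.9, 0.0009),
    ("track-c", 0.5, 0.0005),
]
max_rank = max(rank for _, _, rank in hits)
ranked = sorted(
    hits,
    key=lambda hit: combined_score(hit[1], hit[2], max_rank),
    reverse=True,
)
print([item for item, _, _ in ranked])
```

In this made-up example "track-b" overtakes "track-a" despite a slightly lower text score, because its rank is much higher.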
Amélie Anglade
Sound/Music Information Retrieval Engineer
about.me/utstikkar
@utstikkar
We’re hiring!
www.soundcloud.com