DataEngConf: The Science of Virality at BuzzFeed

HISTORY OF VIRALITY

THE DATA

THE DATA: OLD VERSION

• Article being viewed
• User viewing article
• Time of pageview
• Referring domain

THE DATA: NEW VERSION

• Article being viewed
• Time of pageview
• Referring domain
• User viewing article
• Referring user

DIFFERENT PERSPECTIVE:

Pageviews are a process on a graph!

WHAT THE GRAPH LOOKS LIKE:

WHAT THE PROCESS LOOKS LIKE:

WHAT THE DATA LOOKS LIKE:

WHAT CAN YOU DO WITH OLD PAGEVIEWS?

(Educated)

Guess!

CONNIE

OLD GRAPH RECONSTRUCTION: MODEL-BASED INFERENCE

• Probabilistic: you can infer connections that aren’t there!
• Error prone: graph statistics can be susceptible to small changes in the graph

The inferred connection probability gets larger as the difference in pageview times gets smaller

SIMPLIFIED VERSION

Observe:

Guess:

SIMPLIFIED VERSION

Guess:

Reality:

Check out a toy implementation here!

github.com/akellehe/pyconnie
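
The model-based guess can be illustrated with a minimal sketch. This is not pyconnie's API: the function name `infer_parent_probabilities`, the exponential kernel, and the `tau` parameter are all assumptions chosen only to show a weight that grows as the gap between two pageview times shrinks.

```python
import math


def infer_parent_probabilities(pageviews, tau=60.0):
    """Guess, for each viewer, which earlier viewer referred them.

    pageviews: iterable of (user, time) pairs for one article.
    tau: assumed decay scale (same units as time) for the exponential kernel.
    Returns {user: {candidate_parent: probability}}.
    """
    ordered = sorted(pageviews, key=lambda pv: pv[1])
    guesses = {}
    for i, (user, t) in enumerate(ordered):
        weights = {}
        for other, t_other in ordered[:i]:
            if t_other < t:
                # Smaller time difference -> larger weight (assumed kernel).
                weights[other] = math.exp(-(t - t_other) / tau)
        total = sum(weights.values())
        if total > 0:
            guesses[user] = {u: w / total for u, w in weights.items()}
    return guesses


views = [("a", 0.0), ("b", 10.0), ("c", 15.0)]
probs = infer_parent_probabilities(views)
```

Here `c` is attributed mostly to `b`, the most recent earlier viewer, while `a` has no candidate parent at all. Because the edges are guesses, downstream graph statistics inherit their error.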

NEW GRAPH RECONSTRUCTION: TRIVIAL

These are actually Unique Visitors …

LIFE IS A LITTLE MESSY…

This is more like what the Pageview graph looks like

PROBLEM: DATA MUNGING

• Lots of potential for heuristics!
• How do we get promotion attribution from propagations?
• Trees are important: how can we be sure we get them?

PROBLEM: STREAMLINING ANALYSIS

• How do we work from a common set of definitions?
• How do we avoid repeating analysis?
• How can we streamline data visualization? EDA?
• How do we share optimized analyses? And avoid inefficient (but correct) algorithms?

DEFINE DATA STRUCTURES!

• All data munging happens “under the hood”
• Data pre-processing is unit-tested
• No room for heuristics: standardization!
• Hard math definitions can be consistency-checked!

PROPAGATION SET

For one article

For the site (or other set of articles, S)

PROPAGATION SET

Pageviews to article b in time T

Pageviews to the site in time T

The simplest data structure. Just a representation of the raw pageview logs.

Represented as a generator of UserEdge objects
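
A minimal sketch of the "generator of UserEdge objects" idea follows. Only the name `UserEdge` comes from the slides; the field names, the tab-separated log format, and the `propagation_set` function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class UserEdge:
    """One pageview: referring_user -> user on an article (hypothetical fields)."""
    article: str
    timestamp: float
    referring_domain: str
    user: str
    referring_user: Optional[str]  # None when no referring user was recorded


def propagation_set(log_lines: Iterable[str], start: float, end: float,
                    article: Optional[str] = None) -> Iterator[UserEdge]:
    """Lazily yield pageviews in the window [start, end), optionally for one article."""
    for line in log_lines:
        art, ts, domain, user, ref_user = line.rstrip("\n").split("\t")
        t = float(ts)
        if not (start <= t < end):
            continue
        if article is not None and art != article:
            continue
        yield UserEdge(art, t, domain, user, ref_user or None)


# Assumed log format: article, time, referring domain, user, referring user.
sample_log = [
    "b\t10\tfacebook.com\tu1\t",
    "b\t20\tfacebook.com\tu2\tu1",
    "c\t25\ttwitter.com\tu3\tu1",
    "b\t99\tfacebook.com\tu4\tu2",
]
edges_for_b = list(propagation_set(sample_log, start=0, end=50, article="b"))
```

Leaving `article` unset gives the site-wide set for the same window. A generator keeps memory use flat, since the raw logs never need to be held in memory at once.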

PROPAGATION GRAPH

INFLUENCE GRAPH

A propagation graph together with a map that measures the influence of the origin user in p on the pageviewing user

CONSIDER:

PROPAGATION FOREST

The propagation graph is great, but we’d also like a concept like unique visitors!

If there is attribution ordering in the graph, we can trace content back to its source!

PROPAGATION FOREST: FIRST PARENT ATTRIBUTION

n pageviews, one UV

The first parent gets the credit
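
First-parent attribution can be sketched in a few lines. The edge representation `(referrer, user, timestamp)` and the function name are assumptions for this example; the idea is only that each user keeps the in-edge from their earliest pageview.

```python
def first_parent_attribution(edges):
    """Keep only the earliest in-edge for every user.

    edges: iterable of (referrer, user, timestamp) tuples (assumed format).
    Returns {user: (referrer, timestamp)}, i.e. at most one parent per user.
    """
    first_parent = {}
    for referrer, user, timestamp in edges:
        current = first_parent.get(user)
        if current is None or timestamp < current[1]:
            first_parent[user] = (referrer, timestamp)
    return first_parent


pageviews = [
    ("promo", "u1", 1),
    ("u1", "u2", 5),
    ("u3", "u2", 3),   # u2's earliest view: u3 gets the credit
    ("u1", "u2", 7),
]
forest = first_parent_attribution(pageviews)
```

Three pageviews by `u2` collapse into a single unique visitor attributed to `u3`.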

RESULT: ALL GRAPHS ARE FORESTS

Promotions have indegree 0, users have indegree 1

total edges in connected components:

Trees!
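
The edge count behind this claim can be written out. This is a reconstruction under stated assumptions, not the slide's original formula: take a connected component C containing one promotion source and k users, with every edge counted once by the indegree of its target.

```latex
|E_C| \;=\; \sum_{v \in C} \deg^{-}(v) \;=\; \underbrace{0}_{\text{promotion}} + \underbrace{1 \cdot k}_{\text{users}} \;=\; k \;=\; |V_C| - 1
```

A connected graph with one fewer edge than it has nodes is a tree, so each such component is a tree and the whole graph is a forest.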

CAREFUL FOR EDGE CASES: MISSING DATA?

All connected components should be rooted at a promotion source.

What happens if we lose the first edge (e.g. use the wrong T)?

PROPAGATION FOREST: CYCLE BREAKING

Consider …

The cycle is not broken by first-parent attribution

Traversal algorithms go on forever!

PROPAGATION FOREST: CYCLE BREAKING

Consider …

As long as they’re not equal, they can be ordered, say

Then, there is a node in the cycle with an out-edge younger than its in-edge:

The original pageview for that node must have been lost. Cut the in-edge (FPA!).
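
The cycle-breaking step might look like the sketch below. It assumes first-parent attribution has already run, so each user has one in-edge stored as `{user: (parent, timestamp)}`, and it reads "younger" as "earlier timestamp". In a cycle with distinct timestamps, the node with the latest in-edge necessarily has an earlier out-edge, so that in-edge is the one cut. The function name is hypothetical.

```python
def break_cycles(parent):
    """Cut one in-edge per cycle in a one-parent-per-node graph.

    parent: {node: (parent_node, timestamp)} as produced by first-parent
    attribution (assumed format). Returns a new dict without cycles.
    """
    parent = dict(parent)
    done = set()
    for start in list(parent):
        path, index_on_path = [], {}
        node = start
        # Walk up parent pointers until a root, a finished node, or a repeat.
        while node in parent and node not in done and node not in index_on_path:
            index_on_path[node] = len(path)
            path.append(node)
            node = parent[node][0]
        if node in index_on_path:
            cycle = path[index_on_path[node]:]
            # This node's in-edge is the latest in the cycle, so its out-edge
            # is earlier: its original pageview must have been lost.
            latest = max(cycle, key=lambda n: parent[n][1])
            del parent[latest]
        done.update(path)
    return parent


# Cycle: c -> a (t=3), a -> b (t=1), b -> c (t=2); d hangs off a promotion.
cyclic = {"a": ("c", 3), "b": ("a", 1), "c": ("b", 2), "d": ("promo", 0)}
acyclic = break_cycles(cyclic)
```

After the cut, `a` has no parent and becomes the root of its component, so traversals terminate.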

SUCCESS!

Cycle-breaking + FPA = Trees!

Each tree is the UV graph downstream from a promotion source: promotion attribution!

Additional Benefits:

Most information diffusion analyses model trees growing on graphs.

Many algorithms simplify when run on trees!

SUPERTREE

We may want to run an algorithm, or calculate a tree statistic, over a whole forest instead of just one tree. How can we do that?

Merge all the roots (promotion sources) together into one “super-node”

The whole forest becomes a SuperTree!
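
A minimal sketch of the merge, assuming the forest is stored as a `{child: parent}` map. The `SUPER_ROOT` label and the function name are hypothetical.

```python
SUPER_ROOT = "__super_root__"  # hypothetical label for the merged super-node


def to_supertree(parent_of):
    """Merge every root of a forest into a single super-node.

    parent_of: {child: parent} for a forest (assumed format). A root is any
    parent that never appears as a child.
    """
    roots = {p for p in parent_of.values() if p not in parent_of}
    return {child: (SUPER_ROOT if p in roots else p)
            for child, p in parent_of.items()}


forest_map = {"u1": "promoA", "u2": "u1", "u3": "promoB"}
supertree = to_supertree(forest_map)
```

Any single-tree algorithm can then be run once from `SUPER_ROOT` instead of once per promotion source.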

SUPERTREE: EXAMPLE

APPLICATION: LARGE SCALE DATA VIS

WHY IS IT SLOW?

Layouts often consider repelling each node from every other: time complexity O(n²)

Good for a few thousand nodes

OPENORD: SIMULATED ANNEALING

Linear main layout

Quadratic settling phase

Implemented in Gephi

OPENORD

Good for ~10k users

Slow for ~100k users

Messy! (if you skip the quadratic step!)

TAKE ADVANTAGE OF TREE STRUCTURE!

Traverse the tree to decide where to place nodes!

H3 LAYOUT

Each parent is in the center of a hemisphere.

Children are laid out on the surface of the hemisphere

They become centers of smaller hemispheres (if they’re parents)

Etc.
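
The recursive placement can be sketched as below. This is a simplified Euclidean illustration, not the H3 algorithm itself and not pyh3's API: every hemisphere points along +z, children are spread with a golden-angle spiral, and the `shrink` factor is an arbitrary assumption.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def hemisphere_layout(children, root, radius=1.0, shrink=0.5):
    """Place each node's children on a hemisphere centred on that node.

    children: {node: [child, ...]} describing a tree (no cycles).
    Returns {node: (x, y, z)}. One pass over the tree, so time is linear
    in the number of nodes.
    """
    positions = {root: (0.0, 0.0, 0.0)}
    stack = [(root, radius)]
    while stack:
        parent, r = stack.pop()
        kids = children.get(parent, [])
        px, py, pz = positions[parent]
        for i, kid in enumerate(kids):
            z = (i + 0.5) / len(kids)        # height on the unit hemisphere
            rho = math.sqrt(1.0 - z * z)     # distance from the vertical axis
            theta = GOLDEN_ANGLE * i         # spread children around the axis
            positions[kid] = (px + r * rho * math.cos(theta),
                              py + r * rho * math.sin(theta),
                              pz + r * z)
            stack.append((kid, r * shrink))  # children get smaller hemispheres
    return positions


tree = {"root": ["a", "b"], "a": ["c"]}
pos = hemisphere_layout(tree, "root")
```

Because each node is visited once and never compared against unrelated nodes, the all-pairs repulsion cost of force-directed layouts is avoided.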

A NEW IMPLEMENTATION

pip install pyh3

WITH D3

MORE APPLICATIONS

ATTRIBUTION

Instead of

CASCADE PREDICTION

GRAPH AND TEMPORAL PROPERTIES ARE IMPORTANT!

TEST THE INFLUENTIALS HYPOTHESIS

IMPROVE CONTENT TARGETING

FINDING THE CAUSES OF VIRALITY

Consider fitting a model:

User features, content features, context features, user-pair features

UNDER CONSTRUCTION:

Online regression!

Real-time feature weights tell which features correlate with propagation probabilities!

Drives hypothesis-building!
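
One way such an online regression could work is sketched below. The slides do not say which model is used, so logistic regression trained by stochastic gradient descent is an assumption, as are the class name, the learning rate, and the two toy features.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one observation at a time (illustrative)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on the log-loss for a single (features, outcome) pair."""
        error = y - self.predict_proba(x)
        self.bias += self.learning_rate * error
        for i, xi in enumerate(x):
            self.weights[i] += self.learning_rate * error * xi


# Toy stream: feature 0 always co-occurs with a propagation, feature 1 never.
model = OnlineLogisticRegression(n_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200
for features, propagated in stream:
    model.update(features, propagated)
```

Inspecting `model.weights` at any point in the stream shows which features currently correlate with propagation, which is the signal meant to drive hypothesis-building. These weights indicate correlation, not causation.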

THE TEAM