Predicting Document Creation Times in News Citation Networks · News Citation Network Overview News...

Preview:

Citation preview

Predicting Document Creation Timesin News Citation Networks

Andreas Spitz1, Jannik Strötgen2, and Michael Gertz1

April 23, 2018 — TempWeb 2018, Lyon

1 Database Systems Research Group 2 Bosch Center for Artificial IntelligenceHeidelberg University, Germany Germany

Hm, when did this happen again?

1

News Citation Networks

News Citation Network Extraction

2

News Citation Network Overview

News articles from RSS feeds:

I Politics and business feeds

I 34 English news outlets(USA, UK, AUS, CAN, GER, CHN, QAT)

I 2 years (Nov 2015 - Oct 2017)

I 244.6 thousand articles

I 367.2 thousand edges

Used data:

I Hyperlinks in the article body

I Publication dates

I Temporal expressions3

News Outlet Statistics (sample)

short news outlet days 〈articles〉 〈temp exp〉 otherin otherout

AT The Atlantic 334 7.2 10.5 16.7 50.6BBC British Bc. Corp. 730 8.1 6.5 19.1 8.0DW Deutsche Welle 334 1.2 6.1 48.1 5.9FOX Fox News 548 2.7 9.8 0.0 0.0NPR National Public Radio 334 0.4 8.4 63.6 58.5NY The New Yorker 548 3.0 13.2 33.5 30.6NYT New York Times 669 23.8 10.7 26.8 4.7SMH Sydney Morn. Herald 548 2.3 7.0 3.0 51.9WP Washington Post 548 62.7 9.4 13.7 5.1

4

Evolution of Network Metrics

clustering coefficient average path length

average degree undirected diameter

2016−01 2016−07 2017−01 2017−07 2016−01 2016−07 2017−01 2017−07

0

20

40

60

5

10

15

1

2

3

0.0

0.2

0.4

0.6

days

mea

sure

val

ue

network aggregated politics business

5

Exploring Citation Chains

6

Article Publication Time Prediction

Task Definition: Publication Time Prediction

7

Available News Citation Network Data

Predict article publication times from:

I Citation network topology

I Publication dates of adjacent articles

I Temporal expressions in adjacent articles

I Not the metadata of the article itself

I Not the article content

8

Available News Citation Network Data

Predict article publication times from:

I Citation network topology

I Publication dates of adjacent articles

I Temporal expressions in adjacent articles

I Not the metadata of the article itself

I Not the article content

8

Feature Extraction

Network Topology Features

Node degree-based features:

I Incoming degree

I Outgoing degree

I Undirected degree

Density-based features:

I Undirected local clustering coe�icient

Centrality-based features:

I Betweenness centrality

I Incoming closeness centrality

I Outgoing closeness centrality

I Page Rank centrality

9

Network Topology Features

Node degree-based features:

I Incoming degree

I Outgoing degree

I Undirected degree

Density-based features:

I Undirected local clustering coe�icient

Centrality-based features:

I Betweenness centrality

I Incoming closeness centrality

I Outgoing closeness centrality

I Page Rank centrality

9

Network Topology Features

Node degree-based features:

I Incoming degree

I Outgoing degree

I Undirected degree

Density-based features:

I Undirected local clustering coe�icient

Centrality-based features:

I Betweenness centrality

I Incoming closeness centrality

I Outgoing closeness centrality

I Page Rank centrality

9

Temporal Network Features

10

Temporal Expression Features

Correlation of temporal expressions:

I good with publication dates ofreferencing articles (incoming edges)

I bad with publication dates ofreferenced articles (outgoing edges)

11

Temporal Expression Features

Correlation of temporal expressions:

I good with publication dates ofreferencing articles (incoming edges)

I bad with publication dates ofreferenced articles (outgoing edges)

11

Missing Features and Imputation

Missing features

I 30.8% of feature values are missing

I 89.6% of articles are missing at least one feature

Imputation of missing values

I Column mean of the feature

12

Missing Features and Imputation

Missing features

I 30.8% of feature values are missing

I 89.6% of articles are missing at least one feature

Imputation of missing values

I Column mean of the feature

12

Evaluation

Regression Methods

Used regression methods:

I BASE: Baseline (average publication date of adjacent articles)

I LR: Linear regression

I BAY: Bayesian ridge regression (Laplace model)

I RF: Random forest

I GB: Gradient boosting (Laplace distribution, decision trees)

I SVM: Support vector machine (radial kernel)

I NN: Neural network (feedforward, one hidden layer)

13

Evaluation Results: Mean Absolute Error (days)

BASE LR BAY NN RF GB SVM

all 66.72 60.46 59.61 26.88 24.98 22.66 26.19in 88.88 66.48 87.55 34.03 32.25 27.49 32.29

out 87.32 59.54 40.24 32.52 30.10 26.68 30.77in+out 18.68 55.45 54.95 12.62 11.23 12.76 14.31

14

Distribution of Absolute Errors

out in+out

all in

BASE LR BAY NN RF GB SVM BASE LR BAY NN RF GB SVM

0

50

100

150

200

250

0

50

100

150

200

250

regression method

abso

lute

err

or (

days

)

method BASE LR BAY NN RF GB SVM

15

Recall by Varying Absolute Error

out in+out

all in

0 20 40 60 0 20 40 60

0

25

50

75

100

0

25

50

75

100

absolute error (days)

reca

ll (p

erce

ntag

e of

pre

dict

ions

< a

bsol

ute

erro

r)

method BASE LR BAY NN RF GB SVM

16

Feature Importance: Random Forest

●●●

●●

Feature importance: random forestm

ax (T

out)

min

(Tin)

µ (T ou

t)µ

(T in)

min

(Tou

t)m

ax (T

in)

max

(Xin)

µ (X

in) c pr

σ (T ou

t)σ

(Xin)

c cl,o

utsp

an (T

out)

σ (T in

)sp

an (T

in)

min

(Xin)

span

(Xin)

c cl,in

min

(Dis

t)de

g out

µ (D

ist)

deg in

deg al

lm

ax (D

ist)

c btw cc

σ (D

ist)

10−3

10−2

10−1

100

rela

tive

impo

rtan

ce

Feature type: ● network topology temporal expression temporal network

17

Feature Importance: Gradient Boosting

●●

Feature importance: gradient boostingm

ax (T

out)

min

(Tin)

deg ou

(T out)

min

(Dis

t)de

g in c prσ

(T out)

σ (T in

(T in)

deg al

lsp

an (T

in)

max

(Tin)

µ (X

in)

min

(Tou

t)µ

(Dis

t)c bt

wm

ax (X

in)

max

(Dis

t)sp

an (X

in)

σ (X

in)

span

(Tou

t)m

in (X

in)

c cl,o

utσ

(Dis

t)c cl

,in cc

10−5

10−4

10−3

10−2

10−1

100

rela

tive

impo

rtan

ce

Feature type: ● network topology temporal expression temporal network

18

Summary & Resources

Summary

News citation networks:

I Focus on anchored links inside the article body

I Constructed like a citation network between articles

Publication date prediction:

I Can be framed as a regression problem

I Average prediction error of 3 weeks

I Temporal network features are most discriminative

19

Resources

Data and implementation are available online:

I [data] News citation network (including URLs)

I [data] Temporal annotations

I [code] Publication date prediction

https://dbs.ifi.uni-heidelberg.de/resources/data/

20

Resources

Data and implementation are available online:

I [data] News citation network (including URLs)

I [data] Temporal annotations

I [code] Publication date prediction

https://dbs.ifi.uni-heidelberg.de/resources/data/

20

Recommended