Temporal Anchor Text as Proxy for User Queries

Thaer Samar, Arjen P. de Vries

Web Archiving 1/2

The Web is a major source of published information

Content on the Web evolves and changes continuously

Many initiatives aim to archive the Web

Petabytes of archived data

Web Archiving 2/2

Web archives are incomplete

It is impossible to include all Web pages due to crawling limitations, e.g., [Masanès06]

Depth-first crawl: focuses only on selected web sites

Breadth-first crawl: focuses on the entire domain, but not in depth

Reconstruct Queries

Our study: evolution of anchor text over time to reconstruct what was important in the past

Information that would be similar to user queries

Inspiration:

Document titles can be used as an approximation of user queries [Jin et al.]

Anchor text exhibits characteristics similar to user queries and document titles [Eiron & McCurley]

Queries in the Past

User queries have usually not been preserved

It is impossible to reconstruct which queries users would have used to search the archive

However, web archives contain more than the Web page content

E.g., page source, different timestamps (archive date, last-modified date), link structure

Link Evidence and Anchor Text

Link information consists of the source URL, the destination URL, and the anchor text

Anchor text is a short text describing the destination page

Has been shown to improve search effectiveness in a large number of Information Retrieval studies

Example link: source http://www.cwi.nl, destination http://www.nwo.nl, anchor text ‘NWO’

Data: Dutch Web Archive

National Library of the Netherlands (KB)

Depth-first (selective) Web archive

Since 2007

10+ TB

8,000+ websites

Our snapshot: 2009-2012

Link Processing

Filtering text/html pages: ~70% of archived objects

URL: http://www.cwi.nl

Archive-Date: 20091201

Content-Type: text/html

<html>

<a href=http://www.nwo.nl> NWO </a>

</html>

Web Archive Record
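The extraction step for a record like the one above can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the authors' implementation; the `record` dictionary and its keys (`url`, `archive_date`, `content_type`, `html`) are a hypothetical layout, and a real archive would be read with a dedicated ARC/WARC reader.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (destination URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, anchor text) pairs
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(record):
    """Return (source, destination, anchor text, YYYYMM) tuples for one record."""
    if record["content_type"] != "text/html":
        return []   # only text/html pages are processed
    parser = AnchorExtractor()
    parser.feed(record["html"])
    parser.close()
    month = record["archive_date"][:6]   # YYYYMMDD -> YYYYMM
    return [(record["url"], dest, text, month) for dest, text in parser.links]


# Hypothetical record mirroring the example on the slide.
record = {
    "url": "http://www.cwi.nl",
    "archive_date": "20091201",
    "content_type": "text/html",
    "html": '<html><a href="http://www.nwo.nl"> NWO </a></html>',
}
links = extract_links(record)
```

Each resulting tuple carries the four fields used in the rest of the pipeline: source URL, destination URL, anchor text, and the archive date truncated to month.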

Link Processing

Filtering: pages of type text/html

Extraction: source URL, destination URL, anchor text, crawl date (YYYYMM)

Cleaning: URL normalization; get host of the source and the destination; clean spam, e.g., rolex watches


Link Processing

Extraction: source URL, destination URL, anchor text, crawl date (YYYYMM)

Cleaning: URL normalization; get host of the source and the destination

Partitioning: based on one-year and one-month granularity

Deduplication: remove duplicate links caused by crawling frequency, i.e., links with the same source, destination, and anchor text
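The cleaning, partitioning, and deduplication steps can be sketched as below. The normalization rules (lowercasing the host, dropping the fragment and a trailing slash) are assumptions made for illustration; the slides do not specify which rules were applied, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Assumed normalization: lowercase scheme/host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def host_of(url):
    """Host part of a URL (lowercased by urlsplit), or '' if there is none."""
    return urlsplit(url.strip()).hostname or ""


def partition_and_dedup(links, granularity="month"):
    """Group links by year or month and keep one copy of each link per partition."""
    width = 4 if granularity == "year" else 6   # YYYY or YYYYMM prefix
    partitions = defaultdict(set)
    for source, dest, anchor, yyyymm in links:
        # A set stores each (source, destination, anchor text) triple once.
        partitions[yyyymm[:width]].add((source, dest, anchor))
    return dict(partitions)


raw = [
    ("http://www.cwi.nl", "http://www.nwo.nl/", "NWO", "200912"),
    ("http://www.cwi.nl/", "HTTP://www.nwo.nl", "NWO", "200912"),   # re-crawl of the same link
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "201001"),
]
clean = [(normalize_url(s), normalize_url(d), a, m) for s, d, a, m in raw]
monthly = partition_and_dedup(clean, "month")
yearly = partition_and_dedup(clean, "year")
```

After normalization, the two December 2009 records collapse into a single link, while the January 2010 record lands in its own monthly (and yearly) partition.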

Hosts Evolution

Important hosts over time

Aggregate links based on the target host, keeping unique source hosts

Multiple pages from the same host linking to the same target host are counted as one

Rank hosts based on the number of source hosts linking to them

% of new hosts over the years

% new hosts in 2012: hosts not in {2009, 2010, 2011}
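The host ranking and the share of new hosts described above might be computed along these lines. This is a sketch under the assumption that links have already been reduced to (source host, target host) pairs; the function names and the toy data are hypothetical.

```python
from collections import defaultdict


def rank_target_hosts(host_links):
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources = defaultdict(set)
    for source_host, target_host in host_links:
        # Many pages on one source host still count as a single source.
        sources[target_host].add(source_host)
    counts = {target: len(srcs) for target, srcs in sources.items()}
    # Highest count first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def percent_new(hosts_by_year, year):
    """Share (in %) of hosts in `year` that occur in no earlier year."""
    earlier = set()
    for key, hosts in hosts_by_year.items():
        if key < year:
            earlier |= set(hosts)
    current = set(hosts_by_year[year])
    if not current:
        return 0.0
    return 100.0 * len(current - earlier) / len(current)


host_links = [
    ("cwi.nl", "nwo.nl"),
    ("cwi.nl", "nwo.nl"),   # second page on cwi.nl, counted once
    ("kb.nl", "nwo.nl"),
    ("cwi.nl", "kb.nl"),
]
ranking = rank_target_hosts(host_links)

hosts_by_year = {
    "2009": {"nwo.nl"},
    "2010": {"nwo.nl", "kb.nl"},
    "2012": {"nwo.nl", "kb.nl", "example.nl", "example.org"},
}
share_2012 = percent_new(hosts_by_year, "2012")
```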

Anchor Text Evolution

Measure the importance of anchor text a over time in time-partitioned links

Aggregate by anchor text

Compute the archive-based popularity

Normalize by the maximum

% new anchor text over the years

Anchor text is new in a specific partition if it does not appear in the previous partitions

Based on one-year granularity: 59% new anchor text

Based on one-month granularity: 34% new anchor text
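A possible reading of the popularity and novelty computations is sketched below. The slides do not define the raw popularity count, so the code assumes it is the number of distinct source hosts using an anchor text within a partition; that choice, like the function names, is an assumption made for illustration.

```python
from collections import defaultdict


def anchor_popularity(links):
    """Max-normalized popularity of each anchor text within one partition.

    Assumption: raw popularity is the number of distinct source hosts
    that use the anchor text.
    """
    hosts = defaultdict(set)
    for source_host, anchor in links:
        hosts[anchor].add(source_host)
    if not hosts:
        return {}
    peak = max(len(h) for h in hosts.values())
    return {anchor: len(h) / peak for anchor, h in hosts.items()}


def new_anchor_share(anchors_by_partition):
    """Per partition, % of anchor texts not seen in any previous partition."""
    seen, shares = set(), {}
    for key in sorted(anchors_by_partition):   # chronological for YYYY / YYYYMM keys
        current = set(anchors_by_partition[key])
        shares[key] = 100.0 * len(current - seen) / len(current) if current else 0.0
        seen |= current
    return shares


links_2009 = [("cwi.nl", "nwo"), ("kb.nl", "nwo"), ("cwi.nl", "twitter"), ("cwi.nl", "nwo")]
popularity = anchor_popularity(links_2009)

shares = new_anchor_share({"2009": {"nwo", "twitter"}, "2010": {"nwo", "canon"}})
```

By construction every anchor text in the first partition counts as new, so the share for the earliest year or month is always 100%.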

WikiStats

Aggregated page views of Wikipedia (WP) pages

From Jan 2008 to Jan 2015

We focus on Feb 2009 to Dec 2012

Similar to the period of our snapshot of the Dutch Web archive

Keep WP titles viewed >= 1,000 times

Matching anchor text to WP titles

Pre-process WP titles like the anchor text

Lowercasing

Stop-word removal

One-year and one-month granularity partitions

Collect titles by exact match with the anchors

Assume anchor popularity equals WP page popularity

Ranked anchor text with WP match

Different rank cut-offs

% overlap decreases as the cut-off increases

~56% of the top-1k anchor texts have a match
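The matching and the overlap at a rank cut-off can be sketched as follows. The stop-word list is a small illustrative set, not the one used in the study, and the function names and sample view counts are hypothetical.

```python
# Illustrative stop words only; the list used in the study is not given.
STOP_WORDS = {"de", "het", "een", "the", "a", "of"}


def preprocess(text):
    """Lowercase and drop stop words, the two steps named on the slide."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


def popular_titles(view_counts, min_views=1000):
    """Pre-processed WP titles viewed at least `min_views` times."""
    return {preprocess(t) for t, views in view_counts.items() if views >= min_views}


def overlap_at_cutoff(ranked_anchors, titles, cutoff):
    """% of the top-`cutoff` anchor texts that exactly match a WP title."""
    top = ranked_anchors[:cutoff]
    if not top:
        return 0.0
    hits = sum(1 for anchor in top if preprocess(anchor) in titles)
    return 100.0 * hits / len(top)


# Made-up view counts and ranking, for illustration only.
view_counts = {"Amsterdam": 5000, "Twitter": 2500, "De Volkskrant": 1200, "Obscure page": 10}
titles = popular_titles(view_counts)
ranked = ["Amsterdam", "twitter", "home", "de Volkskrant"]

overlap_top2 = overlap_at_cutoff(ranked, titles, 2)
overlap_top4 = overlap_at_cutoff(ranked, titles, 4)
```

Generic navigational anchors such as "home" have no title match, which is one way the overlap can fall as the cut-off grows.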

Examples of popular anchor text (with match)

Major cities in the Netherlands

E.g., Amsterdam, Rotterdam, Groningen, and Utrecht

Social web sites

E.g., twitter, linkedin, flickr, and vimeo

Major Dutch daily newspapers

E.g., de Volkskrant, Telegraaf, and Trouw

Dutch public broadcasting

E.g., uitzending gemist

Government web services

E.g., belastingdienst

Discussion

Our original goal was to identify historically trending events from the link evolution recorded in the archive

Unfortunately, we found only a few examples with our current analysis

E.g., ‘canon’ *

However, important anchor text provides an overview of important Dutch entities

* corresponding to an activity initiated by the government to define the canonical historic events in Dutch history

Limitations & Future Work

Exact text matching between anchor text and WP title

E.g., filmpje does not match WP title filmpje!

Additional pre-processing

Stemming, stopping, generalizing from exact match to match with low edit distance

Our analysis is based on a depth-first crawl of a few thousand Dutch websites

Breadth-first crawl such as [CommonCrawl]
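The proposed generalization from exact match to a match with low edit distance could look like the sketch below. It uses plain Levenshtein distance with a hypothetical threshold of one edit; the slides do not fix a distance measure or threshold, so both are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, two rows)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute ca with cb (or keep)
            ))
        previous = current
    return previous[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Titles within `max_distance` edits of the anchor text (assumed threshold)."""
    return [t for t in titles if edit_distance(anchor, t) <= max_distance]


matches = fuzzy_match("filmpje", ["filmpje!", "film"])
```

With a threshold of one edit, the anchor text filmpje is matched to the WP title filmpje!, which exact matching misses. A low threshold keeps unrelated short titles from matching, at the cost of missing longer variants.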

References

[Masanès06] J. Masanès. Web Archiving. Springer, 2006

[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002

[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003

[CommonCrawl] https://commoncrawl.org/

[WikiStats] http://wikistats.ins.cwi.nl/
