Temporal Anchor Text as Proxy for User Queries Thaer Samar, Arjen P. de Vries

Page 1: Temporal Anchor Text as Proxy for user Queries

Temporal Anchor Text as Proxy for User Queries

Thaer Samar, Arjen P. de Vries

Page 2: Temporal Anchor Text as Proxy for user Queries

Web Archiving 1/2

The Web is a major source of published

information

Content on the Web evolves and changes

continuously

Many initiatives aim to archive the Web

Petabytes of archived data

Page 3: Temporal Anchor Text as Proxy for user Queries

Web Archiving 2/2

Web archives are incomplete

Impossible to include all Web pages due to

crawling limitations, e.g., [Masanès06]

Depth-first crawl, focus only on selected web sites

Breadth-first crawl, focus on the entire domain,

but not in depth

Page 4: Temporal Anchor Text as Proxy for user Queries

Reconstruct Queries

Our study: evolution of anchor text over time

to reconstruct what was important in the past

Information that would be similar to user queries

Inspiration:

Document titles can be used as an approximation

of user queries [Jin et al.]

Anchor text exhibits characteristics similar to user

queries and document titles [Eiron & McCurley]

Page 5: Temporal Anchor Text as Proxy for user Queries

Queries in the Past

User queries have usually not been preserved

Impossible to reconstruct which queries the

user would have used to search the archive

However, web archives contain more than the

Web page content

E.g., page source, different timestamps (archive

date, last-modified date), link structure

Page 6: Temporal Anchor Text as Proxy for user Queries

Link Evidence and Anchor Text

Link information represents the source URL, destination URL, and the anchor text

Anchor text is a short text describing the destination page

Has been shown to improve search effectiveness in a large number of Information Retrieval studies

Example link:

Source: http://www.cwi.nl

Destination: http://www.nwo.nl

Anchor text: ‘NWO’

Page 7: Temporal Anchor Text as Proxy for user Queries

Data: Dutch Web Archive

National Library of the Netherlands (KB)

Depth-first (selective) Web archive

Since 2007

10+ TB

8,000+ websites

Our snapshot

2009-2012

Page 8: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering text/html pages

~70% of archived objects

URL: http://www.cwi.nl

Archive-Date: 20091201

Content-Type: text/html

<html>

<a href=http://www.nwo.nl> NWO </a>

</html>

Web Archive Record

Page 9: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering text/html pages

~70% of archived objects

Extraction

Source URL

URL: http://www.cwi.nl

Archive-Date: 20091201

Content-Type: text/html

<html>

<a href=http://www.nwo.nl> NWO </a>

</html>

Web Archive Record

Page 10: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering text/html pages

~70% of archived objects

Extraction

Source URL

Destination URL

URL: http://www.cwi.nl

Archive-Date: 20091201

Content-Type: text/html

<html>

<a href=http://www.nwo.nl> NWO </a>

</html>

Web Archive Record

Page 11: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering text/html pages

~70% of archived objects

Extraction

Source URL

Destination URL

Anchor text

URL: http://www.cwi.nl

Archive-Date: 20091201

Content-Type: text/html

<html>

<a href=http://www.nwo.nl> NWO </a>

</html>

Web Archive Record

Page 12: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering text/html pages

~70% of archived objects

Extraction

Source URL

Destination URL

Anchor text

Archive-date (YYYYMM)

URL: http://www.cwi.nl

Archive-Date: 20091201

Content-Type: text/html

<html>

<a href=http://www.nwo.nl> NWO </a>

</html>

Web Archive Record
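
The extraction step shown on these slides could be sketched roughly as follows. This is only an illustration, not the pipeline used in the study: it assumes the record's URL, Archive-Date, and HTML body have already been separated, and `AnchorParser` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(source_url, archive_date, html):
    """Return (source, destination, anchor text, YYYYMM) tuples for one record."""
    parser = AnchorParser()
    parser.feed(html)
    parser.close()
    month = archive_date[:6]  # e.g. 20091201 -> 200912
    return [(source_url, href, text, month) for href, text in parser.links]


# The example record from the slide
record_html = "<html><a href=http://www.nwo.nl> NWO </a></html>"
links = extract_links("http://www.cwi.nl", "20091201", record_html)
```

For the record on the slide, this yields one link from http://www.cwi.nl to http://www.nwo.nl with anchor text NWO in partition 200912.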

Page 13: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering Pages of type text/html

~70% of archived objects

Extraction

Source URL

Destination URL

Anchor text

Crawl-date (YYYYMM)

Cleaning URL normalization; get host of the source

and the destination

Clean spam e.g., rolex watches

Page 14: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering Pages of type text/html

~70% of archived objects

Extraction

Source URL

Destination URL

Anchor text

Crawl-date (YYYYMM)

Cleaning URL normalization; get host of the source

and the destination

Clean spam e.g., rolex watches

Partitioning Based on one-year and one-month granularity

Page 15: Temporal Anchor Text as Proxy for user Queries

Link Processing

Filtering Pages of type text/html

~70% of archived objects

Extraction

Source URL

Destination URL

Anchor text

Crawl-date (YYYYMM)

Cleaning URL normalization; get host of the source

and the destination

Clean spam e.g., rolex watches

Partitioning Based on one-year and one-month granularity

Deduplication

Remove duplicate links caused by repeated crawls

Same source, destination, and anchor text
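
The cleaning, partitioning, and deduplication steps above might look like the sketch below. The slides do not specify the normalization rules or the spam list, so `normalize_url`, `SPAM_TERMS`, and lowercasing of the anchor text are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical spam list; the slides only mention "rolex watches" as an example
SPAM_TERMS = {"rolex"}


def normalize_url(url):
    """Assumed normalization: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def clean_and_partition(links, granularity="month"):
    """Group cleaned, de-duplicated links by year (YYYY) or month (YYYYMM)."""
    partitions = {}
    for source, destination, anchor, yyyymm in links:
        anchor = " ".join(anchor.lower().split())
        if not anchor or any(term in anchor.split() for term in SPAM_TERMS):
            continue  # drop empty and spam anchors
        key = yyyymm[:4] if granularity == "year" else yyyymm
        # A set keeps one copy of each (source, destination, anchor) per partition
        partitions.setdefault(key, set()).add(
            (normalize_url(source), normalize_url(destination), anchor)
        )
    return partitions


links = [
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "200912"),  # re-crawl duplicate
    ("http://www.cwi.nl", "http://spam.example", "Rolex watches", "200912"),
]
by_month = clean_and_partition(links)
by_year = clean_and_partition(links, granularity="year")
```

Using a set per partition means a link that was captured several times within the same month or year is counted once.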

Page 16: Temporal Anchor Text as Proxy for user Queries

Hosts Evolution

Important hosts over time

Aggregate links based on the target host

keep unique source hosts

Multiple pages from same host linking to the same

target host are counted as one

Rank hosts based on number of source hosts

linking to them
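
A minimal sketch of this host ranking, assuming links are available as (source URL, destination URL, anchor text) tuples for one time partition. `host_of` and `rank_target_hosts` are hypothetical helper names, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host part of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def rank_target_hosts(links):
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources_per_target = defaultdict(set)
    for source, destination, _anchor in links:
        src, dst = host_of(source), host_of(destination)
        if src and dst:
            # Many pages from one source host still count as a single source
            sources_per_target[dst].add(src)
    counts = {host: len(srcs) for host, srcs in sources_per_target.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


links = [
    ("http://www.cwi.nl/a", "http://www.nwo.nl/", "nwo"),
    ("http://www.cwi.nl/b", "http://www.nwo.nl/", "nwo"),
    ("http://www.kb.nl/", "http://www.nwo.nl/", "nwo"),
    ("http://www.kb.nl/", "http://www.cwi.nl/", "cwi"),
]
ranking = rank_target_hosts(links)
```

Here www.nwo.nl ranks first with two distinct source hosts, even though three pages link to it.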

Page 17: Temporal Anchor Text as Proxy for user Queries

% of new hosts over the years

% of hosts in 2012 that are new, i.e., not seen in

2009, 2010, or 2011

Page 18: Temporal Anchor Text as Proxy for user Queries

Anchor Text Evolution

Measure the importance of anchor text a over

time in time-partitioned links

Aggregate by anchor text

Compute the archive-based popularity

Normalize by Maximum
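
The slides do not define the archive-based popularity exactly. The sketch below assumes one plausible reading: the popularity of an anchor text within a partition is the number of distinct source hosts using it, divided by the maximum such count in that partition. Both the definition and the name `anchor_popularity` are assumptions.

```python
from collections import defaultdict


def anchor_popularity(links):
    """Max-normalized popularity per anchor text within one time partition.

    links: iterable of (source_host, target_host, anchor_text) tuples.
    Assumption: popularity = number of distinct source hosts using the anchor.
    """
    hosts_per_anchor = defaultdict(set)
    for source_host, _target_host, anchor in links:
        hosts_per_anchor[anchor].add(source_host)
    counts = {anchor: len(hosts) for anchor, hosts in hosts_per_anchor.items()}
    peak = max(counts.values(), default=0)
    if peak == 0:
        return {}
    # Normalize by the maximum so scores are comparable across partitions
    return {anchor: count / peak for anchor, count in counts.items()}


links = [
    ("cwi.nl", "nwo.nl", "nwo"),
    ("kb.nl", "nwo.nl", "nwo"),
    ("cwi.nl", "kb.nl", "kb"),
]
scores = anchor_popularity(links)
```

Running this per yearly or monthly partition gives one score per anchor text per period, which can then be compared over time.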

Page 19: Temporal Anchor Text as Proxy for user Queries

% new anchor text over years

Anchor text is new in a specific partition if it does

not appear in the previous partitions

Based on one-year granularity

59% new anchor text

Based on one-month granularity

34% new anchor text
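
The notion of "new" used here can be sketched as below, assuming each partition is given as a set of anchor texts keyed by a sortable label such as '2009' or '200912'. `percent_new` is a hypothetical name, and the first partition is trivially 100% new under this definition.

```python
def percent_new(partitions):
    """Percentage of items per partition that appear in no earlier partition.

    partitions: dict mapping sortable keys (e.g. '2009' or '200912')
    to sets of anchor texts.
    """
    seen = set()
    result = {}
    for key in sorted(partitions):
        items = partitions[key]
        new_items = items - seen
        result[key] = 100.0 * len(new_items) / len(items) if items else 0.0
        seen |= items
    return result


by_year = {
    "2009": {"nwo", "kb"},
    "2010": {"nwo", "cwi"},
    "2011": {"nwo", "kb", "cwi", "twitter"},
}
new_share = percent_new(by_year)
```

The same function applies unchanged to hosts instead of anchor texts.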

Page 20: Temporal Anchor Text as Proxy for user Queries

WikiStats

Aggregated page views of Wikipedia (WP) pages

From Jan 2008 to Jan 2015

We focus on

Feb 2009 to Dec 2012

Similar to the period of our snapshot of the Dutch

Web archive

Keep WP titles viewed >= 1,000 times

Page 21: Temporal Anchor Text as Proxy for user Queries

Matching anchor text to WP titles

Pre-process WP titles like the anchor text

Lowercase

Stop-word removal

One-year and one-month granularity partitions

Collect titles by exact match with the anchors

Assume anchor popularity equals WP page

popularity
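
The matching procedure could be sketched as follows. The stop-word list, the replacement of underscores in titles, and the view counts in the example are made up for illustration; only the lowercase and stop-word steps, the exact match, and the 1,000-view threshold come from the slides.

```python
# Hypothetical, illustrative stop-word list (not the one used in the study)
STOP_WORDS = {"de", "het", "een", "van", "the", "of"}


def preprocess(text):
    """Lowercase and remove stop words; underscores are treated as spaces (assumption)."""
    tokens = text.lower().replace("_", " ").split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)


def match_anchors(anchors, wp_views, min_views=1000):
    """Map each anchor text to the view count of the WP title it matches exactly."""
    titles = {}
    for title, views in wp_views.items():
        if views >= min_views:
            key = preprocess(title)
            titles[key] = max(titles.get(key, 0), views)
    matched = {}
    for anchor in anchors:
        key = preprocess(anchor)
        if key and key in titles:
            matched[anchor] = titles[key]  # anchor popularity := WP page popularity
    return matched


# Made-up view counts, for illustration only
wp_views = {"Amsterdam": 50000, "De_Volkskrant": 12000, "Filmpje!": 3000, "Obscure_page": 10}
matched = match_anchors(["Amsterdam", "de Volkskrant", "filmpje"], wp_views)
```

In this example the anchor text filmpje stays unmatched because the title carries a trailing exclamation mark, which is the exact-match limitation noted later in the deck.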

Page 22: Temporal Anchor Text as Proxy for user Queries

Ranked anchor text with WP match

Different rank cut-off

% overlap decreases as the cut-off increases

~56% of the top-1k anchor texts have a match
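
The overlap at different rank cut-offs can be computed as in this small sketch, assuming a list of anchor texts ranked by popularity and the set of anchor texts that matched a WP title. `overlap_at_cutoffs` is a hypothetical name, and the toy data does not reproduce the reported numbers.

```python
def overlap_at_cutoffs(ranked_anchors, matched, cutoffs):
    """Percentage of the top-k ranked anchor texts that have a WP title match."""
    result = {}
    for k in cutoffs:
        top = ranked_anchors[:k]
        hits = sum(1 for anchor in top if anchor in matched)
        result[k] = 100.0 * hits / len(top) if top else 0.0
    return result


ranked = ["amsterdam", "twitter", "klik hier", "utrecht"]
matched = {"amsterdam", "twitter", "utrecht"}
overlap = overlap_at_cutoffs(ranked, matched, cutoffs=[2, 4])
```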

Page 23: Temporal Anchor Text as Proxy for user Queries

Examples of popular anchor text (with match)

Major cities in the Netherlands

E.g., Amsterdam, Rotterdam, Groningen, and Utrecht

Social web sites

E.g., twitter, linkedin, flickr, and vimeo

Major Dutch daily newspapers

E.g., de Volkskrant, Telegraaf, and Trouw

Dutch public broadcasting

uitzending gemist

Government web service

E.g., belastingdienst

Page 24: Temporal Anchor Text as Proxy for user Queries

Discussion

Our original goal was to identify historically

trending events from the link evolution

recorded in the archive

Unfortunately, we found only a few examples

with our current analysis

E.g., ‘canon’ *

However, important anchor text provides an

overview of important Dutch entities

* corresponding to an activity initiated by the government to define

the canonical historic events in Dutch history

Page 25: Temporal Anchor Text as Proxy for user Queries

Limitations & Future Work

Exact text matching between anchor text and

WP title

E.g., the anchor text ‘filmpje’ does not match the WP title ‘filmpje!’

Additional pre-processing

Stemming, stopping, generalize from exact match to

match with low edit distance

Our analysis is based on a depth-first crawl of

a few thousand Dutch websites

Breadth-first crawl such as [CommonCrawl]
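
The proposed generalization from exact match to a match with low edit distance could be prototyped as below. This is a sketch of future work, not something the study implemented: it uses the standard Levenshtein distance, and the threshold `max_distance=1` and the name `fuzzy_match` are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Return the titles within max_distance edits of the anchor text."""
    return [title for title in titles if edit_distance(anchor, title) <= max_distance]


candidates = fuzzy_match("filmpje", ["filmpje!", "film"])
```

With a threshold of one edit, the anchor text filmpje now matches the title filmpje! while the shorter title film is still rejected. Comparing every anchor against every title is quadratic, so a real implementation would likely need blocking or indexing.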

Page 26: Temporal Anchor Text as Proxy for user Queries

References

[Masanès06] J. Masanès. Web Archiving. Springer, 2006

[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai.

Title language model for information retrieval. In SIGIR 2002

[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of

anchor text for web search. In SIGIR 2003

[CommonCrawl] https://commoncrawl.org/

[WikiStats] http://wikistats.ins.cwi.nl/
