Temporal Anchor Text as Proxy for User Queries
Thaer Samar, Arjen P. de Vries
Web Archiving 1/2
The Web is a major source of published
information
Content on the Web evolves and changes
continuously
Many initiatives aim to archive the Web
Petabytes of archived data
Web Archiving 2/2
Web archives are incomplete
Impossible to include all Web pages due to
crawling limitations e.g., [Masanès06]
Depth-first crawl, focus only on selected web sites
Breadth-first crawl, focus on the entire domain,
but not in depth
Reconstruct Queries
Our study: evolution of anchor text over time
to reconstruct what was important in the past
Information that would be similar to user queries
Inspiration:
Document titles can be used as an approximation
of user queries [Jin et al.]
Anchor text exhibits characteristics similar to user
query and document title [Eiron & McCurley]
Queries in the Past
User queries have usually not been preserved
Impossible to reconstruct which queries the
user would have used to search the archive
However, web archives contain more than the
Web page content
E.g., page source, different timestamps (archive
date, last-modified date), link structure
Link Evidence and Anchor Text
Link information represents the source URL, destination URL, and the anchor text
Anchor text is a short text describing the destination page
Has been shown to improve search effectiveness in a large number of Information Retrieval studies
Example link: source http://www.cwi.nl, destination http://www.nwo.nl, anchor text ‘NWO’
Data: Dutch Web Archive
National Library of the Netherlands (KB)
Depth-first (selective) Web archive
Since 2007
10+ TB
8,000+ websites
Our snapshot
2009-2012
Link Processing
Filtering text/html pages
~70% of archived objects
Web Archive Record (example):
URL: http://www.cwi.nl
Archive-Date: 20091201
Content-Type: text/html
<html>
<a href=http://www.nwo.nl> NWO </a>
</html>
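The extraction of one link from the example record above can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the pipeline used in the study; the `record` dictionary and its field names (`url`, `archive_date`, `content_type`, `body`) are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collect (destination URL, anchor text) pairs from one HTML page."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.links = []    # finished (destination, anchor text) pairs
        self._href = None  # href of the <a> element currently open, if any
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the source URL.
                self._href = urljoin(self.source_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(record):
    """Return (source, destination, anchor, YYYYMM) tuples for a text/html record."""
    if record["content_type"] != "text/html":
        return []  # filtering step: only text/html pages are processed
    parser = AnchorExtractor(record["url"])
    parser.feed(record["body"])
    parser.close()
    month = record["archive_date"][:6]  # 20091201 -> 200912
    return [(record["url"], dest, anchor, month) for dest, anchor in parser.links]


# Hypothetical in-memory version of the record shown on the slide.
record = {
    "url": "http://www.cwi.nl",
    "archive_date": "20091201",
    "content_type": "text/html",
    "body": "<html><a href=http://www.nwo.nl> NWO </a></html>",
}
links = extract_links(record)
```

For this record, `links` holds a single tuple with the source URL, the destination URL, the anchor text `NWO`, and the archive month `200912`.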
Link Processing
Filtering: pages of type text/html (~70% of archived objects)
Extraction: source URL, destination URL, anchor text, crawl date (YYYYMM)
Cleaning: URL normalization; get host of the source and the destination; clean spam, e.g., rolex watches
Partitioning: based on one-year and one-month granularity
Deduplication: remove duplicate links caused by crawling frequency (same source, destination, and anchor text)
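The cleaning, partitioning, and deduplication steps could look roughly like the sketch below. The normalization rules, the lowercasing of anchor text, and the one-word spam list are assumptions chosen for illustration; the slides do not specify the exact rules used.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

SPAM_TERMS = {"rolex"}  # assumption: a tiny illustrative blocklist


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def clean(link):
    """Return a normalized link, or None if the anchor text is empty or spam."""
    source, dest, anchor, month = link
    anchor = " ".join(anchor.lower().split())
    if not anchor or any(term in anchor.split() for term in SPAM_TERMS):
        return None
    return (normalize_url(source), normalize_url(dest), anchor, month)


def partition_and_dedup(links, granularity="year"):
    """Group cleaned links by year (YYYY) or month (YYYYMM) and drop duplicates.

    Two links are duplicates when source, destination, and anchor text agree
    within the same partition; storing them in a set removes the repeats.
    """
    width = 4 if granularity == "year" else 6
    partitions = defaultdict(set)
    for link in links:
        cleaned = clean(link)
        if cleaned is None:
            continue
        source, dest, anchor, month = cleaned
        partitions[month[:width]].add((source, dest, anchor))
    return dict(partitions)


links = [
    ("http://www.cwi.nl/", "http://www.nwo.nl", " NWO ", "200912"),
    ("http://www.cwi.nl", "http://WWW.nwo.nl/", "nwo", "200912"),   # duplicate after cleaning
    ("http://www.cwi.nl", "http://example.org", "Rolex watches", "201001"),  # spam
    ("http://www.cwi.nl", "http://www.nwo.nl", "NWO", "201003"),
]
by_year = partition_and_dedup(links, "year")
by_month = partition_and_dedup(links, "month")
```

Here the first two links collapse into one entry for 2009, the spam link is dropped, and the last link forms its own 2010 partition.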
Hosts Evolution
Important hosts over time
Aggregate links by target host,
keeping unique source hosts
Multiple pages from same host linking to the same
target host are counted as one
Rank hosts based on number of source hosts
linking to them
% of new hosts over the years
E.g., % of new hosts in 2012: hosts that do not appear in 2009, 2010,
or 2011
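The host ranking and the share of new hosts can be sketched as below. The function names are hypothetical, and links are assumed to be deduplicated (source URL, destination URL, anchor text) triples for one partition.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Lowercased host part of a URL."""
    return (urlsplit(url).hostname or "").lower()


def rank_target_hosts(links):
    """Rank target hosts by the number of distinct source hosts linking to them.

    Many pages on one host linking to the same target host count once,
    because source hosts are collected in a set.
    """
    sources = defaultdict(set)
    for source_url, dest_url, _anchor in links:
        sources[host_of(dest_url)].add(host_of(source_url))
    ranked = ((target, len(hosts)) for target, hosts in sources.items())
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def percent_new(current, previous_partitions):
    """Percentage of items in `current` that appear in no earlier partition."""
    current = set(current)
    if not current:
        return 0.0
    seen = set().union(*previous_partitions)
    return 100.0 * len(current - seen) / len(current)


links = [
    ("http://a.nl/1", "http://x.nl/", "x"),
    ("http://a.nl/2", "http://x.nl/p", "x"),   # same source host as above
    ("http://b.nl/", "http://x.nl/", "x"),
    ("http://a.nl/", "http://y.nl/", "y"),
]
ranking = rank_target_hosts(links)
share = percent_new({"x.nl", "y.nl", "z.nl", "w.nl"}, [{"x.nl"}, {"y.nl"}])
```

The same `percent_new` helper applies to anchor text as well as hosts, since it only compares sets across partitions.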
Anchor Text Evolution
Measure the importance of anchor text a over
time in time-partitioned links
Aggregate by anchor text
Compute the archive-based popularity
Normalize by Maximum
% new anchor text over years
Anchor text is new in a specific partition if it does
not appear in the previous partitions
Based on one-year granularity
59% new anchor text
Based on one-month granularity
34% new anchor text
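One way to compute a max-normalized popularity per partition is sketched below. The slides do not define the archive-based popularity precisely, so this sketch assumes it is the number of deduplicated links carrying the anchor text in a partition; counting unique source hosts instead would be an equally plausible reading.

```python
from collections import Counter


def anchor_popularity(partition_links):
    """Map each anchor text to its link count divided by the largest count.

    `partition_links` is an iterable of (source, destination, anchor) triples
    from one time partition. Popularity as a raw link count is an assumption.
    """
    counts = Counter(anchor for _source, _dest, anchor in partition_links)
    if not counts:
        return {}
    top = max(counts.values())
    return {anchor: count / top for anchor, count in counts.items()}


links_2009 = {
    ("http://a.nl", "http://www.nwo.nl", "nwo"),
    ("http://b.nl", "http://www.nwo.nl", "nwo"),
    ("http://a.nl", "http://www.cwi.nl", "cwi"),
}
popularity = anchor_popularity(links_2009)
```

Normalizing by the maximum puts every partition on a 0 to 1 scale, so scores from partitions of different sizes can be compared.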
WikiStats
Views aggregation of Wikipedia (WP) pages
From Jan 2008 to Jan 2015
We focus on
Feb 2009 to Dec 2012
Similar to the period of our snapshot of the Dutch
Web archive
Keep WP titles viewed >= 1,000 times
Matching anchor text to WP titles
Pre-process WP titles like the anchor text
Lowercase
Stop-word removal
One-year and one-month granularity partitions
Collect titles by exact match with the anchors
Assume anchor popularity equals WP page
popularity
Ranked anchor text with WP match
Different rank cut-offs
% overlap decreases as the cut-off increases
~56% of the top-1k has a match
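The matching and the overlap at a given rank cut-off can be sketched as follows. The stop-word list is a small illustrative assumption, as is the replacement of underscores in WP titles by spaces; `wp_views` is a hypothetical mapping from title to view count within one partition.

```python
STOP_WORDS = {"de", "het", "een", "van", "the", "of"}  # assumption: illustrative list


def preprocess(text):
    """Lowercase, turn underscores into spaces, and remove stop-words."""
    tokens = text.lower().replace("_", " ").split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def overlap_at_cutoff(ranked_anchors, wp_views, cutoff, min_views=1000):
    """Percentage of the top-`cutoff` anchors that exactly match a WP title.

    Only titles viewed at least `min_views` times are kept, and titles are
    pre-processed in the same way as the anchor text before comparison.
    """
    titles = {preprocess(t) for t, views in wp_views.items() if views >= min_views}
    top = ranked_anchors[:cutoff]
    if not top:
        return 0.0
    matched = sum(1 for anchor in top if preprocess(anchor) in titles)
    return 100.0 * matched / len(top)


# Toy data, not taken from the archive or from WikiStats.
ranked = ["amsterdam", "twitter", "de volkskrant", "klik hier"]
wp_views = {"Amsterdam": 50000, "Twitter": 20000, "De_Volkskrant": 1500, "Klik_hier": 10}
top2 = overlap_at_cutoff(ranked, wp_views, cutoff=2)
top4 = overlap_at_cutoff(ranked, wp_views, cutoff=4)
```

In this toy data the overlap drops when the cut-off grows from 2 to 4, which mirrors the trend reported on the slide.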
Examples of popular anchor text (with match)
Major cities in the Netherlands
E.g., Amsterdam, Rotterdam, Groningen, and Utrecht
Social web sites
E.g., twitter, linkedin, flickr, and vimeo
Major Dutch daily newspapers
E.g., de Volkskrant, Telegraaf, and Trouw
Dutch public broadcasting
uitzending gemist
Government web service
E.g., belastingdienst
Discussion
Our original goal was to identify historically
trending events from the link evolution
recorded in the archive
Unfortunately, we found only a few examples
with our current analysis
E.g., ‘canon’ *
However, important anchor text provides an
overview of important Dutch entities
* Corresponding to an activity initiated by the government to define
the canonical historic events in Dutch history
Limitations & Future Work
Exact text matching between anchor text and
WP title
E.g., filmpje does not match WP title filmpje!
Additional pre-processing
Stemming, stopping, generalize from exact match to
match with low edit distance
Our analysis is based on a depth-first crawl of
a few thousand Dutch websites
Breadth-first crawl such as [CommonCrawl]
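The proposed generalization from exact match to low edit distance could be sketched with a standard Levenshtein distance, as below. The threshold of 1 and the function names are assumptions; this is future work on the slides, so no such matching was part of the reported analysis.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(anchor, titles, max_distance=1):
    """Return the titles within `max_distance` edits of the anchor text."""
    return [t for t in titles if edit_distance(anchor, t) <= max_distance]


matches = fuzzy_match("filmpje", ["filmpje!", "film"])
```

With a threshold of 1, the anchor `filmpje` matches the title `filmpje!` from the example above but not the shorter `film`. Comparing every anchor with every title is quadratic, so a real run would need some form of candidate filtering first.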
References
[Masanès06] J. Masanès. Web Archiving. Springer, 2006.
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR 2002.
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR 2003.
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/