“Information Retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.”
— Wikipedia, en.wikipedia.org/wiki/Information_retrieval
Collecting information needs (methods)
• Laboratory studies
• Online questionnaires
• Search log mining, e.g.:
  [03/02/2012 21:16:11] QUERY fcul
  [03/02/2012 21:16:19] CLICK RANK=1
Classes of web archive information needs
• Navigational (53% to 81%): seeing a web page in the past or how it evolved.
  – How was the home page of the Library of Congress in 1996?
• Informational (14% to 38%): collecting information about a topic written in the past.
  – How many people worked at the Library of Congress in 1996?
• Transactional (5% to 16%): downloading an old file or recovering a site from the past.
  – Make a copy of the site of the Library of Congress in 1996.
Conducted 2 surveys, in 2010 and 2014
• Questionnaires and public information
• List of Web archiving initiatives, Wikipedia
Data: a significant amount archived by web archiving initiatives (2014)
• >68 initiatives across 33 countries
• >534 billion web-archived files since 1996 (17 PB)
Access method: URL search
• Technology based on the Wayback Machine.
• URLs from the past that contain the desired information may be unknown to the users.
• The URLs for the same site change across time:
  <lcweb.loc.gov, 1996> vs. <www.loc.gov, 2015>
Access method: Full-text search
• Technology based on Lucene extensions (NutchWAX & Solr).
• Lucene was not designed for Web Archive Information Retrieval.
• Low relevance of search results.
How to obtain relevant information resources through full-text search?
Our challenge!
Web Archive Information Retrieval
A Full-Text Index is like a huge Book Glossary

Term ……….. Archived Web Pages (millions)
“Library” ……….. <www.loc.gov, 2009>, <www.bl.uk, 2008>, <www.nlm.nih.gov, 2010>, <www.ala.org, 2005>
“Congress” ……….. <lcweb.loc.gov, 2009>, <www.greencarcongress.com, 2011>, <www.loc.gov, 2009>, <eawop-congress.iscte.pt, 2003>
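The glossary analogy above is, concretely, an inverted index. A minimal sketch (the entries are the slide's own examples; a production index holds millions of postings per term):

```python
from collections import defaultdict

# Inverted index: each term maps to the archived pages (URL, year) containing it.
index = defaultdict(list)

def add_page(url, year, text):
    """Index one archived page version under every distinct term it contains."""
    for term in set(text.lower().split()):
        index[term].append((url, year))

# Entries mirroring the slide's glossary example
add_page("www.loc.gov", 2009, "Library of Congress")
add_page("www.bl.uk", 2008, "British Library")
```

Looking up `index["library"]` then returns every archived version, across sites and years, that contains the term.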
Information Retrieval
• Search for archived pages that contain given words.
• Millions of web pages contain the searched words.

Web Archive Information Retrieval
• Archived pages are spread across time.
Which archived pages should be presented on the top results?
How to rank the relevance of temporal search results?
Previous studies showed that temporal information:
• has been exploited to improve IR systems.
• can be extracted from web archives.

Hypothesis: improve WAIR systems by exploiting temporal information intrinsic to web archives through Machine Learning (Learning to Rank).
Ranking relevance in WAIR
Methodology: combine ranking features into a ranking model using learning-to-rank (L2R) methods

Ranking features
• f(q,d): frequency of query term q in the text of document d
• i(d): number of inlinks to document d
• …
• including temporal features

Test collection
• Documents
• Topics
• Relevance judgments

L2R method
• L2R algorithms
• automatically assigned weights for each ranking feature

Best ranking model
• r(f(q,d), i(d)) = 0.6 × f(q,d) + 0.4 × i(d)
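The weight-assignment step can be illustrated with a toy pairwise search over candidate weights. The training data below is invented for illustration, and real L2R algorithms (AdaRank, Random Forests, RankSVM) optimize far more cleverly than this exhaustive sketch:

```python
# Each training example: (f(q,d), i(d), relevance judgment). Invented numbers.
training = [(5.0, 8.0, 2), (4.0, 2.0, 1), (1.0, 1.0, 0), (2.0, 9.0, 2)]

def pairs_ordered_correctly(w_f, w_i):
    """Count document pairs that the weighted score ranks consistently
    with their relevance judgments (the pairwise L2R objective)."""
    ok = 0
    for a in training:
        for b in training:
            if a[2] > b[2]:  # a judged more relevant than b
                ok += (w_f * a[0] + w_i * a[1]) > (w_f * b[0] + w_i * b[1])
    return ok

# Grid search over weight splits w_f + w_i = 1, keeping the best pair
best = max(((w, round(1 - w, 1)) for w in [i / 10 for i in range(11)]),
           key=lambda p: pairs_ordered_correctly(*p))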
Created a test collection through the Cranfield paradigm
• Documents: 255 million documents (8.9 TB)
  – 6 archived-web collections: broad crawls, selective crawls, integrated collections
  – 14-year time span (1996-2010)
• Topics: 50 navigational needs (with date range)
  – e.g. the page of the Publico newspaper before 2000
• Relevance judgments: 3 judges, 3-level scale of relevance, 267,822 versions assessed
• 5-fold cross-validation: 3 folds for training, 1 for validation, 1 for testing
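The 5-fold protocol above (3 folds for training, 1 for validation, 1 for testing) can be sketched as follows; the round-robin fold assignment is an assumption, not the thesis's exact split:

```python
def five_fold_splits(topics):
    """Yield (train, validation, test) topic lists for each of the 5 rotations."""
    k = 5
    folds = [topics[i::k] for i in range(k)]  # round-robin assignment to 5 folds
    for t in range(k):
        test = folds[t]
        validation = folds[(t + 1) % k]
        train = [x for i, f in enumerate(folds)
                 if i not in (t, (t + 1) % k) for x in f]
        yield train, validation, test

splits = list(five_fold_splits(list(range(50))))  # 50 topics, as in the collection
```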
L2R method
• 3 L2R algorithms: AdaRank, Random Forests, RankSVM.
• 6 relevance metrics: NDCG@k and Precision@k, for k = 1, 5, 10.
• 68 ranking features:
  – IR ranking features (e.g. Lucene, BM25, TF-IDF)
  – new temporal features
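The two relevance metrics named above can be sketched as a minimal implementation of graded-relevance NDCG@k and binary Precision@k (not the thesis's evaluation code):

```python
import math

def precision_at_k(relevances, k):
    """Fraction of the top-k results that are relevant (grade > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k

def ndcg_at_k(relevances, k):
    """DCG of the top-k results, normalized by the DCG of an ideal ordering."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

ranking = [2, 0, 1]                        # graded relevance, in ranked order
p5 = precision_at_k(ranking + [0, 0], 5)   # 2 relevant results in the top 5 -> 0.4
```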
Temporal ranking features
Intuition:
Persistent documents are more relevant for navigational queries.
Lifespan is correlated to relevance: documents with higher relevance tend to have a longer lifespan.
[Figure: fraction of documents (%) with a lifespan longer than 1 year, by relevance level; 14 years of web snapshots analyzed.]
Number of versions is correlated to relevance: documents with higher relevance tend to have more versions.
[Figure: fraction of documents (%) with more than 10 versions, by relevance level; 14 years of web snapshots analyzed.]
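Both temporal features above reduce to simple statistics over a document's snapshot dates. A minimal sketch (hypothetical helper, not Arquivo.pt code):

```python
from datetime import date

def temporal_features(snapshot_dates):
    """Return (lifespan_in_days, number_of_versions) for one archived URL."""
    dates = sorted(snapshot_dates)
    lifespan = (dates[-1] - dates[0]).days  # first crawl to last crawl
    return lifespan, len(dates)

# e.g. a page crawled repeatedly over the 1996-2010 collection timespan
snaps = [date(1996, 7, 1), date(2001, 3, 15), date(2010, 12, 31)]
lifespan, versions = temporal_features(snaps)
```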
L2R with temporal features improves search relevance
Relevance metric | Lucene | NutchWAX | L2R without temporal features (Random Forests) | L2R with temporal features (Random Forests) | Precision gain of L2R with temporal features over Lucene
Precision@1      |  0.28  |   0.32   |  0.64  |  0.76  | 270%
Precision@5      |  0.16  |   0.24   |  0.39  |  0.40  | 250%
Precision@10     |  0.13  |   0.17   |  0.24  |  0.24  | 184%
Most important ranking features (initial selection)
We do not use them in Arquivo.pt yet. We use:
• 4 ranking features derived from the TREC collection.
• 4 fields (URL, title, text body and anchor text of incoming links).
Temporal ranking model
Intuition:
Several ranking models learned for specific periods of time are more effective than a single ranking model.
Why several ranking models for WAIR?
• The web presents different characteristics across time:
  – technology
  – structure and size of pages
  – structure and size of sites
  – language
  – web-graph structure
  – Web 1996 vs. Web 2015?
• Should we use a single model that fits all data?
  – No: [Kang & Kim 2003; Geng et al. 2008; Bian et al. 2010]
Several ranking models yield more relevant results than a single ranking model with temporal features
• Best results with 4 ranking models for 14 years
[Figure: NDCG@10 vs. number of ranking models (1, 2, 4, 7, 14) over the 14-year timespan of archived collections; 4 models improve NDCG@10 by +3.3% over a single model.]
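The temporal ranking model idea can be sketched as routing each document to the model trained for its time period. The period boundaries and weight values below are illustrative, not the thesis's learned values:

```python
# One ranking model per time period instead of a single model for 1996-2010.
PERIODS = [(1996, 1999), (2000, 2003), (2004, 2007), (2008, 2010)]  # 4 models

def period_of(year):
    """Route a document to the model trained for its crawl period."""
    for start, end in PERIODS:
        if start <= year <= end:
            return (start, end)
    raise ValueError(f"year {year} outside archived timespan")

# One weight vector per period, e.g. produced by an L2R algorithm per period
models = {p: {"term_freq": 0.6, "inlinks": 0.4} for p in PERIODS}

def score(features, year):
    """Score a document with the model of its own period."""
    w = models[period_of(year)]
    return sum(w[name] * value for name, value in features.items())

s = score({"term_freq": 3.0, "inlinks": 10.0}, year=1998)  # uses the 1996-1999 model
```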
Conclusions
1. Web archive users
   – First navigational needs, then informational needs.
   – Search as in web search engines.
   – Prefer full-text search and older documents.
2. WAIR systems
   – WAIR must be optimized for the specific needs of web archive users.
   – Some needs are not supported by current Information Retrieval technology.
3. Improving WAIR systems
   – Relevant documents tend to have a longer lifespan and more versions.
   – WAIR systems can be improved by exploiting ranking features intrinsic to web archives using Machine Learning (L2R).
   – Ranking models per time period outperform a single ranking model.
Future (hard) work
• Feature selection
  – We do not have our best ranking function in production.
  – Must choose features that are feasible to implement considering the required resources (CPU, memory, processing time, impact on response speed).
• Create a test collection for informational needs
  – Who was the president of the USA in 2007?
• Image search
  – A “must have” for users.
• We need your collaboration!
All our resources are free open source
• All code available under the LGPL license:
  https://code.google.com/p/pwa-technologies/
• Test collection to support evaluation:
  https://code.google.com/p/pwa-technologies/wiki/TestCollection
• L2R dataset for WAIR research:
  http://code.google.com/p/pwa-technologies/wiki/L2R4WAIR
• Search service since 2010:
  http://archive.pt
• OpenSearch API:
  http://code.google.com/p/pwa-technologies/wiki/OpenSearch
• Google Code: Pwa_technologies will migrate to GitHub: FullPast project.
Publications• A Survey on Web Archiving Initiatives, Gomes et al., 2011.
• Understanding the Information Needs of Web Archive Users, Costa & Silva, 2010.
• Characterizing Search Behavior in Web Archives, Costa & Silva, 2011.
• A Search Log Analysis of a Portuguese Web Search Engine, Costa & Silva, 2010.
• Evaluating Web Archive Search Systems, Costa & Silva, 2012.
• Towards Information Retrieval Evaluation over Web Archives, Costa & Silva, 2009.
• Learning Temporal-Dependent Ranking Models, Costa et al., 2014.
• Creating a Billion-Scale Searchable Web Archive, Gomes et al., 2013.
• Information Search in Web Archives, Miguel Costa, PhD Thesis, 2014.