ALEXANDRIA -
Analysing and Exploring Web Archives
Elena Demidova
L3S Research Center, Hannover
RESAW Seminar, London, December 2014
This presentation contains contributions from Helge Holzmann, Avishek
Anand, Mohammad Alrifai, Thomas Risse and Wolfgang Nejdl.
http://alexandria-project.eu
The (incomplete) World of Web Archives
Web Archives are an important part of our culture
… but they are underused.
Recently an increasing interest can be observed
Web Archive initiative
• Historians
• Social Sciences
• Journalism
• Law
Web Archives
… are interesting temporal collections
• good documentation of entities, topics, events, societies, etc.
• direct access to official communications, news publications and the public view
… for many disciplines
• Large-scale validation of hypotheses
• Derivation and development of new hypotheses and theories
… but raise many challenges
• huge amounts of unstructured data
• incomplete (e.g. many dead links)
• incoherent (e.g. linked pages are crawled at different times)
• challenging syntax (e.g. erroneous HTML) and semantics (e.g. endless and senseless sentences)
• restricted usage (outside of Alexandria)
What are the User Needs?
On a high level
• Searching
• Browsing
• Analysis of “something”
• Visualization
But a “one size fits all” approach is not possible
Every research discipline has different needs
We need to learn what they are by
• Developing basic technologies for
temporal access and analytics
• Providing initial tools to foster the discussion
Top-100 most controversial Wikipedia
articles in English and German
The Alexandria Project
Motivation
Optimal access to Web archives requires new models and algorithms for retrieval, exploration, and analytics, taking into account
• the temporal dimension of Web archives
• structured semantic information available on the Web
• social media and network information
Objectives
• Evolution-Aware Entity-Based Enrichment and Indexing
• Aggregating Social Networks and Streams
• Temporal Retrieval and Ranking
• Collaborative Exploration and Analytics
Project runtime: 03/2014 – 02/2019
The Alexandria Project
[Figure: project architecture. Labelled components: Web; Social Networks & Streams; Linked Open Data Cloud; Aggregation & Time-Aware Indexing; Web Archive & Index (t1 … tnow); Entity Linking; Entity Resolution & Evolution; Time-Aware Entity Graph (t1 … tnow); Enrichment; Improvement; Time- and Entity-Based Retrieval (complex query); Collaborative Exploration & Analytics]
The Research Environment
Datasets
• German Web Crawl of the Internet Archive
• German Academic Web snapshots from L3S
• Twitter collection from L3S
• Wikipedia
• Various Linked Data sources like DBpedia, Freebase, YAGO
Technical Infrastructure
• Apache Hadoop implementing MapReduce for parallel processing of large data sets.
• Apache HBase - the Hadoop database, a distributed, scalable, big data store.
• Apache Pig - a platform for analysing large data sets
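The MapReduce model named above can be illustrated without a cluster. The following is a minimal single-process sketch in plain Python, not Hadoop code: the record layout (URL, capture year), the function names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def map_capture(record):
    """Map step: emit (year, 1) for one capture record (url, year)."""
    _url, year = record
    yield year, 1

def reduce_counts(year, values):
    """Reduce step: sum all counts emitted for one year."""
    return year, sum(values)

def run_mapreduce(records):
    """Simulate map, shuffle (group by key) and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_capture(record):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())

# Hypothetical capture records, not taken from the dataset
captures = [("a.example.de/", 1997), ("a.example.de/", 2000), ("b.example.de/", 2000)]
captures_per_year = run_mapreduce(captures)
```

On a real cluster the grouping step is done by the framework and the map and reduce functions run in parallel on separate machines.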
Web Archives as Data Set
Enables the study of long-term changes and evolution
[Figure: snapshots labelled 1997, 2000, 2006, 2009 and 2014]
Today: First Alexandria Studies
• Quantitative analysis of the German Internet Archive Dataset
• WikiTimes: Temporal event-centric database based on Wikipedia
• Entity name evolution on Wikipedia
The Dawn of Today’s Popular Domains
A Study on 18 Years of the Archived German Web
Helge Holzmann, Wolfgang Nejdl, Avishek Anand. In press.
How does the Web change and evolve?
• Is the Web really growing old and if so, how can we
characterize it?
• How has the size of web pages changed over time?
• Do websites from different categories (like business,
universities and technology) have different growth rates?
The German Internet Archive Dataset
18 years (1996 – 2013) of web data from the German (.de) domain
courtesy The Internet Archive
Approx. 80TB data (w/o duplicates) + CDX Index
11,174,079 domains spanning > 2 years
The Popular German Web
Subset: 100 most popular German (.de) domains from 17 categories of
Alexa rankings
Domain Emergence
How many of today’s popular domains were known in a certain year?
URL Age Evolution
The average age of the Web is increasing almost linearly
Mostly caused by long-living pages (e.g. in the Universities category)
URL Age Distribution (Normalized)
Normalized by the number of URLs in a year
Almost 70% of URLs are younger than a year at any time
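The age figures on these slides can be sketched in a few lines. The example assumes that each URL is described by the years of its first and last capture and that a URL's age in a given year is the number of years since its first capture. This definition, the function names and the sample data are illustrative assumptions, not taken from the study.

```python
def ages_in_year(lifespans, year):
    """Ages (in years) of all URLs alive in `year`.

    `lifespans` maps URL -> (first_capture_year, last_capture_year).
    """
    return [year - first for first, last in lifespans.values()
            if first <= year <= last]

def average_age(lifespans, year):
    """Mean age of the URLs alive in `year`, or 0.0 if none are alive."""
    ages = ages_in_year(lifespans, year)
    return sum(ages) / len(ages) if ages else 0.0

def share_younger_than_one_year(lifespans, year):
    """Fraction of alive URLs first captured in `year`,
    normalized by the number of URLs alive in that year."""
    ages = ages_in_year(lifespans, year)
    return sum(1 for age in ages if age < 1) / len(ages) if ages else 0.0

# Hypothetical lifespans, for illustration only
lifespans = {
    "uni.example.de/": (1996, 2013),
    "shop.example.de/": (2010, 2013),
    "old.example.de/page": (1998, 2001),
    "news.example.de/a": (2013, 2013),
}
```

A few long-lived URLs are enough to pull the average up even while most alive URLs are new, which is the pattern the two slides describe.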
Evolution of the Web's URL Volume
Growth = # Born - # Died
# Died stays rather constant, while # Born is growing
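The growth formula above can be sketched as follows. The example assumes a URL is "born" in the year of its first capture and "died" in the year of its last capture, unless that capture falls in the final year of the dataset, in which case it counts as still alive. These definitions and all names are assumptions for illustration; the study may define them differently.

```python
from collections import Counter

def yearly_growth(lifespans, final_year):
    """Return {year: born - died} for every year up to `final_year`.

    `lifespans` maps URL -> (first_capture_year, last_capture_year).
    """
    born = Counter(first for first, _last in lifespans.values())
    # URLs last seen in the final year are treated as still alive
    died = Counter(last for _first, last in lifespans.values()
                   if last < final_year)
    first_year = min(born)
    return {year: born[year] - died[year]
            for year in range(first_year, final_year + 1)}

# Hypothetical lifespans, for illustration only
lifespans = {
    "a.example.de/": (1996, 2013),
    "b.example.de/": (1998, 2001),
    "c.example.de/": (2001, 2013),
}
growth = yearly_growth(lifespans, final_year=2013)
```

Summing the yearly growth gives the number of URLs still alive at the end of the observed period.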
URL Size Evolution of Alive Sites
Linear growth of size over the years (content and/or markup)
Newborn URLs contribute a major share of the growth in size
Web Dynamics Conclusions
• Extensive longitudinal study on 18 years of the popular German Web
• Analyzed how popular domains of today have grown
  • In terms of age / volume / size
• Popular educational domains have been around for a very long time
• Shopping and game websites mainly emerged during the last decade
• The Web is actually getting older
  • at least the old part of it
• Domains grow exponentially
  • doubling their volume every two years
• Tomorrow’s newborn URLs will be bigger than today’s
  • resource planning and allocation, e.g., for Web archives
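The doubling claim in the conclusions can be written as a growth law. This is a worked reading of the slide, assuming a constant doubling period of two years; the symbols $V(t)$ (volume of a domain after $t$ years) and $V_0$ (volume at the reference time) are introduced here for illustration.

```latex
V(t) = V_0 \cdot 2^{t/2}, \qquad
\frac{V(t+1)}{V(t)} = 2^{1/2} \approx 1.41, \qquad
\frac{V(18)}{V(0)} = 2^{9} = 512
```

Under that assumption a domain would grow by roughly 41% per year, and by a factor of 512 over an 18-year span.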
WikiTimes
A Knowledge Base of News Events with Daily Summaries
Giang Binh Tran and Mohammad Alrifai. 2014. Indexing and analyzing Wikipedia's
current events portal, the daily news summaries by the crowd. In Proceedings of
the WWW Companion '14.
Wikipedia Current Events Portal
• date
• daily updates
• category
• links to a story
• links to news
• locations
News Stories on Wikipedia
WikiTimes System
http://wikitimes.l3s.de
http://wikitimes.l3s.de/rdf/
WikiTimes Timelines
Next steps: creating event-centric collections from the Web
• WikiTimes contains about 900 stories from the past 15 years and is growing.
• We intend to create event-centric collections using WikiTimes story timelines and a focused Web crawler
• Interesting scenarios from research perspective(s)?
Insights into Entity Name Evolution
on Wikipedia
Helge Holzmann, Thomas Risse. Web Information Systems
Engineering – WISE 2014.
Introduction: Entity Evolution
• Entities evolve over time
  • Names
    • e.g., Leningrad to Saint Petersburg; Jorge Mario Bergoglio to Pope Francis
  • Relations
    • e.g., Brad Pitt – Jennifer Aniston to Angelina Jolie
  • Roles
    • e.g., Obama - from Senator to President; President - from Bush to Obama
• Challenges for Information Retrieval in Web Archives
  • String Matching
    • Names in queries do not match the name in texts anymore
  • Entity Search – Finding entities that have evolved over time
    • Evolution rarely covered in knowledge bases (DBpedia, YAGO, Freebase, etc.)
[Timeline: St. Petersburg → Petrograd (1914) → Leningrad (1924) → St. Petersburg (1991 until today)]
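The string-matching problem can be made concrete with the city from the timeline. The sketch below assumes a hand-built list of names and change years for one entity. The function names and the sample sentence are hypothetical, and this is not the project's retrieval method.

```python
import bisect

# Names in chronological order and the years in which the name changed
NAMES = ["St. Petersburg", "Petrograd", "Leningrad", "St. Petersburg"]
CHANGE_YEARS = [1914, 1924, 1991]

def name_in_year(year):
    """Return the name in use in `year` according to the timeline."""
    return NAMES[bisect.bisect_right(CHANGE_YEARS, year)]

def matches(query, text):
    """Match if the text mentions any name the queried entity has carried."""
    variants = set(NAMES) if query in NAMES else {query}
    return any(variant in text for variant in variants)

# Hypothetical sentence from an archived page
archived_sentence = "The conference took place in Leningrad."
plain_hit = "St. Petersburg" in archived_sentence
expanded_hit = matches("St. Petersburg", archived_sentence)
```

Plain substring search misses the archived sentence, while expanding the query with the entity's former names finds it. Building such name timelines automatically is what the following analysis works towards.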
Handling of Name Changes on Wikipedia
• According to Wikipedia guidelines:
  • In case of a name change, new article, redirect from former name
  • Reason is missing, only explanation in revision history (sometimes)
  • Former names should be mentioned at the beginning of an article
• E.g., Thamesdown → Swindon (1997)
  • “On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited.”
Research Questions:
• Do excerpts of limited length exist in articles that are dedicated to name changes?
• How many sentences do excerpts span that cover former name, new name and date of change?
Analysis
• List pages provide semi-structured information
  • e.g., List of city name changes: “Edo → Tokyo (1868)”
  • Served as a starting point for our analysis
• Dataset
  • 19 semi-structured seed lists (9 redundant) → 10 remaining (only geo names):
    • Geographical renaming
    • List of city name changes
    • List of administrative division name changes
    • 7 lists dedicated to renamings of cities in certain countries
  • 1,926 distinct entities
  • 2,852 name changes
  • 2,782 articles (of 1,898 entities)
  • 766 entities with names resolvable to multiple articles
  • 28 entities could not be resolved
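A list entry in the form quoted above, “Edo → Tokyo (1868)”, can be parsed with a regular expression. This is a minimal sketch that assumes every entry has exactly one arrow and a four-digit year in parentheses. Real list pages vary in layout, and the pattern and function name are illustrative only.

```python
import re

# old name, arrow, new name, four-digit year in parentheses
ENTRY = re.compile(
    r"^\s*(?P<old>.+?)\s*→\s*(?P<new>.+?)\s*\((?P<year>\d{4})\)\s*$"
)

def parse_entry(line):
    """Return (old_name, new_name, year), or None if the line does not fit."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return match.group("old"), match.group("new"), int(match.group("year"))

edo = parse_entry("Edo → Tokyo (1868)")
swindon = parse_entry("Thamesdown → Swindon (1997)")
```

Chains with several arrows, such as the Plovdiv example, would need to be split into individual changes before this pattern applies.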
1. From evolution lists to excerpts
• Most changes: 11 (Plovdiv, Bulgaria)
  • Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv
• On average: 1.48 changes, 2.39 different names
Results
• 696 entities remaining with 918 name changes annotated with dates
2. Analyzing excerpts
• 572 complete name changes are mentioned in articles (preceding name, succeeding name and date)
  • 62.3% of the 918 considered name changes
• Sentence distance
Results
• More than 85% of the 572 considered changes are completely mentioned in excerpts with 10 sentences or fewer
• More than two-thirds of the found changes have a sentence distance of less than 3 (excerpts spanning 3 sentences or fewer)
→ There are (short) passages in Wikipedia articles dedicated to describing an entity’s evolution!
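One way to compute a sentence distance is sketched below. It reads the distance as the number of sentence boundaries in the shortest run of consecutive sentences that mentions the former name, the new name and the date, so an excerpt spanning 3 sentences has distance 2. That reading, the naive sentence splitter and the function names are assumptions, not the authors' implementation.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentence_distance(sentences, terms):
    """Distance between first and last sentence of the shortest run of
    consecutive sentences mentioning every term; None if there is none."""
    best = None
    for start in range(len(sentences)):
        missing = set(terms)
        for end in range(start, len(sentences)):
            missing = {t for t in missing if t not in sentences[end]}
            if not missing:
                if best is None or end - start < best:
                    best = end - start
                break
    return best

# Excerpt quoted on the Thamesdown → Swindon slide
excerpt = (
    "On 1 April 1997 it was made administratively independent of Wiltshire "
    "County Council, with its council becoming a new unitary authority. "
    "It adopted the name Swindon on 24 April 1997. "
    "The former Thamesdown name and logo are still used by the main local "
    "bus company of Swindon, called Thamesdown Transport Limited."
)
sentences = split_sentences(excerpt)
distance = sentence_distance(sentences, ["Thamesdown", "Swindon", "1997"])
```

For this excerpt the second and third sentences together cover all three items, giving a distance of 1.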
Generalization
• So far results only for geographic entities
• Manually parsed “List of renamed products” (unstructured)
  • Same analysis
• Similar results
  • 80% of name changes reported in articles (vs. 62.3%)
  • 91.7% span 10 sentences or fewer (vs. 85.3%)
  • Again more than two-thirds of the changes have a sentence distance < 3
    • 79.7% vs. 66.7%
Generalization: Evolution Base
• Assumption: Similar text patterns
  • Enables automatic classification / detection of evolutions
• Demo @ Digital Libraries 2014
  • http://evobase.l3s.de/DL2014_demo
http://evobase.l3s.de/DL2014_demo/Accenture
Conclusion
We are making the first steps to, e.g.:
• Analyse the content of the archives
• Develop the algorithms to get better entity- and event-centric access
• Create knowledge bases and collections to support these algorithms
A long way to go…
Questions?
Thank you
for your attention!
http://alexandria-project.eu