ALEXANDRIA - Analysing and Exploring Web Archives

Elena Demidova
L3S Research Center, Hannover

RESAW Seminar, London, December 2014

This presentation contains contributions from Helge Holzmann, Avishek Anand, Mohammad Alrifai, Thomas Risse and Wolfgang Nejdl.

http://alexandria-project.eu


The (incomplete) World of Web Archives

Web Archives are an important part of our culture

… but they are underused.

Recently an increasing interest can be observed

Web Archive initiative

• Historians
• Social Sciences
• Journalism
• Law


Web Archives

… are interesting temporal collections

• good documentation of entities, topics, events, societies, etc.
• direct access to official communications, news publications and the public view

… for many disciplines

• Large-scale validation of hypotheses
• Derivation and development of new hypotheses and theories

… but raise many challenges

• huge amounts of unstructured data
• incomplete (e.g. many dead links)
• incoherent (e.g. linked pages are crawled at different times)
• challenging syntax (e.g. erroneous HTML) and semantics (e.g. endless and senseless sentences)
• restricted usage (outside of Alexandria)


What are the User Needs?

On a high level

• Searching

• Browsing

• Analysis of “something”

• Visualization

But a “one size fits all” approach is not possible

Every research discipline has different needs

We need to learn what they are by

• Developing basic technologies for temporal access and analytics

• Providing initial tools to foster the discussion

[Figure: Top-100 most controversial Wikipedia articles in English and German]


The Alexandria Project

Motivation

Optimal access to Web archives requires new models and algorithms for retrieval, exploration, and analytics, taking into account

• the temporal dimension of Web archives

• structured semantic information available on the Web

• social media and network information

Objectives

• Evolution-Aware Entity-Based Enrichment and Indexing

• Aggregating Social Networks and Streams

• Temporal Retrieval and Ranking

• Collaborative Exploration and Analytics

Project runtime: 03/2014 – 02/2019


The Alexandria Project

[Figure: Alexandria architecture overview. Labelled components: Web; Social Networks & Streams; Linked Open Data Cloud; Web Archive & Index; Aggregation & Time-Aware Indexing; Entity Resolution & Evolution; Time-Aware Entity Graph; Entity Linking; Enrichment; Improvement; Time- and Entity-Based Retrieval (complex query); Collaborative Exploration & Analytics.]


The Research Environment

Datasets

• German Web Crawl of the Internet Archive

• German Academic Web snapshots from L3S

• Twitter collection from L3S

• Wikipedia

• Various Linked Data sources like DBpedia, Freebase, YAGO

Technical Infrastructure

• Apache Hadoop, implementing MapReduce for parallel processing of large data sets

• Apache HBase, the Hadoop database: a distributed, scalable big data store

• Apache Pig, a platform for analysing large data sets


Web Archives as Data Set

Enable the study of long-term changes and evolution

[Figure: archived snapshots labelled 1997, 2000, 2006, 2009 and 2014]


Today: First Alexandria Studies

• Quantitative analysis of the German Internet Archive Dataset

• WikiTimes: Temporal event-centric database based on Wikipedia

• Entity name evolution on Wikipedia


The Dawn of Today’s Popular Domains

A Study on 18 Years of the Archived German Web

Helge Holzmann, Wolfgang Nejdl, Avishek Anand. In press.


How does the Web change and evolve?

• Is the Web really growing old and if so, how can we characterize it?

• How has the size of web pages changed over time?

• Do websites from different categories (like business, universities and technology) have different growth rates?


The German Internet Archive Dataset

18 years (1996 – 2013) of web data from the German (.de) domain, courtesy of the Internet Archive

Approx. 80 TB of data (w/o duplicates) + CDX index

11,174,079 domains spanning > 2 years

The Popular German Web

Subset: 100 most popular German (.de) domains from 17 categories of Alexa rankings


Domain Emergence

How many of today's popular domains were known in a given year?
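A minimal sketch of how such an emergence curve could be computed, assuming the year of the first archived capture is already known for every domain. The function name `emergence_curve` and the toy data are illustrative and not taken from the study.

```python
def emergence_curve(first_seen, years):
    """Return {year: share of domains first captured in or before that year}."""
    total = len(first_seen)
    return {
        year: sum(1 for born in first_seen.values() if born <= year) / total
        for year in years
    }


# Hypothetical toy data (domain -> year of first capture), for illustration only.
first_seen = {"a.de": 1996, "b.de": 1999, "c.de": 1999, "d.de": 2005}
curve = emergence_curve(first_seen, range(1996, 2014))
```

Plotting `curve` over the years would give a cumulative line of the kind this slide asks about.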


URL Age Evolution

Average age of the Web is increasing almost linearly

Mostly caused by long-lived pages (e.g. in the Universities category)


URL Age Distribution (Normalized)

Normalized by the number of URLs in a year

Almost 70% of URLs are younger than a year at any time
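The normalization can be sketched as follows, assuming each URL alive in a given year is represented by the year of its first capture. The helper `age_distribution` and the sample birth years are hypothetical and only chosen so that the youngest bucket resembles the share mentioned above.

```python
from collections import Counter


def age_distribution(birth_years, year):
    """Share of URLs alive in `year`, grouped by age in whole years."""
    ages = Counter(year - born for born in birth_years if born <= year)
    total = sum(ages.values())
    return {age: count / total for age, count in sorted(ages.items())}


# Hypothetical birth years of ten URLs alive in 2010 (illustrative only).
births = [2010, 2010, 2010, 2010, 2010, 2010, 2010, 2009, 2008, 2003]
dist = age_distribution(births, 2010)
```

Dividing by the number of URLs in that year makes distributions from small early crawls and large recent crawls comparable.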


Evolution of the Web's URL Volume

Growth = # Born - # Died

# Died URLs is rather constant, # Born URLs is growing
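The growth formula can be sketched from capture lifespans. This assumes a URL is born in the year of its first capture and counted as died in the year after its last capture. That convention, the function name `yearly_growth` and the sample data are assumptions for illustration, not the study's definitions.

```python
def yearly_growth(lifespans, year):
    """Return (born, died, growth) for `year` from (first_seen, last_seen) pairs."""
    born = sum(1 for first, _ in lifespans if first == year)
    # Assumption: a URL last captured in year - 1 is counted as died in `year`.
    died = sum(1 for _, last in lifespans if last == year - 1)
    return born, died, born - died


# Hypothetical (first_seen, last_seen) capture years per URL.
lifespans = [(2000, 2005), (2004, 2004), (2005, 2010), (2005, 2013), (2005, 2005)]
born, died, growth = yearly_growth(lifespans, 2005)
```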


URL Size Evolution of Alive Sites

Linear growth of size over the years (content and/or markup)

A major share of the growth in size comes from newborn URLs


Web Dynamics Conclusions

• Extensive longitudinal study on 18 years of the popular German Web
• Analyzed how popular domains of today have grown
  • In terms of age / volume / size
• Popular educational domains have been around for a very long time
• Shopping and game websites mainly emerged during the last decade
• The Web is actually getting older
  • at least the old part of it
• Domains grow exponentially
  • doubling their volume every two years
• Tomorrow’s newborn URLs will be bigger than today’s
  • relevant for resource planning and allocation, e.g., for Web archives
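To make the doubling claim concrete, here is a small sketch assuming ideal exponential growth with a fixed two-year doubling period. The function `volume_after` is illustrative, and real domains will deviate from this idealized curve.

```python
def volume_after(years, initial=1.0, doubling_period=2.0):
    """Volume after `years`, assuming it doubles every `doubling_period` years."""
    return initial * 2 ** (years / doubling_period)


annual_factor = volume_after(1)     # square root of 2, roughly 41% growth per year
factor_18_years = volume_after(18)  # 2 ** 9, i.e. 512 times the initial volume
```

Under this assumption, storage planned for today's volume would be outgrown by a factor of four within four years.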


WikiTimes

A Knowledge Base of News Events with Daily Summaries

Giang Binh Tran and Mohammad Alrifai. 2014. Indexing and analyzing Wikipedia's current events portal, the daily news summaries by the crowd. In Proceedings of the WWW Companion '14.


Wikipedia Current Events Portal

• date
• daily updates
• category
• links to a story
• links to news
• locations


News Stories on Wikipedia


WikiTimes System

http://wikitimes.l3s.de

http://wikitimes.l3s.de/rdf/


WikiTimes Timelines

http://wikitimes.l3s.de

http://wikitimes.l3s.de/rdf/


Next steps: creating event-centric collections from the Web

• WikiTimes contains about 900 stories from the past 15 years and is growing.

• We intend to create event-centric collections using WikiTimes story timelines and a focused Web crawler

• Interesting scenarios from research perspective(s)?


Insights into Entity Name Evolution on Wikipedia

Helge Holzmann, Thomas Risse. Web Information Systems Engineering – WISE 2014.

Introduction: Entity Evolution

• Entities evolve over time
  • Names
    • e.g., Leningrad to Saint Petersburg; Jorge Mario Bergoglio to Pope Francis
  • Relations
    • e.g., Brad Pitt – Jennifer Aniston to Angelina Jolie
  • Roles
    • e.g., Obama - from Senator to President; President - from Bush to Obama

• Challenges for Information Retrieval in Web Archives
  • String Matching
    • Names in queries do not match the name in texts anymore
  • Entity Search – Finding entities that have evolved over time
  • Evolution rarely covered in knowledge bases (DBpedia, YAGO, Freebase, etc.)

[Timeline: St. Petersburg → Petrograd (1914) → Leningrad (1924) → St. Petersburg (1991, today)]
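The string-matching problem can be illustrated with a small time-aware lookup based on the St. Petersburg timeline on this slide. The data layout and the helper `names_in_range` are hypothetical; this is a sketch of query expansion with historical names, not the project's retrieval method.

```python
# Name periods for one entity, following the timeline on this slide.
# Each entry is (name, first_year, last_year); None means open-ended.
NAME_PERIODS = [
    ("St. Petersburg", None, 1914),
    ("Petrograd", 1914, 1924),
    ("Leningrad", 1924, 1991),
    ("St. Petersburg", 1991, None),
]


def names_in_range(periods, start, end):
    """Return the names in use at some point between `start` and `end` (inclusive)."""
    names = []
    for name, first, last in periods:
        begins_before_end = first is None or first <= end
        ends_after_start = last is None or last >= start
        if begins_before_end and ends_after_start and name not in names:
            names.append(name)
    return names


# A query about the 1950s should search for "Leningrad", not today's name.
names_1950s = names_in_range(NAME_PERIODS, 1950, 1960)
```

In the year of a change both the old and the new name are returned, which seems a reasonable default because texts from that year may use either.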

Handling of Name Changes on Wikipedia

• According to Wikipedia guidelines:
  • In case of a name change: new article, redirect from former name
  • Reason is missing, only explanation in revision history (sometimes)
  • Former names should be mentioned at the beginning of an article

• E.g., Thamesdown → Swindon (1997)
  • “On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited.”

Research Questions:

• Do excerpts of limited length exist in articles that are dedicated to name changes?
• How many sentences do excerpts span that cover former name, new name and date of change?


Handling of Name Changes on Wikipedia

• List pages provide semi-structured information
  • e.g., List of city name changes: “Edo → Tokyo (1868)”
• Served as a starting point for our analysis
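Entries in the form shown above lend themselves to simple pattern matching. The following sketch assumes every entry looks like "Old name → New name (year)". Real list pages vary in format, and the pattern and function name are illustrative rather than the parser used in the study.

```python
import re

# Matches entries of the form "Old name → New name (year)".
ENTRY = re.compile(
    r"^\s*(?P<old>.+?)\s*→\s*(?P<new>.+?)\s*\((?P<year>\d{3,4})\)\s*$"
)


def parse_name_change(line):
    """Return (old_name, new_name, year), or None if the line does not match."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return match["old"], match["new"], int(match["year"])


change = parse_name_change("Edo → Tokyo (1868)")
```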

Analysis

• Dataset
  • 19 semi-structured seed lists (9 redundant) → 10 remaining (only geo names):
    • Geographical renaming
    • List of city name changes
    • List of administrative division name changes
    • 7 lists dedicated to renamings of cities in certain countries
  • 1,926 distinct entities
  • 2,852 name changes
  • 2,782 articles (of 1,898 entities)
  • 766 entities with names resolvable to multiple articles
  • 28 entities could not be resolved

• Most changes: 11 (Plovdiv, Bulgaria)
  • Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv

• On average: 1.48 changes, 2.39 different names


1. From evolution lists to excerpts

Results

• 696 entities remaining with 918 name changes annotated with dates
• 572 complete name changes are mentioned in articles (preceding name, succeeding name and date)
  • 62.3% of the 918 considered name changes


2. Analyzing excerpts

• Sentence distance

Results

• More than 85% of the 572 considered changes are completely mentioned in excerpts of 10 sentences or fewer

• More than two-thirds of the found changes have a sentence distance of less than 3 (excerpts spanning 3 sentences or fewer)

→ There are (short) passages in Wikipedia articles dedicated to describing an entity’s evolution!
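A sketch of how a sentence distance could be measured, assuming it is the number of sentence boundaries between the first and last sentence of the shortest excerpt that mentions the former name, the new name and the year. Under that assumption a distance of 2 corresponds to a 3-sentence excerpt, in line with the parenthetical above. The naive sentence splitter and the function name are illustrative, not the study's implementation.

```python
import re


def sentence_distance(text, former, new, year):
    """Smallest sentence distance of an excerpt naming all three items.

    Distance 0 means one sentence mentions the former name, the new name
    and the year; None means no excerpt covers all three.
    """
    # Naive splitter (an assumption): break after ., ! or ? plus whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    targets = (former, new, str(year))
    best = None
    for start in range(len(sentences)):
        seen = set()
        for end in range(start, len(sentences)):
            seen.update(t for t in targets if t in sentences[end])
            if len(seen) == len(targets):
                if best is None or end - start < best:
                    best = end - start
                break
    return best


# Excerpt quoted earlier in this deck (Thamesdown → Swindon).
excerpt = (
    "On 1 April 1997 it was made administratively independent of Wiltshire "
    "County Council, with its council becoming a new unitary authority. "
    "It adopted the name Swindon on 24 April 1997. "
    "The former Thamesdown name and logo are still used by the main local "
    "bus company of Swindon, called Thamesdown Transport Limited."
)
distance = sentence_distance(excerpt, "Thamesdown", "Swindon", 1997)
```

For this excerpt the second and third sentences together cover all three items, so the shortest covering excerpt spans two sentences.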

Generalization

• So far results only for geographic entities
• Manually parsed “List of renamed products” (unstructured)
  • Same analysis
• Similar results
  • 80% of name changes reported in articles (vs. 62.3%)
  • 91.7% span 10 sentences or less (vs. 85.3%)
  • Again more than two-thirds of the changes have a sentence distance < 3
    • 79.7% vs. 66.7%
• Assumption: Similar text patterns
  • Enables automatic classification / detection of evolutions
• Demo @ Digital Libraries 2014
  • http://evobase.l3s.de/DL2014_demo


Generalization: Evolution Base


http://evobase.l3s.de/DL2014_demo/Accenture


Conclusion

We are making the first steps to, e.g.:

• Analyse the content of the archives

• Develop the algorithms to get better entity- and event-centric access

• Create knowledge bases and collections to support these algorithms

A long way to go…

Questions?

Thank you for your attention!

http://alexandria-project.eu
