ALEXANDRIA -

Analysing and Exploring Web Archives

Elena Demidova

L3S Research Center, Hannover

RESAW Seminar, London, December 2014

This presentation contains contributions from Helge Holzmann, Avishek Anand, Mohammad Alrifai, Thomas Risse and Wolfgang Nejdl.

http://alexandria-project.eu

1

The (incomplete) World of Web Archives

Web Archives are an important part of our culture

… but they are underused.

Recently, an increasing interest can be observed

• Historians

• Social Sciences

• Journalism

• Law

Web Archive initiative

2

Web Archives

… are interesting temporal collections

• good documentation of entities, topics, events, societies, etc.

• direct access to official communications, news publications and the public view

… for many disciplines

• Large-scale validation of hypotheses

• Derivation and development of new hypotheses and theories

… but raise many challenges

• huge amounts of unstructured data

• incomplete (e.g. many dead links)

• incoherent (e.g. linked pages are crawled at different times)

• challenging syntax (e.g. erroneous HTML) and semantics (e.g. endless and senseless sentences)

• restricted usage (outside of Alexandria)

3

What are the User Needs?

On a high level

• Searching

• Browsing

• Analysis of “something”

• Visualization

But a “one size fits all” approach is not possible

Every research discipline has different needs

We need to learn what they are by

• Developing basic technologies for temporal access and analytics

• Providing initial tools to foster the discussion

4

Top-100 most controversial Wikipedia articles in English and German

The Alexandria Project

Motivation

Optimal access to Web archives requires new models and algorithms for retrieval, exploration, and analytics, taking into account

• the temporal dimension of Web archives

• structured semantic information available on the Web

• social media and network information

Objectives

• Evolution-Aware Entity-Based Enrichment and Indexing

• Aggregating Social Networks and Streams

• Temporal Retrieval and Ranking

• Collaborative Exploration and Analytics

Project runtime: 03/2014 – 02/2019

5

The Alexandria Project

[Architecture diagram. Labels: Web; Social Networks & Streams; Linked Open Data Cloud; Aggregation & Time-Aware Indexing; Web Archive & Index; Entity Resolution & Evolution; Entity Linking; Enrichment; Improvement; Time-Aware Entity Graph; Time- and Entity-Based Retrieval; complex query; Collaborative Exploration & Analytics. Time axis: t1, t2, t3, t4, tnow. Numbered steps 1 to 7.]

6

The Research Environment

Datasets

• German Web Crawl of the Internet Archive

• German Academic Web snapshots from L3S

• Twitter collection from L3S

• Wikipedia

• Various Linked Data sources like DBpedia, Freebase, YAGO

Technical Infrastructure

• Apache Hadoop, implementing MapReduce for parallel processing of large data sets

• Apache HBase, the Hadoop database: a distributed, scalable big data store

• Apache Pig, a platform for analysing large data sets

7

Web Archives as Data Set

Enables the study of long-term changes and evolution

[Figure: archived snapshots labelled 1997, 2000, 2006, 2009 and 2014]

8

Today: First Alexandria Studies

• Quantitative analysis of the German Internet Archive Dataset

• WikiTimes: Temporal event-centric database based on Wikipedia

• Entity name evolution on Wikipedia

9

The Dawn of Today’s Popular Domains

A Study on 18 Years of the Archived German Web

10

Helge Holzmann, Wolfgang Nejdl, Avishek Anand. In press.

How does the Web change and evolve?

• Is the Web really growing old and, if so, how can we characterize it?

• How has the size of web pages changed over time?

• Do websites from different categories (like business, universities and technology) have different growth rates?

11

The German Internet Archive Dataset

18 years (1996 – 2013) of web data from the German (.de) domain, courtesy of the Internet Archive

Approx. 80 TB of data (without duplicates) + CDX index

11,174,079 domains spanning > 2 years
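
A CDX index lists one capture per line. As a minimal sketch, assuming the common space-separated 11-field layout and a few made-up sample lines, the capture year of each record could be extracted and counted as follows. The field order, the sample data and the helper names are assumptions for illustration, not the project's actual tooling.

```python
from collections import Counter
from datetime import datetime

# Assumed 11-field CDX layout; real index files declare their layout in a header line.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
              "digest", "redirect", "metatags", "length", "offset", "filename"]

def parse_cdx_line(line):
    """Split one CDX line into a dict keyed by the assumed field names."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    record = dict(zip(CDX_FIELDS, parts))
    # Capture timestamps are 14 digits: YYYYMMDDhhmmss
    record["captured"] = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    return record

def captures_per_year(lines):
    """Count captures per calendar year, skipping blank lines."""
    return Counter(parse_cdx_line(l)["captured"].year for l in lines if l.strip())

# Hypothetical sample records (not taken from the dataset)
SAMPLE = [
    "de,example)/ 19970101120000 http://example.de/ text/html 200 AAAA - - 1024 0 a.arc.gz",
    "de,example)/ 20060315080000 http://example.de/ text/html 200 BBBB - - 2048 10 b.arc.gz",
    "de,example)/news 20060316080000 http://example.de/news text/html 200 CCCC - - 512 20 b.arc.gz",
]
counts = captures_per_year(SAMPLE)
```

At the scale of the dataset, the same per-line logic would run as the map step of a parallel job rather than in a single process.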

The Popular German Web

Subset: 100 most popular German (.de) domains from 17 categories of Alexa rankings

12

Domain Emergence

How many of today’s popular domains were known in a certain year?

13


URL Age Evolution

The average age of the Web is increasing almost linearly

Mostly caused by long-living pages (e.g. on university domains)
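
The slides do not spell out how URL age is measured. A minimal sketch, assuming that a URL's age in a given year is the number of years since its first capture and that only URLs alive in that year are counted, could look like this. The lifespans below are hypothetical.

```python
def average_url_age(lifespans, year):
    """Mean age (in years) of URLs alive in `year`.

    lifespans maps url -> (first_seen_year, last_seen_year).
    Assumption: age = year - first_seen_year.
    """
    ages = [year - first
            for first, last in lifespans.values()
            if first <= year <= last]
    return sum(ages) / len(ages) if ages else 0.0

def share_younger_than_a_year(lifespans, year):
    """Fraction of URLs alive in `year` that were first seen in that same year."""
    alive = [first for first, last in lifespans.values() if first <= year <= last]
    return sum(1 for first in alive if first == year) / len(alive) if alive else 0.0

# Hypothetical lifespans: one long-living page and two younger ones
lifespans = {
    "uni.example.de/": (1996, 2013),
    "shop.example.de/": (2005, 2013),
    "shop.example.de/sale": (2012, 2012),
}
avg_2012 = average_url_age(lifespans, 2012)
young_2012 = share_younger_than_a_year(lifespans, 2012)
```

In this toy data the single long-living page dominates the average, which mirrors the observation that long-living pages drive the increase.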

14

URL Age Distribution (Normalized)

Normalized by the number of URLs in a year

Almost 70% of URLs are younger than a year at any time

15

Evolution of the Web's URL Volume

Growth = # Born - # Died

The number of URLs that died is rather constant, while the number of URLs born is growing
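
The relation Growth = # Born - # Died can be sketched per year as follows. This is an illustration under stated assumptions: a URL counts as born in the year of its first capture and as died in the year of its last capture, unless it was still captured in the final year of the collection. The lifespans are hypothetical.

```python
from collections import Counter

def yearly_growth(lifespans, last_year):
    """Return {year: born - died} for every year up to `last_year`.

    lifespans maps url -> (first_seen_year, last_seen_year).
    URLs still seen in `last_year` are treated as alive, not died.
    """
    born = Counter(first for first, _ in lifespans.values())
    died = Counter(last for _, last in lifespans.values() if last < last_year)
    first_year = min(born)
    return {y: born[y] - died[y] for y in range(first_year, last_year + 1)}

# Hypothetical lifespans
lifespans = {
    "a": (1996, 2013),
    "b": (1998, 2001),
    "c": (2001, 2013),
    "d": (2001, 2005),
}
growth = yearly_growth(lifespans, last_year=2013)
alive_at_end = sum(growth.values())  # net growth equals the URLs still alive
```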

16

URL Size Evolution of Alive Sites

Linear growth of size over the years (content and/or markup)

Newborn URLs contribute a major share of the growth in size

17

Web Dynamics Conclusions

• Extensive longitudinal study on 18 years of the popular German Web

• Analyzed how popular domains of today have grown

• In terms of age / volume / size

• Popular educational domains have been around for very long

• Shopping and game websites mainly emerged during last decade

• The Web is actually getting older

• at least the old part of it

• Domains grow exponentially

• doubling their volume every two years

• Tomorrow’s newborn URLs will be bigger than today’s

• relevant for resource planning and allocation, e.g., for Web archives
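
For resource planning, "doubling every two years" can be turned into a yearly growth factor. The following is a small worked sketch; the starting volume of 100 units is a hypothetical figure, not a number from the study.

```python
import math

def annual_growth_factor(doubling_years=2.0):
    """Yearly multiplier implied by a given doubling time."""
    return 2 ** (1 / doubling_years)

def projected_volume(current, years_ahead, doubling_years=2.0):
    """Volume after `years_ahead` years of exponential growth."""
    return current * 2 ** (years_ahead / doubling_years)

factor = annual_growth_factor()        # square root of 2, roughly 41% growth per year
in_four_years = projected_volume(100, 4)  # two doublings of a hypothetical 100 units
```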

18

WikiTimes

A Knowledge Base of News Events with Daily Summaries

19

Giang Binh Tran and Mohammad Alrifai. 2014. Indexing and analyzing Wikipedia's current events portal, the daily news summaries by the crowd. In Proceedings of the WWW Companion '14.

Wikipedia Current Events Portal

• date

• daily updates

• category

• links to a story

• links to news

• locations

20

News Stories on Wikipedia

21

WikiTimes System

http://wikitimes.l3s.de

http://wikitimes.l3s.de/rdf/

22


WikiTimes Timelines

23

Next steps: creating event-centric collections from the Web

• WikiTimes contains about 900 stories from the past 15 years and is growing.

• We intend to create event-centric collections using WikiTimes story timelines and a focused Web crawler

• Interesting scenarios from research perspective(s)?

24

Insights into Entity Name Evolution on Wikipedia

25

Helge Holzmann, Thomas Risse. Web Information Systems Engineering – WISE 2014.

Introduction: Entity Evolution

• Entities evolve over time

• Names

• e.g., Leningrad to Saint Petersburg; Jorge Mario Bergoglio to Pope Francis

• Relations

• e.g., Brad Pitt – Jennifer Aniston to Angelina Jolie

• Roles

• e.g., Obama - from Senator to President; President - from Bush to Obama

26

• Challenges for Information Retrieval in Web Archives

• String Matching

• Names in queries do not match the names in texts anymore

• Entity Search – Finding entities that have evolved over time

• Evolution rarely covered in knowledge bases (DBpedia, YAGO, Freebase, etc.)

Timeline: St. Petersburg → Petrograd (1914) → Leningrad (1924) → St. Petersburg (1991 to today)
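
The string-matching problem can be illustrated with a time-aware name lookup. This is a minimal sketch using the St. Petersburg timeline; the data structure and function names are hypothetical and only show how a query name could be mapped to the name in use at a document's crawl time.

```python
from bisect import bisect_right

# (year the name came into use, name); 1703 is the founding year of the city
NAME_HISTORY = {
    "Saint Petersburg": [
        (1703, "St. Petersburg"),
        (1914, "Petrograd"),
        (1924, "Leningrad"),
        (1991, "St. Petersburg"),
    ],
}

def name_at(entity, year):
    """Return the name the entity carried in `year`."""
    history = NAME_HISTORY[entity]
    starts = [start for start, _ in history]
    idx = bisect_right(starts, year) - 1  # last period starting at or before `year`
    if idx < 0:
        raise ValueError(f"{entity} has no recorded name in {year}")
    return history[idx][1]

def all_names(entity):
    """Every distinct name variant, e.g. to expand a query over an archive."""
    names = []
    for _, name in NAME_HISTORY[entity]:
        if name not in names:
            names.append(name)
    return names

name_1950 = name_at("Saint Petersburg", 1950)
variants = all_names("Saint Petersburg")
```

A query for the current name could then be expanded with `variants` so that pages crawled under a former name are still found.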

Handling of Name Changes on Wikipedia

• According to Wikipedia guidelines:

• In case of a name change: new article, redirect from former name

• Reason is missing, only explanation in revision history (sometimes)

• Former names should be mentioned at the beginning of an article

• E.g., Thamesdown → Swindon (1997)

• “On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited.”

27

Handling of Name Changes on Wikipedia

Research Questions:

• Do excerpts of limited length exist in articles that are dedicated to name changes?

• How many sentences do excerpts span that cover former name, new name and date of change?

28

Analysis

• List pages provide semi-structured information

• e.g., List of city name changes: “Edo → Tokyo (1868)”

• Served as a starting point for our analysis

• Dataset

• 19 semi-structured seed lists (9 redundant) → 10 remaining (only geo names):

• Geographical renaming

• List of city name changes

• List of administrative division name changes

• 7 lists dedicated to renamings of cities in certain countries

• 1,926 distinct entities

• 2,852 name changes

• 2,782 articles (of 1,898 entities)

• 766 entities with names resolvable to multiple articles

• 28 entities could not be resolved
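
List entries such as “Edo → Tokyo (1868)” are regular enough to parse with a small pattern. The sketch below is an illustration only: the entry format, the regular expression and the choice to attach a trailing year to the last change of a chain are assumptions, and real list pages are likely to need more cases.

```python
import re

# One or more names separated by arrows, optionally followed by "(year)"
ENTRY = re.compile(r"^\s*(?P<names>.+?)\s*(?:\((?P<year>\d{3,4})\))?\s*$")

def parse_name_changes(entry):
    """Parse an entry into a list of (former_name, new_name, year_or_None).

    Assumption: a trailing year refers to the last change in the chain.
    """
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError(f"unparseable entry: {entry!r}")
    names = [n.strip() for n in match.group("names").split("→")]
    year = int(match.group("year")) if match.group("year") else None
    changes = []
    for i in range(len(names) - 1):
        is_last = i == len(names) - 2
        changes.append((names[i], names[i + 1], year if is_last else None))
    return changes

single = parse_name_changes("Edo → Tokyo (1868)")
chain = parse_name_changes("Trimontium → Ulpia → Flavia")
```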

29

1. From evolution lists to excerpts

• Most changes: 11 (Plovdiv, Bulgaria)

• Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv

• On average: 1.48 changes, 2.39 different names

Results

• 696 entities remaining with 918 name changes annotated with dates

30

2. Analyzing excerpts

• 572 complete name changes are mentioned in articles (preceding name, succeeding name and date)

• 62.3% of the 918 considered name changes

• Sentence distance

Results

• More than 85% of the 572 considered changes are completely mentioned in excerpts of 10 sentences or fewer

• More than two-thirds of the found changes have a sentence distance of less than 3 (excerpts spanning 3 sentences or fewer)

→ There are (short) passages in Wikipedia articles dedicated to describing an entity’s evolution!
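
One way to read "sentence distance" is the difference between the index of the last and the first sentence of the shortest excerpt that mentions the former name, the new name and the year of change. The sketch below implements that reading. The definition, the naive sentence splitter and the plain substring matching are assumptions, not the method used in the study.

```python
import re
from itertools import product

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def sentence_distance(sentences, old_name, new_name, year):
    """Smallest index span covering old name, new name and year.

    Returns None if any of the three is not mentioned at all.
    """
    hits = [[i for i, s in enumerate(sentences) if term in s]
            for term in (old_name, new_name, str(year))]
    if not all(hits):
        return None
    return min(max(combo) - min(combo) for combo in product(*hits))

# The Thamesdown excerpt quoted in the slides
TEXT = ("On 1 April 1997 it was made administratively independent of Wiltshire "
        "County Council, with its council becoming a new unitary authority. "
        "It adopted the name Swindon on 24 April 1997. "
        "The former Thamesdown name and logo are still used by the main local "
        "bus company of Swindon, called Thamesdown Transport Limited.")

sentences = split_sentences(TEXT)
distance = sentence_distance(sentences, "Thamesdown", "Swindon", 1997)
```

Under this reading the Thamesdown excerpt has a distance of 1: the year and the new name share the second sentence, and the former name follows in the third.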

31

Generalization

• So far results only for geographic entities

• Manually parsed “List of renamed products” (unstructured)

• Same analysis

• Similar results

• 80% of name changes reported in articles (vs. 62.3%)

• 91.7% span 10 sentences or less (vs. 85.3%)

• Again more than two-thirds of the changes have a sentence distance < 3

• 79.7% vs. 66.7%

32

Generalization: Evolution Base

• Assumption: Similar text patterns

• Enables automatic classification / detection of evolutions

• Demo @ Digital Libraries 2014

• http://evobase.l3s.de/DL2014_demo

33

http://evobase.l3s.de/DL2014_demo/Accenture

Conclusion

We are making the first steps to, e.g.:

• Analyse the content of the archives

• Develop the algorithms to get better entity- and event-centric access

• Create knowledge bases and collections to support these algorithms

A long way to go…

Questions?

Thank you for your attention!

http://alexandria-project.eu

34
