ALEXANDRIA - Analysing and Exploring Web Archives

Elena Demidova

L3S Research Center, Hannover

RESAW Seminar, London, December 2014

This presentation contains contributions from Helge Holzmann, Avishek Anand, Mohammad Alrifai, Thomas Risse and Wolfgang Nejdl.

http://alexandria-project.eu

The (incomplete) World of Web Archives

Web Archives are an important part of our culture

… but they are underused.

Recently an increasing interest can be observed

Web Archive initiative

• Historians

• Social Sciences

• Journalism

• Law

Web Archives

… are interesting temporal collections

• good documentation of entities, topics, events, societies, etc.

• direct access to official communications, news publications and the public view

… for many disciplines

• Large-scale validation of hypotheses

• Derivation and development of new hypotheses and theories

… but raise many challenges

• huge amounts of unstructured data

• incomplete (e.g. many dead links)

• incoherent (e.g. linked pages are crawled at different times)

• challenging syntax (e.g. erroneous HTML) and semantics (e.g. endless and senseless sentences)

• restricted usage (outside of Alexandria)

What are the User Needs?

On a high level

• Searching

• Browsing

• Analysis of “something”

• Visualization

But a “one size fits all” approach is not possible

Every research discipline has different needs

We need to learn what they are by

• Developing basic technologies for

temporal access and analytics

• Providing initial tools to foster the discussion

Top-100 most controversial Wikipedia

articles in English and German

The Alexandria Project

Motivation

Optimal access to Web archives requires new models and algorithms for retrieval, exploration, and analytics, taking into account

• the temporal dimension of Web archives

• structured semantic information available on the Web

• social media and network information

Objectives

• Evolution-Aware Entity-Based Enrichment and Indexing

• Aggregating Social Networks and Streams

• Temporal Retrieval and Ranking

• Collaborative Exploration and Analytics

Project runtime: 03/2014 – 02/2019

The Alexandria Project

[Figure: Alexandria architecture overview. Labelled components: Web, Social Networks & Streams and Linked Open Data Cloud as sources over time (t1 … t4, now); Web Archive & Index; Aggregation & Time-Aware Indexing; Entity Linking; Entity Resolution & Evolution; Time-Aware Entity Graph; Enrichment; Improvement; Time- and Entity-Based Retrieval (complex query); Collaborative Exploration & Analytics.]

The Research Environment

Datasets

• German Web Crawl of the Internet Archive

• German Academic Web snapshots from L3S

• Twitter collection from L3S

• Wikipedia

• Various Linked Data sources like DBpedia, Freebase, YAGO

Technical Infrastructure

• Apache Hadoop, implementing MapReduce for parallel processing of large data sets.

• Apache HBase - the Hadoop database, a distributed, scalable, big data store.

• Apache Pig - a platform for analysing large data sets.
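
As a rough illustration of the MapReduce style of processing named above, the following plain-Python sketch counts captures per year from CDX-like index lines. The line layout (a URL key followed by a 14-digit timestamp), the sample lines and all function names are assumptions made for this example; it does not reproduce the project's actual Hadoop or Pig jobs.

```python
from collections import defaultdict

# Hypothetical sample lines, assumed layout: "<url key> <14-digit timestamp> ..."
SAMPLE_LINES = [
    "de,example)/ 19961105120000 http://example.de/ text/html 200",
    "de,example)/about 19970210093000 http://example.de/about text/html 200",
    "de,uni-example)/ 19970301000000 http://uni-example.de/ text/html 200",
    "malformed line without timestamp",
]


def map_capture(line):
    """Map step: emit (year, 1) for every well-formed index line."""
    fields = line.split()
    if len(fields) >= 2 and len(fields[1]) == 14 and fields[1].isdigit():
        yield fields[1][:4], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key (here: per year)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def captures_per_year(lines):
    """Run the map step over all lines and reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_capture(line))
    return reduce_counts(pairs)


if __name__ == "__main__":
    print(captures_per_year(SAMPLE_LINES))
```

On a cluster the map and reduce steps would run in parallel over partitions of the index; here they run sequentially only to show the shape of the computation.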

Web Archives as Data Set

Enables studying of long-term changes and evolutions

Today: First Alexandria Studies

• Quantitative analysis of the German Internet Archive Dataset

• WikiTimes: Temporal event-centric database based on Wikipedia

• Entity name evolution on Wikipedia

The Dawn of Today’s Popular Domains

A Study on 18 Years of the Archived German Web

Helge Holzmann, Wolfgang Nejdl, Avishek Anand. In press.

How does the Web change and evolve?

• Is the Web really growing old and if so, how can we

characterize it?

• How has the size of web pages changed over time?

• Do websites from different categories (like business,

universities and technology) have different growth rates?

The German Internet Archive Dataset

18 years (1996 – 2013) of web data from the German (.de) domain

courtesy The Internet Archive

Approx. 80TB data (w/o duplicates) + CDX Index

11,174,079 domains spanning > 2 years

The Popular German Web

Subset: 100 most popular German (.de) domains from 17 categories of

Alexa rankings

Domain Emergence

How many of today's popular domains were known in a certain year?

URL Age Evolution

The average age of the Web is increasing almost linearly

Mostly caused by long-living pages (e.g. in the Universities category)

URL Age Distribution (Normalized)

Normalized by the number of URLs in a year

Almost 70% of URLs are younger than a year at any time
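
A normalized age distribution of this kind could be computed roughly as follows. This is a minimal sketch under stated assumptions: each URL is reduced to the first and last year in which it was captured, age is measured in whole years since the first capture, and the sample data and function names are hypothetical.

```python
from collections import Counter

# Hypothetical lifespans: URL -> (first year seen, last year seen).
LIFESPANS = {
    "http://a.example.de/": (1996, 2000),
    "http://b.example.de/": (1999, 1999),
    "http://c.example.de/": (2000, 2000),
    "http://d.example.de/": (2000, 2001),
}


def age_distribution(lifespans, year):
    """Return {age in years: share of the URLs alive in `year`}.

    Shares are normalized by the number of URLs alive in that year,
    so they sum to 1 whenever at least one URL is alive.
    """
    ages = Counter(
        year - first
        for first, last in lifespans.values()
        if first <= year <= last
    )
    total = sum(ages.values())
    if total == 0:
        return {}
    return {age: count / total for age, count in sorted(ages.items())}


def average_age(lifespans, year):
    """Mean age in years of the URLs alive in `year` (0 if none are alive)."""
    return sum(age * share for age, share in age_distribution(lifespans, year).items())


if __name__ == "__main__":
    print(age_distribution(LIFESPANS, 2000))
    print(average_age(LIFESPANS, 2000))
```

In this sketch the share at age 0 corresponds to the URLs that are younger than a year.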

Evolution of the Web's URL Volume

Growth = # Born - # Died

# Died URLs is rather constant, # Born URLs is growing
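
The relation Growth = # Born - # Died can be sketched in a few lines. The definitions used here are assumptions for illustration: a URL is "born" in the year of its first capture and counted as "died" in the first year after its last capture; the sample data and function names are hypothetical.

```python
# Hypothetical lifespans: URL -> (first year seen, last year seen).
URL_LIFESPANS = {
    "u1": (1996, 1998),
    "u2": (1997, 1997),
    "u3": (1998, 2000),
    "u4": (1998, 1999),
}


def yearly_growth(lifespans, year):
    """Return (born, died, growth) for `year`, with growth = born - died."""
    born = sum(1 for first, _ in lifespans.values() if first == year)
    # Assumption: a URL last seen in year - 1 counts as died in `year`.
    died = sum(1 for _, last in lifespans.values() if last == year - 1)
    return born, died, born - died


def volume(lifespans, year):
    """Number of URLs alive (captured at or around) `year`."""
    return sum(1 for first, last in lifespans.values() if first <= year <= last)


if __name__ == "__main__":
    print(yearly_growth(URL_LIFESPANS, 1998))
    print(volume(URL_LIFESPANS, 1998) - volume(URL_LIFESPANS, 1997))
```

With these definitions the growth of a year equals the change in volume compared with the previous year.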

URL Size Evolution of Alive Sites

Linear growth of size over the years (content and/or markup)

Newborn URLs contribute a major part of the growth in size

Web Dynamics Conclusions

• Extensive longitudinal study on 18 years of the popular German Web

• Analyzed how popular domains of today have grown

• In terms of age / volume / size

• Popular educational domains have been around for very long

• Shopping and game websites mainly emerged during the last decade

• The Web is actually getting older

• at least the old part of it

• Domains grow exponentially

• doubling their volume every two years

• Tomorrow’s newborn URLs will be bigger than today's

• resource planning and allocation, e.g., for Web archives
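
For resource planning, the statement that domains double their volume every two years can be turned into a simple projection. The sketch below assumes ideal exponential growth with a fixed doubling time; the function names and the starting volume are hypothetical.

```python
def projected_volume(initial_volume, years, doubling_time=2.0):
    """Volume after `years`, assuming it doubles every `doubling_time` years."""
    return initial_volume * 2 ** (years / doubling_time)


def annual_growth_rate(doubling_time=2.0):
    """Yearly growth factor minus one that yields the given doubling time."""
    return 2 ** (1 / doubling_time) - 1


if __name__ == "__main__":
    # Hypothetical starting volume of 100 units, projected four years ahead.
    print(projected_volume(100, 4))
    print(annual_growth_rate())
```

A doubling time of two years corresponds to a yearly growth of roughly 41%.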

WikiTimes

A Knowledge Base of News Events with Daily Summaries

Giang Binh Tran and Mohammad Alrifai. 2014. Indexing and analyzing Wikipedia's current events portal, the daily news summaries by the crowd. In Proceedings of the WWW Companion '14.

Wikipedia Current Events Portal

• date

• daily updates

• category

• links to a story

• links to news

• locations

News Stories on Wikipedia

WikiTimes System

http://wikitimes.l3s.de

http://wikitimes.l3s.de/rdf/

WikiTimes Timelines

Next steps: creating event-centric collections from the Web

• WikiTimes contains about 900 stories from the past 15 years and is growing.

• We intend to create event-centric collections using WikiTimes story timelines and a focused Web crawler

• Interesting scenarios from research perspective(s)?

Insights into Entity Name Evolution

on Wikipedia

Helge Holzmann, Thomas Risse. Web Information Systems Engineering – WISE 2014.

Introduction: Entity Evolution

• Entities evolve over time

• Names

• e.g., Leningrad to Saint Petersburg; Jorge Mario Bergoglio to Pope Francis

• Relations

• e.g., Brad Pitt – Jennifer Aniston to Angelina Jolie

• Roles

• e.g., Obama - from Senator to President; President - from Bush to Obama

• Challenges for Information Retrieval in Web Archives

• String Matching

• Names in queries do not match the name in texts anymore

• Entity Search – Finding entities that have evolved over time

• Evolution rarely covered in knowledge bases (DBpedia, YAGO, Freebase, …)

Timeline: St. Petersburg → Petrograd (1914) → Leningrad (1924) → St. Petersburg (1991, until today)
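
To make the string-matching problem concrete, the sketch below resolves the name that was valid in a given year and lists all known names of an entity, which a retrieval system could use to expand a query. The data structure, the year granularity (the year of a change is assigned to the new name) and the function names are assumptions for illustration, not the project's actual model.

```python
# (name, first valid year or None, year of the next change or None if current)
NAME_PERIODS = [
    ("St. Petersburg", None, 1914),
    ("Petrograd", 1914, 1924),
    ("Leningrad", 1924, 1991),
    ("St. Petersburg", 1991, None),
]


def name_in_year(periods, year):
    """Return the name valid in `year`, or None if no period covers it."""
    for name, start, end in periods:
        if (start is None or start <= year) and (end is None or year < end):
            return name
    return None


def all_names(periods):
    """Distinct names in order of first appearance, e.g. for query expansion."""
    return list(dict.fromkeys(name for name, _, _ in periods))


if __name__ == "__main__":
    print(name_in_year(NAME_PERIODS, 1950))
    print(all_names(NAME_PERIODS))
```

A query for "St. Petersburg" restricted to documents from 1950 could then also be matched against "Leningrad".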

Handling of Name Changes on Wikipedia

• According to Wikipedia guidelines:

• In case of a name change: new article, redirect from former name

• Reason is missing, only explanation in revision history (sometimes)

• Former names should be mentioned at the beginning of an article

• E.g., Thamesdown → Swindon (1997)

• “On 1 April 1997 it was made administratively independent of Wiltshire County Council, with its council becoming a new unitary authority. It adopted the name Swindon on 24 April 1997. The former Thamesdown name and logo are still used by the main local bus company of Swindon, called Thamesdown Transport Limited.”

Research Questions:

• Do excerpts of limited length exist in articles that are dedicated to name changes?

• How many sentences do excerpts span that cover former name, new name and date of change?

Handling of Name Changes on Wikipedia

• List pages provide semi-structured information

• e.g., List of city name changes: “Edo → Tokyo (1868)”

• Served as a starting point for our analysis
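
Entries in the form quoted above lend themselves to simple pattern matching. The following sketch parses a single "old → new (year)" entry; the regular expression and the function name are assumptions for illustration, and real list pages are likely to need more cases (missing dates, chains of several names, alternative spellings).

```python
import re

# Assumed entry shape: "<former name> → <new name> (<year>)"
ENTRY = re.compile(r"^(?P<old>.+?)\s*→\s*(?P<new>.+?)\s*\((?P<year>\d{3,4})\)\s*$")


def parse_change(entry):
    """Return (former name, new name, year) or None if the entry does not match."""
    match = ENTRY.match(entry.strip())
    if match is None:
        return None
    return match["old"], match["new"], int(match["year"])


if __name__ == "__main__":
    print(parse_change("Edo → Tokyo (1868)"))
    print(parse_change("Thamesdown → Swindon (1997)"))
```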

Analysis

• Dataset

• 19 semi-structured seed lists (9 redundant) → 10 remaining (only geo names):

• Geographical renaming

• List of city name changes

• List of administrative division name changes

• 7 lists dedicated to renamings of cities in certain countries

• 1,926 distinct entities

• 2,852 name changes

• 2,782 articles (of 1,898 entities)

• 766 entities with names resolvable to multiple articles

• 28 entities could not be resolved

• Most changes: 11 (Plovdiv, Bulgaria)

• Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → Filibe → Plovdiv

• On average: 1.48 changes, 2.39 different names
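
The counts above can be checked with a few lines of code. The sketch splits the Plovdiv chain into names, counts the changes between them, and recomputes the average number of changes per entity from the totals on the slide; the function names are hypothetical.

```python
PLOVDIV_CHAIN = (
    "Kendros (Kendrisos/Kendrisia) → Odryssa → Eumolpia → Philipopolis → "
    "Trimontium → Ulpia → Flavia → Julia → Paldin/Ploudin → Poulpoudeva → "
    "Filibe → Plovdiv"
)


def chain_names(chain):
    """Split an arrow-separated chain into its individual names."""
    return [part.strip() for part in chain.split("→") if part.strip()]


def count_changes(chain):
    """Number of name changes in a chain (one less than the number of names)."""
    return max(len(chain_names(chain)) - 1, 0)


# Totals taken from the slide: 2,852 name changes over 1,926 distinct entities.
AVERAGE_CHANGES = 2852 / 1926

if __name__ == "__main__":
    print(count_changes(PLOVDIV_CHAIN))
    print(round(AVERAGE_CHANGES, 2))
```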

1. From evolution lists to excerpts

Results

• 696 entities remaining with 918 name changes annotated with dates

• 572 complete name changes are mentioned in articles (preceding name, succeeding name and date)

• 62.3% of the 918 considered name changes

2. Analyzing excerpts

• Sentence distance

Results

• More than 85% of the 572 considered changes are completely mentioned

in excerpts with 10 sentences or less

• More than two-thirds of the found changes have a sentence

distance of less than 3 (excerpts spanning 3 sentences or less)

→ There are (short) passages in Wikipedia articles dedicated to describing

an entity’s evolution!
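
One way to measure such a sentence distance is sketched below, using the Thamesdown excerpt quoted earlier in the talk. The definition is an assumption chosen to match the wording above (a distance of d corresponds to an excerpt spanning d + 1 sentences), the sentence splitter is deliberately naive, and the function names are hypothetical; the study's actual procedure may differ.

```python
import re

EXCERPT = (
    "On 1 April 1997 it was made administratively independent of Wiltshire "
    "County Council, with its council becoming a new unitary authority. "
    "It adopted the name Swindon on 24 April 1997. "
    "The former Thamesdown name and logo are still used by the main local bus "
    "company of Swindon, called Thamesdown Transport Limited."
)


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_distance(text, terms):
    """Smallest (last index - first index) of a run of consecutive sentences
    that together mention every term; None if some term never occurs."""
    sentences = split_sentences(text)
    best = None
    for start in range(len(sentences)):
        missing = set(terms)
        for end in range(start, len(sentences)):
            missing = {term for term in missing if term not in sentences[end]}
            if not missing:
                distance = end - start
                if best is None or distance < best:
                    best = distance
                break
    return best


if __name__ == "__main__":
    # Former name, new name and year of the change.
    print(sentence_distance(EXCERPT, ["Thamesdown", "Swindon", "1997"]))
```

Plain substring matching is used here for brevity; names that are substrings of other words would need token-level matching.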

Generalization

• So far results only for geographic entities

• Manually parsed “List of renamed products” (unstructured)

• Same analysis

• Similar results

• 80% of name changes reported in articles (vs. 62.3%)

• 91.7% span 10 sentences or less (vs. 85.3%)

• Again more than two-thirds of the changes have a sentence distance < 3

• 79.7% vs. 66.7%

• Assumption: Similar text patterns

• Enables automatic classification / detection of evolutions

• Demo @ Digital Libraries 2014

• http://evobase.l3s.de/DL2014_demo

Generalization: Evolution Base

http://evobase.l3s.de/DL2014_demo/Accenture

Conclusion

We are making the first steps to, e.g.:

• Analyse the content of the archives

• Develop the algorithms to get better entity- and event-centric access

• Create knowledge bases and collections to support these

algorithms

A long way to go…

Questions?

Thank you

for your attention!

http://alexandria-project.eu
