Web Archive Content Analysis: Disaster Events Case Study

IIPC 2015 General Assembly, Stanford University and Internet Archive

Mohamed Farag, Dr. Edward A. Fox

mmagdy@vt.edu, [email protected]

DLRL, CS @ Virginia Tech
April 27 – May 1, 2015

Acknowledgments

• Related Funding:
  – 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
  – 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
  – 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
• The Internet Archive (Kristine Hanna, co-PI):
  – Heritrix crawler and other tools and support
  – Hosting the crawls and resulting archives

IDEAL team also includes Drs. Kavanaugh, Sheetz, and Shoemaker; and GRA Sunshin Lee

Outline

• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future Work

Building archives for events - 1 Manual Curation

• We have created ~60 collections (https://archive-it.org/organizations/156)

• These collections are about disaster events: bombings, earthquakes, hurricanes, plane crashes, shootings, floods, fires

• Manual preparation of URLs and archiving using Archive-it service

Sample Web Collections

Collection Name               No. of Seeds
Alabama University Shooting   116
April 16 Archive              88
Chile Earthquake              19
Nevada air race crash         64
China Floods                  60
Encephalitis (India)          59
Hurricane Irene               70

Building archives for events - 2 Seeds from social media (Twitter)

• We created more than 600 tweet collections with ~1 billion tweets

• For each collection we extract URLs in the tweets, fetch webpages, and archive just those webpages

• Webpage collections are of two types:
  – Disaster events: shootings, earthquakes, plane crashes, hurricanes, bombings, terrorism, floods, fire
  – Community and political events

Sample Tweet Collections

Collection               Keywords/Hashtags   No. of Tweets   Start date
Hurricane Sandy          hurricane sandy     3,219,383       2012-10-26
Ebola                    #ebola              1,855,680       2014-07-30
Ferguson shooting        #Ferguson           1,580,479       2014-08-11
Thanksgiving             #Thanksgiving       214,888         2014-11-20
AirAsia Plane Crash      #QZ8501             174,353         2014-12-30
Charlie Hebdo shooting   #CharlieHebdo       451,009         2015-01-07
Iran Talks               #IranTalks          117,966         2015-04-02

For the full list, see: http://hadoop.dlib.vt.edu:81/twitter/

Building archives for events - 2 Seeds from social media

[Workflow diagram]
Collect: Event → Keyword/Hashtag → Collect Tweets → Tweet Collection
Archive/Organize/Analyze: Extract URLs → Shortened URLs → Expand → Original Webpages → Archive (WARC) → Index (SOLR)
Access: Browse (Wayback), Search
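
The URL extraction and expansion steps of this workflow can be sketched roughly as follows. This is a minimal illustration, not the project's code: the regular expression, the function names, and the `REDIRECTS` lookup table (which stands in for real HTTP redirect resolution) are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable

# Matches http/https URLs up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweets: Iterable[str]) -> list[str]:
    """Return the unique URLs found in the tweets, in first-seen order."""
    seen: dict[str, None] = {}
    for text in tweets:
        for url in URL_PATTERN.findall(text):
            # Drop punctuation that often trails a URL at the end of a sentence.
            seen.setdefault(url.rstrip(".,;:!?)\"'"), None)
    return list(seen)


def expand_urls(urls: Iterable[str], resolve: Callable[[str], str]) -> list[str]:
    """Map each shortened URL to its target and de-duplicate the results."""
    expanded: dict[str, None] = {}
    for url in urls:
        expanded.setdefault(resolve(url), None)
    return list(expanded)


# Hypothetical lookup table standing in for real HTTP redirect resolution.
REDIRECTS = {
    "http://t.co/abc": "http://example.org/ebola-update",
    "http://t.co/xyz": "http://example.org/ebola-update",
}

tweets = [
    "Ebola update http://t.co/abc #ebola",
    "RT: latest numbers (http://t.co/xyz).",
]
short_urls = extract_urls(tweets)
seed_urls = expand_urls(short_urls, lambda u: REDIRECTS.get(u, u))
print(seed_urls)
```

Passing the resolver in as a function keeps the sketch testable offline; in practice it would be replaced by a call that follows redirects, and the de-duplicated list would be handed to the archiving step.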

Building archives for events - 3 Focused Crawling

• Curator selects high quality seed URLs

• Use Event Focused Crawler (EFC) to retrieve webpages that are highly similar to those with the seed URLs

• Curator can configure EFC to adjust the number of webpages retrieved and the quality of retrieved webpages (similarity threshold)
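
One way to picture the similarity threshold is the sketch below. The slides do not say how EFC scores pages, so cosine similarity over term-frequency vectors is an assumption here, and `select_pages`, `threshold`, and `max_pages` are hypothetical names standing in for the curator-configurable settings.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_pages(seed_texts, candidates, threshold=0.5, max_pages=100):
    """Keep candidates similar enough to the seeds, most similar first.

    candidates maps URL -> page text. threshold and max_pages are
    hypothetical stand-ins for the settings a curator would adjust.
    """
    seed_profile = Counter()
    for text in seed_texts:
        seed_profile.update(term_vector(text))
    scored = [(cosine(seed_profile, term_vector(text)), url)
              for url, text in candidates.items()]
    kept = sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)
    return [url for _, url in kept[:max_pages]]


seeds = ["Ebola outbreak in West Africa", "Ebola virus disease outbreak update"]
pages = {
    "http://example.org/a": "Outbreak of Ebola virus disease in West Africa",
    "http://example.org/b": "Football scores from the weekend",
}
selected = select_pages(seeds, pages, threshold=0.5)
print(selected)
```

Raising the threshold trades recall for precision, while `max_pages` caps the size of the resulting collection.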

Building archives for events - 3 Focused Crawling

Outline

• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future Work

Event Model and Representation

• Modeling events
  – What happened, where, and when

• Information retrieval
  – Helps find the What part (Vector Space/Probabilistic)

• Natural language processing
  – Helps find the Where and When parts (Named Entity Recognition)

Event Model and Representation

• Educational activities
  – CS4984 Computational Linguistics (Fall 2014)
  – CS5604 Information Retrieval (Spring 2015)

• Equipment
  – Hadoop cluster with 20 data nodes
  – 612 RAM, 76 Cores, and 60 TB Disk

• Processing methods
  – Stanford Named Entity Recognition
  – Mahout routine for topic identification
  – Python programming for text analysis (Hadoop streaming)
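
To give a feel for Python text analysis under Hadoop streaming, here is a small word-count sketch. It is only an illustration of the streaming style (tab-separated key/value records, reducer input sorted by key), not the scripts used on the cluster; `mapper`, `reducer`, and `run_local` are hypothetical names.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in a single process for testing."""
    return list(reducer(sorted(mapper(lines))))


counts = run_local(["ebola outbreak", "Ebola virus"])
print(counts)
```

On a cluster, the mapper and reducer would each be a script that reads records from standard input and prints records to standard output; the local runner above only mimics the sort that happens between the two stages.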

Outline

• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future Work

Assessing archive quality using event model

• Approaches to textual and linguistic analysis of an archive
  – Frequent and important words in whole collection
  – Important sentences, sentences that have one or more frequent words
  – Frequent entities (location and dates) extracted from important sentences
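
The first two approaches can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the tiny stop list, the naive sentence splitter, and the function names are made up for the example, and a real run would more likely rely on NLTK's tokenizers.

```python
import re
from collections import Counter

# A tiny illustrative stop list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lower-case alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequent_words(documents, top_n=10):
    """Most frequent tokens across the whole collection."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(top_n)]


def important_sentences(documents, keywords, min_hits=2, top_k=2):
    """Rank sentences by how many distinct frequent words they contain."""
    keyset = set(keywords)
    scored = []
    for doc in documents:
        for sentence in split_sentences(doc):
            hits = len(keyset & set(tokenize(sentence)))
            if hits >= min_hits:
                scored.append((hits, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = [
    "Outbreak of Ebola virus disease in West Africa. Officials met today.",
    "Ebola virus disease spreads in West Africa. The weather was mild.",
]
keywords = frequent_words(docs, top_n=5)
top = important_sentences(docs, keywords)
print(keywords, top)
```

Sentences that pass the `min_hits` filter would then be handed to a named entity recognizer to pull out locations and dates.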

Assessing archive quality using event model

[Pipeline diagram]
Webpages → Text Extraction → Text Content
Text Content → Frequency Analysis → Frequent Words
Text Content → Sentence Tokenization → Sentences → Keyword Matching → Selected Sentences
Selected Sentences → Named Entity Recognition → Event Entities → Aggregation → Event Model

Event Model
Topic: (t1, t2, .., tn)
Location: (l1, l2, .., ln)
Date: (d1, d2, .., dn)
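
The aggregation step that produces the event model might look like the sketch below. It assumes the named entity recognizer has already returned, for each selected sentence, a dictionary with `LOCATION` and `DATE` lists; the `EventModel` class and `build_event_model` function are hypothetical names, not part of the tool.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EventModel:
    """Topic, location, and date terms that describe one event."""
    topic: list = field(default_factory=list)
    location: list = field(default_factory=list)
    date: list = field(default_factory=list)


def build_event_model(frequent_words, sentence_entities, top_n=3):
    """Aggregate per-sentence entities into the most frequent values."""
    locations, dates = Counter(), Counter()
    for entities in sentence_entities:
        locations.update(entities.get("LOCATION", []))
        dates.update(entities.get("DATE", []))
    return EventModel(
        topic=list(frequent_words),
        location=[name for name, _ in locations.most_common(top_n)],
        date=[value for value, _ in dates.most_common(top_n)],
    )


# Entity dictionaries written by hand to mimic recognizer output.
entities = [
    {"DATE": ["August 2014"], "LOCATION": ["West Africa"]},
    {"LOCATION": ["West Africa"]},
]
model = build_event_model(["ebola", "virus", "disease"], entities)
print(model)
```

Counting entities across sentences means a location or date mentioned in many selected sentences outranks one that appears only once.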

Example

• Ebola Outbreak (22 documents)
• Top 10 frequent words and top 2 sentences that include 2 or more frequent words

Frequent Words: Ebola, Virus, Disease, Health, 2014, Africa, West, Ago, University, Outbreak

Important Sentences and Extracted Entities:

- Outbreak of Ebola virus disease in West Africa: third update, 1 August 2014. (7)
  DATE: ['August 2014'], LOCATION: ['West Africa']

- ECDC (2014) Outbreak of Ebola virus disease in West Africa. (7)
  LOCATION: [u'West Africa']

Outline

• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future Work

Archive Quality Assessment

• http://nick.dlib.vt.edu/EventModel/
• Input:
  – Existing collections, WARC file, Text file with list of URLs
• Frontend: HTML, JavaScript/Dojo
• Backend: Python, NLTK

Sample Results

Future Work

• Use event model to:
  – Summarize event collection (generate most informative sentence)
  – Extract relevant parts from webpage

IDEAL Interface

• http://nick.dlib.vt.edu/ideal/collections/index.php

• Collections
  – 11 event categories, 2 events each (Small and Big size)
  – Total 1.6 M documents

• Services:
  – Search (keywords, web collections text)
  – Browse (Event categories and events metadata, web and tweet collections)

Technologies

• Search engine
  – Solr 4.9 (http://lucene.apache.org/solr/)

• Web Interface
  – Apache server
  – JavaScript - Solr library (https://github.com/evolvingweb/ajax-solr/wiki)

• Tweets archiving
  – yourTwapperKeeper (https://github.com/540co/yourtwapperkeeper)

• Webpages archiving
  – Archive-it service from Internet Archive (https://archive-it.org/)
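
As a rough idea of how a keyword search with facets could be sent to Solr, the sketch below builds a query URL for the standard `select` handler. The request parameters (`q`, `fq`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr ones, but the host, the core name `ideal`, and the field names `category` and `event` are assumptions made for illustration; the actual schema is not given in the slides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_solr_query(base_url, keywords, category=None, rows=10):
    """Build a Solr select URL with facets on two hypothetical fields."""
    params = [
        ("q", keywords),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", "category"),  # hypothetical field name
        ("facet.field", "event"),     # hypothetical field name
    ]
    if category:
        # A filter query narrows results without affecting relevance scoring.
        params.append(("fq", f'category:"{category}"'))
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name.
url = build_solr_query("http://localhost:8983/solr/ideal", "night club", category="Fire")
print(parse_qs(urlsplit(url).query)["fq"])
```

Selecting a facet value in the interface corresponds to adding a filter query like the one above, which is how a search can be limited to all events under one category.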

Collections

Category/Collection   Big                          Small
Accident              Train derailment in Quebec   Texas factory explosion
Bombing               Boston bombing               Somalia Blast
Community             Blacksburg events            Labor day and world cup 2014
Disease Outbreak      Ebola                        encephalitis
Earthquake            Turkey earthquake            Virginia earthquake and others
Fire                  Brazil night club fire       Texas wild fire
Flood                 Pakistan flood               China flood and Islip 13 inch rain
Hurricane             Hurricane Sandy              Typhoon Haiyan
Plane Crash           Russia Plane Crash           Nevada air race crash
Shooting              April 16 shooting            Norway shooting and others

Search Interface

Searching Sandy

Faceted Search: Search all events under Fire

Faceted Search: Search Brazil Night Club Fire

Browse Interface

Select Event Type

Select Event

Hurricane Events