Web Archive Content Analysis: Disaster Events Case Study
IIPC 2015 General Assembly, Stanford University and Internet Archive
Mohamed Farag, Dr. Edward A. Fox
[email protected], [email protected]
DLRL, CS @ Virginia Tech
April 27 – May 1, 2015
Acknowledgments
• Related Funding:
  – 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
  – 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
  – 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
• The Internet Archive (Kristine Hanna, co-PI):
  – Heritrix crawler and other tools and support
  – Hosting the crawls and resulting archives
IDEAL team also includes Drs. Kavanaugh, Sheetz, and Shoemaker; and GRA Sunshin Lee
Outline
• Building archives for events
• Event modeling and representation
• Assessing archive quality using event model
• Quality tool and results
• Future Work
Building archives for events - 1: Manual Curation
• We have created ~ 60 collections ( https://archive-it.org/organizations/156 )
• These collections are about disaster events: bombings, earthquakes, hurricanes, plane crashes, shootings, floods, fires
• Seed URLs are prepared manually and archived using the Archive-it service
Sample Web Collections

Collection Name               No. of Seeds
Alabama University Shooting   116
April 16 Archive              88
Chile Earthquake              19
Nevada air race crash         64
China Floods                  60
Encephalitis (India)          59
Hurricane Irene               70
Building archives for events - 2: Seeds from social media (Twitter)
• We created more than 600 tweet collections with ~ 1 billion tweets
• For each collection we extract URLs in the tweets, fetch webpages, and archive just those webpages
• Webpage collections are of two types:
  – Disaster events: shootings, earthquakes, plane crashes, hurricanes, bombings, terrorism, floods, fire
  – Community and political events
Sample Tweet Collections

Collection               Keywords/Hashtags   No. of Tweets   Start date
Hurricane Sandy          hurricane sandy     3,219,383       2012-10-26
Ebola                    #ebola              1,855,680       2014-07-30
Ferguson shooting        #Ferguson           1,580,479       2014-08-11
Thanksgiving             #Thanksgiving       214,888         2014-11-20
AirAsia Plane Crash      #QZ8501             174,353         2014-12-30
Charlie Hebdo shooting   #CharlieHebdo       451,009         2015-01-07
Iran Talks               #IranTalks          117,966         2015-04-02
For full list check: http://hadoop.dlib.vt.edu:81/twitter/
Building archives for events - 2: Seeds from social media
[Figure: processing pipeline in three stages. Collect: Event → Keyword/Hashtag → Collect Tweets → Tweet Collection. Archive/Organize/Analyze: Extract URLs → Shortened URLs → Expand → Original Webpages → Archive (WARC) → Index (SOLR). Access: Browse, Wayback, Search.]
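The "Extract URLs" and "Expand" steps of this pipeline can be illustrated with a minimal sketch. It is not the project's code: the function names, the regular expression, and the pluggable `resolve` callback are assumptions made for illustration. `resolve_with_http` shows one possible resolver that follows redirects over the network.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional

# Assumption: a simple pattern is enough to pull http(s) links out of tweet text.
URL_PATTERN = re.compile(r"https?://\S+")
TRAILING_PUNCT = ".,;:!?)\"'"


def extract_urls(tweets: Iterable[str]) -> List[str]:
    """Return the distinct URLs found in the tweets, in first-seen order."""
    seen = set()
    urls = []
    for text in tweets:
        for match in URL_PATTERN.findall(text):
            url = match.rstrip(TRAILING_PUNCT)
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


def expand_urls(short_urls: Iterable[str],
                resolve: Callable[[str], Optional[str]]) -> List[str]:
    """Map shortened URLs to their targets, dropping failures and duplicates."""
    seen = set()
    expanded = []
    for short in short_urls:
        target = resolve(short)
        if target and target not in seen:
            seen.add(target)
            expanded.append(target)
    return expanded


def resolve_with_http(url: str, timeout: float = 10.0) -> Optional[str]:
    """Hypothetical resolver: follow redirects and return the final URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except (urllib.error.URLError, ValueError):
        return None
```

Passing the resolver in as a parameter keeps the deduplication logic independent of the network, so a lookup table can stand in for HTTP requests during testing.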
Building archives for events - 3: Focused Crawling
• Curator selects high quality seed URLs
• Use Event Focused Crawler (EFC) to retrieve webpages that are highly similar to the webpages at the seed URLs
• Curator can configure EFC to adjust the number of webpages retrieved and the quality of retrieved webpages (similarity threshold)
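The two curator settings (number of pages and similarity threshold) can be illustrated with a small sketch. The slides do not say how EFC scores pages, so the bag-of-words cosine similarity used here is an assumption, and `select_pages`, `threshold`, and `max_pages` are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, List


def term_vector(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_pages(seed_texts: Iterable[str],
                 candidates: Dict[str, str],
                 threshold: float = 0.5,
                 max_pages: int = 100) -> List[str]:
    """Keep candidate URLs whose text is similar enough to the seed pages.

    Assumption: the seeds are merged into a single term vector; a real
    crawler may use a different representation of the event.
    """
    seed_vec = Counter()
    for text in seed_texts:
        seed_vec.update(term_vector(text))
    scored = ((cosine(seed_vec, term_vector(text)), url)
              for url, text in candidates.items())
    kept = sorted(((s, u) for s, u in scored if s >= threshold), reverse=True)
    return [url for _, url in kept[:max_pages]]
```

Raising `threshold` trades recall for precision, and `max_pages` caps the collection size regardless of how many pages pass the threshold.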
Event Model and Representation
• Modeling events
  – What happened, where, and when
• Information retrieval
  – Helps find the What part (Vector Space/Probabilistic)
• Natural language processing
  – Helps find the Where and When parts (Named Entity Recognition)
Event Model and Representation
• Educational activities
  – CS4984 Computational Linguistics (Fall 2014)
  – CS5604 Information Retrieval (Spring 2015)
• Equipment
  – Hadoop cluster with 20 data nodes
  – 612 RAM, 76 Cores, and 60 TB Disk
• Processing methods
  – Stanford Named Entity Recognition
  – Mahout routine for topic identification
  – Python programming for text analysis (Hadoop streaming)
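As an illustration of Python text analysis with Hadoop streaming, here is a minimal word-count sketch. It is not the course code. Hadoop streaming pipes input lines to a mapper on standard input, sorts the mapper's tab-separated key/value output by key, and pipes it to a reducer; the function names and the word pattern are assumptions.

```python
import re
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in re.findall(r"[a-z]+", line.lower()):
            yield f"{word}\t1"


def reduce_lines(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive lines that share a key.

    Relies on the input being sorted by key, which Hadoop streaming
    guarantees between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(role: str) -> None:
    """Run one phase over standard input; role is 'map' or 'reduce'."""
    step = map_lines if role == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

A streaming job would invoke a script that calls `main("map")` as the mapper and `main("reduce")` as the reducer. Locally, the same flow can be simulated by sorting the mapper output before handing it to the reducer.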
Assessing archive quality using event model
• Approaches to textual and linguistic analysis of an archive
  – Frequent and important words in the whole collection
  – Important sentences: sentences that contain one or more frequent words
  – Frequent entities (locations and dates) extracted from important sentences
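The first two steps, frequent words and important sentences, can be sketched as follows. This is a simplified illustration and not the tool's code: it uses regular expressions for tokenization and sentence splitting to stay self-contained, and the stopword list, function names, and `min_hits` parameter are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumption: a tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "to", "is", "was", "for", "on"}


def tokenize(text: str) -> List[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def frequent_words(documents: Iterable[str], top_n: int = 10) -> List[str]:
    """Most frequent non-stopword terms across the whole collection."""
    counts = Counter(word
                     for doc in documents
                     for word in tokenize(doc)
                     if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def important_sentences(documents: Iterable[str],
                        frequent: Iterable[str],
                        min_hits: int = 1) -> List[Tuple[str, int]]:
    """Sentences containing at least min_hits distinct frequent words.

    Returns (sentence, hit count) pairs, highest hit count first.
    """
    frequent_set = set(frequent)
    selected = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            hits = len(frequent_set & set(tokenize(sentence)))
            if sentence and hits >= min_hits:
                selected.append((sentence, hits))
    selected.sort(key=lambda pair: -pair[1])
    return selected
```

The third step, entity extraction, would run a named entity recognizer over the selected sentences only, which keeps the more expensive analysis limited to text that is likely to describe the event.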
Assessing archive quality using event model
[Figure: event model extraction pipeline. Webpages → Text Extraction → Text Content → Frequency Analysis → Frequent Words. Text Content → Sentence Tokenization → Sentences → Keyword Matching → Selected Sentences → Named Entity Recognition → Event Entities → Aggregation → Event Model.]

Event Model
  Topic: (t1, t2, ..., tn)
  Location: (l1, l2, ..., ln)
  Date: (d1, d2, ..., dn)
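The final aggregation step, which turns per-sentence entities into the Topic/Location/Date event model, might look like the sketch below. The `EventModel` class, the `aggregate` function, and the choice to rank entities by frequency are assumptions for illustration; the entity dictionaries mirror the `DATE` and `LOCATION` labels used in the deck's example output.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class EventModel:
    """What (topic terms), where (locations), and when (dates)."""
    topic: Tuple[str, ...]
    location: Tuple[str, ...]
    date: Tuple[str, ...]


def aggregate(frequent_words: Iterable[str],
              sentence_entities: Iterable[Dict[str, List[str]]],
              top_n: int = 3) -> EventModel:
    """Combine frequent words and per-sentence entities into one model.

    Assumption: the most frequently mentioned locations and dates are
    taken to describe the event.
    """
    locations = Counter()
    dates = Counter()
    for entities in sentence_entities:
        locations.update(entities.get("LOCATION", []))
        dates.update(entities.get("DATE", []))

    def top(counter: Counter) -> Tuple[str, ...]:
        return tuple(value for value, _ in counter.most_common(top_n))

    return EventModel(tuple(frequent_words), top(locations), top(dates))
```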
Example
• Ebola Outbreak (22 documents)
• Top 10 frequent words and top 2 sentences that include 2 or more frequent words

Frequent Words: Ebola, Virus, Disease, Health, 2014, Africa, West, Ago, University, Outbreak

Important Sentences and Extracted Entities:
– Outbreak of Ebola virus disease in West Africa: third update, 1 August 2014. (7)
  DATE: ['August 2014'], LOCATION: ['West Africa']
– ECDC (2014) Outbreak of Ebola virus disease in West Africa. (7)
  LOCATION: [u'West Africa']
Archive Quality Assessment
• http://nick.dlib.vt.edu/EventModel/
• Input:
  – Existing collections, WARC file, Text file with list of URLs
• Frontend: HTML, Javascript/Dojo
• Backend: Python, NLTK
Sample Results
Future Work
• Use event model to:
  – Summarize event collection (generate most informative sentence)
  – Extract relevant parts from webpage
Thank You
Questions?
Mohamed Farag
Dr. Edward A. Fox
IDEAL Interface
• http://nick.dlib.vt.edu/ideal/collections/index.php
• Collections
  – 11 event categories, 2 events each (one small and one big)
  – Total 1.6 M documents
• Services:
  – Search (keywords, web collections text)
  – Browse (Event categories and events metadata, web and tweet collections)
Technologies
• Search engine
  – Solr 4.9 (http://lucene.apache.org/solr/)
• Web Interface
  – Apache server
  – JavaScript - Solr library (https://github.com/evolvingweb/ajax-solr/wiki)
• Tweets archiving
  – yourTwapperKeeper (https://github.com/540co/yourtwapperkeeper)
• Webpages archiving
  – Archive-it service from Internet Archive (https://archive-it.org/)
Collections

Category           Big                          Small
Accident           Train derailment in Quebec   Texas factory explosion
Bombing            Boston bombing               Somalia Blast
Community          Blacksburg events            Labor day and world cup 2014
Disease Outbreak   Ebola                        encephalitis
Earthquake         Turkey earthquake            Virginia earthquake and others
Fire               Brazil night club fire       Texas wild fire
Flood              Pakistan flood               China flood and Islip 13 inch rain
Hurricane          Hurricane Sandy              Typhoon Haiyan
Plane Crash        Russia Plane Crash           Nevada air race crash
Shooting           April 16 shooting            Norway shooting and others
Search Interface
Searching Sandy
Faceted Search: Search all events under Fire
Faceted Search: Search Brazil Night Club Fire
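A faceted search such as "all events under Fire" corresponds to a Solr select request with a filter query and facet fields. The sketch below only builds the request URL. The parameters `q`, `fq`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr query parameters, while the base URL, the core name, and the field names `category` and `event` are hypothetical, since the slides do not show the index schema.

```python
from typing import Dict, Iterable
from urllib.parse import urlencode


def build_faceted_query(base_url: str,
                        keywords: str,
                        filters: Dict[str, str],
                        facet_fields: Iterable[str],
                        rows: int = 10) -> str:
    """Build a Solr /select URL with filter queries and facet counts."""
    params = [
        ("q", keywords or "*:*"),   # match everything when no keywords are given
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    params += [("facet.field", field) for field in facet_fields]
    for field, value in filters.items():
        # Quote the value so multi-word names are treated as one phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{escaped}"'))
    return f"{base_url}/select?{urlencode(params)}"
```

For example, `build_faceted_query("http://localhost:8983/solr/ideal", "", {"category": "Fire"}, ["category", "event"])` restricts results to one category while still returning facet counts per event; adding `"event": "Brazil night club fire"` to the filters would narrow the search to a single event.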
Browse Interface
Select Event Type
Select Event
Hurricane Events