VIRGINIA TECH, BLACKSBURG
CS 4624
MUSTAFA ALY & GASPER GULOTTA
CLIENT: MOHAMED MAGDY
IDEAL Pages
BACKGROUND
The IDEAL Project aims to provide convenient access to webpages related to various types of disasters
Currently this information is stored in about 10TB of Web Archives
Need to extract this information efficiently and index it
Provide an easy-to-use interface for accessing it
IDEAL PROJECT – (BORROWED FROM SUNSHIN LEE)
SOLUTION APPROACH
Automate the process of:
• Extracting the Web Archives (.warc files)
• HTML parsing and indexing into Solr
Use Hadoop for distributed processing
Webpages for displaying Solr search results and sorting disasters by category
Make the process reusable on other archives
PROJECT ARCHITECTURE
Event crawled by Heritrix Crawler
WARC Files
Webpage Files
HTML Files
Solr
Hadoop
Interface
• Browsing
• Visualizing
• Categories
OUR ROLES
• .warc file extraction
• Filtering of HTML files
• Text extraction from HTML files
• Indexing information into Solr
• Map/Reduce script for Hadoop
WORK COMPLETED
• Set up Python environment
• Obtained a set of test .warc files
• Simplified the process of extracting a .warc file
• Identified HTML files from the resulting extraction
• Expanded the extraction process to handle multiple files/directories
• Extracted text from HTML files
• Indexed information into Solr
EXTRACTING WARC FILE
Integrated Hanzo Warc Tools (https://pypi.python.org/pypi/hanzo-warc-tools/0.2)
• Only takes one WARC file at a time
• Unpacks a WARC file into an HTTP folder and an HTTPS folder
• Creates a text file to be used later
Created a script to allow a full directory to be unpacked
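A minimal sketch of such a directory script is shown below. It only handles the file discovery; the per-file unpack step is delegated to a shell command, where the name `warcunpack` is a placeholder assumption for whichever entry point hanzo-warc-tools actually installs:

```python
import os
import subprocess

def find_warc_files(root):
    """Walk a directory tree and collect every .warc / .warc.gz file."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(('.warc', '.warc.gz')):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

def unpack_directory(root, unpack_cmd='warcunpack'):
    """Run the unpack command on every WARC file under root.

    `unpack_cmd` is a placeholder; substitute the actual command
    provided by the Hanzo tools, which takes one WARC file at a time.
    """
    for path in find_warc_files(root):
        subprocess.call([unpack_cmd, path])
```

Because the Hanzo tool only accepts one file per invocation, looping over `find_warc_files` is enough to cover an entire directory tree.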
WARC EXTRACTION EXAMPLE OUTPUT
warc_file    warc_con_len  warc_uri_date  warc_subject_uri       uri_content_type  outfile
fn.warc.gz   3284          2013-04-21     www.vt.edu/robots.txt  text/plain        /Users/http/robots/txt
fn2.warc.gz  1023          2013-04-21     www.vt.edu/cs.html     text/html         /Users/https/cs.html
fn3.warc.gz  4983          2013-04-21     www.vt.edu/logo.png    image/jpg         /Users/http/logo.png
HTML EXTRACTION
Find HTML documents based on the uri_content_type column in the text file
Use the outfile column to locate where each file is and extract the HTML file
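The lookup described above can be sketched as follows, assuming the text file produced during extraction is whitespace-delimited with the six columns shown in the example output (the column order is taken from that slide):

```python
def find_html_entries(index_path):
    """Yield (subject_uri, outfile) pairs for records whose
    uri_content_type column marks them as HTML."""
    with open(index_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 6:
                continue  # skip the header row and malformed lines
            warc_file, con_len, uri_date, subject_uri, content_type, outfile = fields
            if content_type == 'text/html':
                yield subject_uri, outfile
```

Each yielded `outfile` path then points at the unpacked file whose HTML can be extracted.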
INDEXING FILES INTO SOLR
Text extracted from HTML files using BeautifulSoup4 (http://www.crummy.com/software/BeautifulSoup/)
Indexed into Solr using solrpy (https://code.google.com/p/solrpy/)
Use fields for:
• id
• content
• collection_id
• event
• event_type
• URL
• wayback_URL
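A hedged sketch of this step, using BeautifulSoup4 for text extraction and solrpy for indexing as named above. The Solr URL is a placeholder, and the field names come from the list on this slide:

```python
from bs4 import BeautifulSoup

# Placeholder; point this at the project's actual Solr instance.
SOLR_URL = 'http://localhost:8983/solr'

def extract_text(html):
    """Strip tags from an HTML document using BeautifulSoup4."""
    return BeautifulSoup(html, 'html.parser').get_text()

def build_doc(doc_id, html, collection_id, event, event_type, url, wayback_url):
    """Assemble a Solr document with the fields listed above."""
    return {
        'id': doc_id,
        'content': extract_text(html),
        'collection_id': collection_id,
        'event': event,
        'event_type': event_type,
        'URL': url,
        'wayback_URL': wayback_url,
    }

def index_doc(doc):
    """Send one document to Solr via solrpy and commit it."""
    import solr  # provided by the solrpy package
    conn = solr.SolrConnection(SOLR_URL)
    conn.add(**doc)
    conn.commit()
```

Separating `build_doc` from `index_doc` keeps the field mapping testable without a running Solr instance.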
WORK REMAINING
Work with client to integrate the process with Hadoop
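One common way to run an existing Python pipeline on Hadoop is Hadoop Streaming, where the mapper reads records on stdin and writes tab-separated key/value pairs on stdout. The sketch below is illustrative only, not the project's actual Map/Reduce script: it assumes the six-column index lines shown earlier as input and counts records per content type.

```python
import sys

def map_line(line):
    """Map one index line to a (content_type, 1) pair, or None if malformed."""
    fields = line.split()
    if len(fields) != 6:
        return None
    content_type = fields[4]
    return content_type, 1

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Emit key<TAB>value lines in the format Hadoop Streaming expects."""
    for line in stdin:
        pair = map_line(line)
        if pair is not None:
            stdout.write('%s\t%d\n' % pair)

if __name__ == '__main__':
    main()
```

A matching reducer would sum the counts per key; Hadoop Streaming handles the shuffle and sort between the two stages.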
QUESTIONS?