Centipede: Analyzing Web Crawl Data for Context of a Location

Vikas Bansal, Primal Pappachan, Abhishek Sethi

Introduction

Description

A web service that presents the context associated with a location

Context of a location

1. Weather
2. Healthcare
3. Crime
4. Employment
5. ……

Customers

1. People moving or travelling to a new place
2. Policy makers
3. Journalists
4. Researchers

Scenario

Related Services

● Yelp
● Google News
● http://bestplaces.net/
● http://www.nycgo.com/events/
● http://www.stubhub.com/

Technical Description of Service

● Analyze the web crawl data
● Create a list of locations
● Filter the top 100 words from the files that mention a location from the list
● Build an index of each location against the list of words corresponding to that location
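The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual pipeline: the function name `build_location_index`, the tokenizer regex, and the `stopwords` parameter are hypothetical, and the whole thing runs in memory rather than on a cluster.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: runs of lowercase letters and apostrophes.
TOKEN = re.compile(r"[a-z']+")


def build_location_index(documents, locations, top_n=100, stopwords=frozenset()):
    """Map each location to the top_n words from documents that mention it."""
    # One whole-word pattern per location, so multi-word names also match.
    patterns = {
        loc.lower(): re.compile(r"\b" + re.escape(loc.lower()) + r"\b")
        for loc in locations
    }
    counts = defaultdict(Counter)
    for text in documents:
        lowered = text.lower()
        tokens = TOKEN.findall(lowered)
        for loc, pattern in patterns.items():
            if pattern.search(lowered):
                # Do not count the words of the location name itself.
                own_words = set(loc.split())
                counts[loc].update(
                    t for t in tokens if t not in own_words and t not in stopwords
                )
    # Keep only the most frequent words for each location.
    return {loc: [w for w, _ in c.most_common(top_n)] for loc, c in counts.items()}
```

For example, `build_location_index(pages, ["Baltimore", "New York"], top_n=100)` would return a dictionary keyed by the lowercased location names; locations that never appear in `pages` are simply absent from the result.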

System Architecture

Data Sources

• Common Crawl Data from Amazon S3
  – Contains information on billions of web pages
  – Search through the contents
  – Use ARC and Text files

Technologies and Resources

● Hadoop Cluster on Bluegrit System
● Apache Pig
  ○ Python for UDFs
● Java/PHP for front end development
  ○ Use a JBoss container for Java, XAMPP for PHP
● Elastic Search
● MapReduce
● SQL/NoSQL database
● REST
● WSDL 2.0
● AWS - RDS, R53, EC2
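As an illustration of the "Python for UDFs" item, here is a minimal sketch of what a Pig UDF written in Python might look like. The module name `udfs.py` and the function `normalize_word` are hypothetical. The `outputSchema` decorator is supplied by Pig when a script is registered with Jython; the fallback definition below is an assumption added only so the file also runs under plain Python for local testing.

```python
# udfs.py (hypothetical module name)
try:
    outputSchema  # supplied by Pig when the script is registered with Jython
except NameError:
    def outputSchema(schema):
        """No-op stand-in so the module can be tested outside Pig."""
        def decorator(func):
            return func
        return decorator


@outputSchema("word:chararray")
def normalize_word(word):
    """Lowercase a token and strip surrounding punctuation.

    Returns None for null or empty input so Pig can filter those rows out.
    """
    if word is None:
        return None
    cleaned = word.strip().strip(".,;:!?\"'()[]").lower()
    return cleaned or None
```

In a Pig script, a file like this would typically be loaded with `REGISTER 'udfs.py' USING jython AS udfs;` and then called as `udfs.normalize_word(token)`.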

MapReduce Job

Splitter
● Sentence
● Paragraph
● Article
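One reading of this slide is that the splitter decides how much surrounding text counts as context for a location mention: a sentence, a paragraph, or the whole article. The sketch below assumes that reading; the function name `split_text` and the splitting rules (blank lines for paragraphs, end punctuation for sentences) are illustrative assumptions, and a real job would call something like this inside the map step.

```python
import re


def split_text(article, level="sentence"):
    """Split an article into units at the requested granularity."""
    if level == "article":
        units = [article]
    elif level == "paragraph":
        # Paragraphs are assumed to be separated by one or more blank lines.
        units = re.split(r"\n\s*\n", article)
    elif level == "sentence":
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        units = re.split(r"(?<=[.!?])\s+", article)
    else:
        raise ValueError("level must be 'sentence', 'paragraph' or 'article'")
    # Drop empty fragments and trim surrounding whitespace.
    return [u.strip() for u in units if u.strip()]
```

Smaller units tie each word more tightly to the location it appears with, at the cost of discarding words that occur elsewhere in the same article.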

Elastic Search

● Distributed RESTful search and analytics
● Near real-time search
● Resilient clusters that detect and remove failed nodes

Challenges and Limitations

• Amount of HDD space available
• Learning new technologies such as Apache Pig and WSDL
• Creating special UDFs in Python

Timeline

References

● Data set: https://commoncrawl.atlassian.net/wiki/display/CRWL/About+the+Data+Set
● Common Crawl Web data: http://commoncrawl.org/get-started/
● Elastic Search
● Apache Pig: http://pig.apache.org/
● Elastic Search for Term Filter lookup: http://www.elasticsearch.org/blog/terms-filter-lookup/
● Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/module1.html
● Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters." Communications of the ACM 51.1 (2008): 107-113.
● Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.
