Centipede: Analyzing Web Crawl data for context of a location

Vikas Bansal, Primal Pappachan, Abhishek Sethi

Introduction

Description

A web service that presents the context associated with a location

Context of a location

1. Weather
2. Healthcare
3. Crime
4. Employment
5. ……

Customers

1. People moving or travelling to a new place
2. Policy makers
3. Journalists
4. Researchers

Scenario

Related Services

● Yelp
● Google News
● http://bestplaces.net/
● http://www.nycgo.com/events/
● http://www.stubhub.com/

Technical Description of Service

● Analyze the web crawl data
● Create a list of locations
● Filter the top 100 words from the files that mention a location from the list
● Build an index that maps each location to the list of words corresponding to that location
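
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's actual Pig/Hadoop job: the location list, the stop-word set, and the function names are hypothetical, locations are assumed to be single words, and stop-word removal is an added assumption that the slides do not mention.

```python
import re
from collections import Counter, defaultdict

# Hypothetical inputs for illustration only.
LOCATIONS = ["baltimore", "boston"]
STOP_WORDS = {"the", "a", "in", "is", "of", "and", "to", "has"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_location_index(documents, locations=LOCATIONS, top_n=100):
    """Map each location to the top_n most frequent words found in
    the documents that mention it."""
    counts = defaultdict(Counter)
    for text in documents:
        tokens = tokenize(text)
        token_set = set(tokens)
        for location in locations:
            if location in token_set:
                # Count every non-stop-word except the location itself.
                counts[location].update(
                    t for t in tokens
                    if t not in STOP_WORDS and t != location
                )
    return {
        location: [word for word, _ in counter.most_common(top_n)]
        for location, counter in counts.items()
    }
```

Calling `build_location_index(docs)` on a list of page texts returns a dictionary keyed by location. On the real crawl the same counting would have to run as a distributed job, since the data does not fit on one machine.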

System Architecture

Data Sources

• Common Crawl data from Amazon S3
– Contains information on billions of web pages
– Search through the contents
– Use ARC and text files
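
As a rough illustration of what reading an ARC file involves, the sketch below parses the header line that precedes each archived page. It assumes the version 1 ARC URL-record layout (URL, IP address, archive date, content type, length, separated by spaces). The class and function names are hypothetical, and decompressing the file and reading the record body are left out.

```python
from typing import NamedTuple


class ArcRecordHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str   # 14-digit timestamp, YYYYMMDDhhmmss
    content_type: str
    length: int         # size of the record body in bytes


def parse_arc_header(line: str) -> ArcRecordHeader:
    """Parse one ARC v1 URL-record header line.

    The last four fields never contain spaces, so splitting from the
    right keeps any stray space inside the URL field.
    """
    parts = line.strip().rsplit(" ", 4)
    if len(parts) != 5:
        raise ValueError("expected 5 space-separated fields: %r" % line)
    url, ip_address, archive_date, content_type, length = parts
    return ArcRecordHeader(url, ip_address, archive_date,
                           content_type, int(length))
```

The `length` field tells a reader how many bytes to consume before the next header, which is what lets a job skip non-text records cheaply.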

Technologies and Resources

● Hadoop cluster on the Bluegrit system
● Apache Pig
○ Python for UDFs
● Java/PHP for front end development
○ Use a JBoss container for Java, XAMPP for PHP
● Elasticsearch
● MapReduce
● SQL/NoSQL database
● REST
● WSDL 2.0
● AWS: RDS, Route 53, EC2
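
To show how Python fits in as a Pig UDF, here is a minimal sketch of one. It assumes the script is registered with Pig's Jython support, which makes the `outputSchema` decorator available; the fallback definition only exists so the file also runs as ordinary Python. The function name, the location set, and the script name in the comment are hypothetical.

```python
# Pig injects outputSchema when a script is registered with "using jython".
# Outside Pig we define a no-op stand-in so the module still imports.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

# Hypothetical location list; a real job would load this from a file.
KNOWN_LOCATIONS = set(["baltimore", "boston"])


@outputSchema("location:chararray")
def first_location(text):
    """Return the first known location mentioned in text, or None."""
    if text is None:
        return None
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")
        if word in KNOWN_LOCATIONS:
            return word
    return None

# In a Pig script this might be used roughly as follows
# (file and alias names are assumptions):
#   REGISTER 'centipede_udfs.py' USING jython AS centipede;
#   tagged = FOREACH pages GENERATE centipede.first_location(body);
```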

MapReduce Job

Splitter
● Sentence
● Paragraph
● Article
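
The slide names three splitting granularities but not how the splitter works, so the following is only a sketch under assumptions: sentences end at `.`, `!` or `?`, paragraphs are separated by blank lines, and an article is the whole text. The function names are hypothetical. The second function plays the role of the map step, emitting `(location, chunk)` pairs that a reducer could aggregate per location.

```python
import re


def split_text(text, unit="sentence"):
    """Split text into chunks at the requested granularity."""
    if unit == "article":
        parts = [text]
    elif unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif unit == "sentence":
        # Break after sentence-ending punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError("unknown unit: %r" % unit)
    return [p.strip() for p in parts if p.strip()]


def map_location_mentions(text, locations, unit="sentence"):
    """Map step: yield (location, chunk) for each chunk that mentions
    a location (simple case-insensitive substring match)."""
    for chunk in split_text(text, unit):
        lowered = chunk.lower()
        for location in locations:
            if location in lowered:
                yield location, chunk
```

A smaller unit keeps the words closer to the location mention, while a larger unit gives more words per location at the cost of more unrelated text.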

Elasticsearch

● Distributed RESTful search and analytics.
● Near real-time search.
● Resilient clusters: detect and remove failed nodes.
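
Because Elasticsearch is driven over REST with JSON bodies, the location index could be stored and queried with payloads like the ones built below. This is a sketch only: the field names `location` and `words` are assumptions, no client library or running cluster is used, and the code just constructs the JSON that would be sent.

```python
import json


def make_index_document(location, words):
    """Build the JSON document stored for one location
    (field names are hypothetical)."""
    return {"location": location, "words": list(words)}


def make_term_query(location):
    """Build a search body that matches documents whose `location`
    field holds exactly this term."""
    return {"query": {"term": {"location": location}}}


if __name__ == "__main__":
    doc = make_index_document("baltimore", ["crime", "jobs", "weather"])
    print(json.dumps(doc))
    print(json.dumps(make_term_query("baltimore")))
```

A term query does exact matching without analysis, so it suits a field that holds a normalized location name rather than free text.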

Challenges and Limitations

• Amount of HDD space available.
• Learning new technologies such as Apache Pig, WSDL, etc.
• Creating special UDFs in Python.

Timeline
