An introduction to Storm Crawler


An introduction to Storm Crawler

Julien Nioche – julien@digitalpebble.com
@digitalpebble

Java Meetup – Bristol, 10/03/2015

2 / 31

About myself

DigitalPebble Ltd, Bristol (UK)

Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

Strong focus on Open Source & Apache ecosystem

User | Contributor | Committer
– Nutch
– Tika
– SOLR, Lucene
– GATE, UIMA
– Behemoth

3 / 31

Storm-Crawler: what is it?

Collection of resources (SDK) for building web crawlers on Apache Storm

https://github.com/DigitalPebble/storm-crawler
– Artefacts available from Maven Central
– Apache License v2
– Active and growing fast

=> Scalable
=> Low latency
=> Easily extensible
=> Polite

4 / 31

What it is not

A ready-to-use, feature-complete web crawler
– Will provide that as a separate project using S/C later

e.g. no PageRank or explicit ranking of pages
– Build your own

No fancy UI, dashboards, etc.
– Build your own or integrate into your existing ones

Comparison with Apache Nutch later

5 / 31

Apache Storm

Distributed real time computation system

http://storm.apache.org/

Scalable, fault-tolerant, polyglot and fun!

Implemented in Clojure + Java

“Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”

6 / 31

Storm Cluster Layout

http://cdn.oreillystatic.com/en/assets/1/event/85/Realtime%20Processing%20with%20Storm%20Presentation.pdf

7 / 31

Topology = spouts + bolts + tuples + streams
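To put the vocabulary into code: a bolt receives tuples from spouts or other bolts and can emit new tuples on one or more named streams. Below is a minimal sketch; the class is purely illustrative, the incoming field name "url" is an assumption, and the imports assume a recent Apache Storm release (the 2015-era API lived under backtype.storm rather than org.apache.storm).

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Illustrative bolt: consumes tuples and emits new ones, on the default
    // stream and on an extra named stream, to show that a bolt can have
    // several output streams.
    public class EchoBolt extends BaseRichBolt {

        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            // assumes the upstream component declared a field called "url"
            String url = input.getStringByField("url");
            collector.emit(input, new Values(url));           // default stream, anchored to the input tuple
            collector.emit("audit", input, new Values(url));  // extra named stream
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url"));                 // default stream
            declarer.declareStream("audit", new Fields("url"));  // named stream
        }
    }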

8 / 31

Basic Crawl Topology

Some Spout → URLPartitioner → Fetcher → Parser → Indexer
(components connected via the default stream)

(1) Some Spout: pull URLs from an external source
(2) URLPartitioner: group per host / domain / IP
(3) Fetcher: fetch URLs
(4) Parser: parse content
(5) Indexer: index text and metadata

9 / 31

Basic Topology: Tuple perspective

Some Spout → URLPartitioner → Fetcher → Parser → Indexer
(components connected via the default stream)

(1) Spout output: <String url, Metadata m>
(2) URLPartitioner output: <String url, String key, Metadata m>
(3) Fetcher output: <String url, byte[] content, Metadata m>
(4) Parser output: <String url, byte[] content, Metadata m, String text>

10 / 31

Basic scenario

Spout: a simple queue, e.g. RabbitMQ, AWS SQS, etc.
Indexer: any form of storage, but often an index, e.g. SOLR or ElasticSearch

Components in grey (on the slide): default ones from SC

Use case: non-recursive crawls (i.e. no links discovered), URLs known in advance, failures don't matter too much

“Great! What about recursive crawls and / or failure handling?”

11 / 31

Frontier expansion

Manual “discovery”
– Adding new URLs by hand, “seeding”

Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful – control
– Requires content parsing and link extraction

[Diagram: crawl frontier expanding outwards from a seed over iterations i = 1, 2, 3]

[Slide courtesy of A. Bialecki]

12 / 31

Recursive Topology

Some Spout → URLPartitioner → Fetcher → Parser → Indexer
(components connected via the default stream)

[Diagram: the basic topology extended with a Status stream feeding a Status Updater, plus a Switch and a DB]

13 / 31

Status stream

Used for handling errors and updating info about URLs
Newly discovered URLs from the parser

public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}

StatusUpdater: writes to storage
Can / should extend AbstractStatusUpdaterBolt
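As a rough illustration, here is a bolt sitting at the end of the status stream. It is a simplified stand-in (a real implementation would extend AbstractStatusUpdaterBolt and persist to a proper backend), the tuple field names "url" and "status" are assumptions, and the imports assume a recent Apache Storm release.

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    // Simplified stand-in for a status updater: a real one would extend
    // AbstractStatusUpdaterBolt and write to proper storage.
    public class LoggingStatusUpdater extends BaseRichBolt {

        private OutputCollector collector;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            // field names "url" and "status" are assumptions
            String url = tuple.getStringByField("url");
            Object status = tuple.getValueByField("status");  // DISCOVERED, FETCHED, FETCH_ERROR, ...
            System.out.println(status + "\t" + url);          // replace with a write to your DB / index
            collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // terminal bolt: no output stream to declare
        }
    }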

14 / 31

Error vs failure in SC

Failure: structural issue, e.g. dead node
– Storm's mechanism based on time-out
– Spout methods ack() / fail() get the message ID
  • can decide whether to replay the tuple or not
  • message ID only => limited info about what went wrong

Error: data issue
– Some temporary, e.g. target server temporarily unavailable
– Some fatal, e.g. document not parsable
– Handled via the Status stream
– StatusUpdater can decide what to do with them
  • Change status? Change scheduling?

15 / 31

Topology in code

Grouping!

Politeness and data locality
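The code shown on this slide is not in the transcript, but here is a sketch in the same spirit. The spout (UrlQueueSpout) and indexing bolt (IndexerBolt) are hypothetical placeholders, the S/C bolt class names and the ".bolt" package are assumed from the 0.5-era layout, and the Storm imports assume a recent release. The key points are the fieldsGrouping on the partition key, so that all URLs of the same host / domain / IP land on the same Fetcher task (politeness), and localOrShuffleGrouping, which keeps tuples in the same worker where possible (data locality).

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    // package names assumed from the 0.5-era layout
    import com.digitalpebble.storm.crawler.bolt.FetcherBolt;
    import com.digitalpebble.storm.crawler.bolt.ParserBolt;
    import com.digitalpebble.storm.crawler.bolt.URLPartitionerBolt;

    public class CrawlTopologySketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            builder.setSpout("spout", new UrlQueueSpout());          // hypothetical spout emitting <url, metadata>

            builder.setBolt("partitioner", new URLPartitionerBolt())
                   .shuffleGrouping("spout");

            // fieldsGrouping on the partition key: all URLs of the same
            // host / domain / IP go to the same Fetcher task => politeness
            builder.setBolt("fetcher", new FetcherBolt())
                   .fieldsGrouping("partitioner", new Fields("key"));

            builder.setBolt("parser", new ParserBolt())
                   .localOrShuffleGrouping("fetcher");               // data locality: stay in the same worker when possible

            builder.setBolt("indexer", new IndexerBolt())            // hypothetical indexing bolt
                   .localOrShuffleGrouping("parser");

            new LocalCluster().submitTopology("crawl", new Config(), builder.createTopology());
        }
    }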

16 / 31

Launching the Storm Crawler

Modify resources in src/main/resources
– URLFilters, ParseFilters, etc.

Build the uber-jar with Maven
– mvn clean package

Set the configuration
– YAML file (e.g. crawler-conf.yaml)

Call Storm:
storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml

17 / 31

18 / 31

Comparison with Nutch

19 / 31

Comparison with Nutch

Nutch is batch driven: little control over when URLs are fetched
– Potential issue for use cases that need sessions
– latency++

Fetching is only one of the steps in Nutch
– SC: 'always be fetching' (Ken Krugler); better use of resources

Even more flexible
– Typical case: a few custom classes (at least a Topology), the rest are just dependencies and standard S/C components

Not ready-to-use like Nutch: it is an SDK

Would not have existed without it
– Borrowed code and concepts

20 / 31

Overview of resources

https://www.flickr.com/photos/dipster1/1403240351/

21 / 31

Resources

Separated into core vs external

Core
– URLPartitioner
– Fetcher
– Parser
– SitemapParser
– …

External
– Metrics-related: incl. connector for Librato
– ElasticSearch: Indexer, Spout, StatusUpdater, connector for metrics

Also user-maintained external resources

22 / 31

FetcherBolt

Multi-threaded

Polite (see the toy sketch below)
– Puts incoming tuples into internal queues based on IP / domain / hostname
– Sets a delay between requests from the same queue
– Respects robots.txt

Protocol-neutral
– Protocol implementations are pluggable
– HTTP implementation taken from Nutch
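The politeness logic boils down to keying fetches by hostname (or domain / IP) and spacing out requests taken from the same queue. A toy illustration of that idea, not the FetcherBolt's actual code:

    import java.net.URL;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy illustration of per-host politeness: remember when each host was
    // last fetched and refuse a fetch that would come too soon after it.
    public class PolitenessGate {

        private final long delayMs;
        private final Map<String, Long> lastRequest = new ConcurrentHashMap<>();

        public PolitenessGate(long delayMs) {
            this.delayMs = delayMs;
        }

        // true if the URL's host may be fetched now, false if it has to wait
        public boolean tryAcquire(String url) throws Exception {
            String host = new URL(url).getHost();
            long now = System.currentTimeMillis();
            Long last = lastRequest.get(host);
            if (last != null && now - last < delayMs) {
                return false;  // too soon: leave the URL in its queue
            }
            lastRequest.put(host, now);
            return true;
        }
    }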

23 / 31

ParserBolt

Based on Apache Tika

Supports most commonly used doc formats
– HTML, PDF, DOC, etc.

Calls ParseFilters on the document
– e.g. scrape info with XPathFilter

Calls URLFilters on outlinks
– e.g. normalize and / or blacklist URLs based on RegExps

24 / 31

Other resources

URLPartitionerBolt
– Generates a key based on the hostname / domain / IP of the URL
– Output:
  • String URL
  • String key
  • String metadata
– Useful for fieldsGrouping

URLFilters
– Control crawl expansion
– Delete or rewrite URLs
– Configured in a JSON file
– Interface (illustrated below):
  • String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
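As an illustration, a sketch of a filter built around that method signature, dropping everything that is not http/https. It is not wired to the real URLFilter interface (which also has JSON-based configuration), and the Metadata import assumes the 0.5-era package layout.

    import java.net.URL;

    // Metadata import assumes the 0.5-era package layout
    import com.digitalpebble.storm.crawler.Metadata;

    // Sketch of a URL filter: return null to drop a URL, or return a
    // (possibly rewritten / normalized) string to keep it. sourceUrl and
    // sourceMetadata describe the page the link was found on.
    public class HttpOnlyFilter {

        public String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter) {
            if (urlToFilter.startsWith("http://") || urlToFilter.startsWith("https://")) {
                return urlToFilter;  // keep (a real filter could also normalize here)
            }
            return null;             // filter the URL out
        }
    }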

25 / 31

Other resources

ParseFilters (sketch below)
– Called from the ParserBolt
– Extracts information from the document
– Configured in a JSON file
– Interface:
  • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
  • XPath expressions
  • Extracted info is stored in the metadata
  • Used later for indexing
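A sketch in the spirit of the XpathFilter: pull the page title out of the DOM with an XPath expression and store it in the metadata for indexing later. The class is illustrative, the real ParseFilter interface contract is not reproduced, and the Metadata.setValue() helper and the element-name case are assumptions.

    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathFactory;

    import org.w3c.dom.DocumentFragment;

    import com.digitalpebble.storm.crawler.Metadata;

    // Sketch of a parse filter: extract the page title with an XPath
    // expression and store it in the metadata so an indexing bolt can
    // use it later. A real filter would implement the ParseFilter
    // interface and read its expressions from the JSON configuration.
    public class TitleExtractor {

        private final XPath xpath = XPathFactory.newInstance().newXPath();

        public void filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata) {
            try {
                // element-name case depends on how the DOM was built
                String title = xpath.evaluate("//TITLE", doc);
                if (title != null && !title.isEmpty()) {
                    metadata.setValue("title", title.trim());  // assumes a setValue(key, value) helper
                }
            } catch (Exception e) {
                // a broken document should not fail the whole tuple
            }
        }
    }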

26 / 31

Integrate it!

Write the Spout for your use case (a minimal sketch follows below)
– Will work fine with the existing resources as long as it emits a URL and Metadata

Typical scenario
– Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
– Write a Spout for it and throttle with topology.max.spout.pending
– So that politeness can be enforced without tuples timing out and failing
– Parse and extract
– Send new URLs to the queues

Can use various forms of persistence for URLs
– ElasticSearch, DynamoDB, HBase, etc.
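A minimal sketch of such a Spout, with imports assuming a recent Apache Storm release: it emits <url, metadata> tuples from an in-memory queue (a stand-in for SQS / Kafka), uses the URL as the message ID, and puts failed URLs back into the queue. The metadata is passed as a plain string here rather than a proper Metadata object.

    import java.util.Map;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    // Minimal spout sketch: emits URLs from an in-memory queue (a stand-in
    // for SQS / Kafka) and puts failed URLs back into the queue.
    public class InMemoryUrlSpout extends BaseRichSpout {

        private SpoutOutputCollector collector;
        private final Queue<String> toFetch = new ConcurrentLinkedQueue<>();

        @Override
        public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            toFetch.add("http://storm.apache.org/");  // seed
        }

        @Override
        public void nextTuple() {
            String url = toFetch.poll();
            if (url == null) {
                return;  // nothing to emit right now
            }
            // the URL doubles as the message ID so ack() / fail() can identify it;
            // the metadata is a plain string here rather than a proper Metadata object
            collector.emit(new Values(url, ""), url);
        }

        @Override
        public void ack(Object msgId) {
            // fully processed downstream: nothing to do
        }

        @Override
        public void fail(Object msgId) {
            // timed out or failed downstream: schedule it again
            toFetch.add((String) msgId);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("url", "metadata"));
        }
    }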

27 / 31

Some use cases (prototype stage)

Processing of streams of URLs (a natural fit for Storm)
– http://www.weborama.com

Monitoring of a finite set of URLs
– http://www.ontopic.io (more on them later)
– http://www.shopstyle.com : scraping + indexing

One-off non-recursive crawling
– http://www.stolencamerafinder.com/ : scraping + indexing

Recursive crawler
– http://www.shopstyle.com : discovery of product pages

28 / 31

What's next?

0.5 released soon?

All-in-one crawler based on ElasticSearch

– Also a good example of how to use SC

– Separate project

Additional Parse/URLFilters

Facilitate writing IndexingBolts

More tests and documentation

A nice logo or better name?

29 / 31

Resources

https://storm.apache.org/

“Storm Applied” – http://www.manning.com/sallen/

https://github.com/DigitalPebble/storm-crawler

http://nutch.apache.org/

30 / 31

Questions?

31 / 31
