Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)

Houston Hadoop Meetup, 2/12/14

Nutch + Hadoop with Selenium and Burp

By Mark Kerzner, Elephant Scale

Nutch story

• Created by Doug Cutting to crawl the web

• Not scalable

• Enter HDFS

• Nutch on HDFS

• Nutch on Hadoop

• Nutch 1.x, Nutch 2.x

Nutch 1.x

• Local or HDFS

• Command-line

• Crawl-db

Configuring Nutch

• Edit the file conf/regex-urlfilter.txt and replace

# accept anything else

+.

• Use a regular expression matching the domain you wish to crawl.

• For example, to crawl only the nutch.apache.org domain:

+^http://([a-z0-9]*\.)*nutch.apache.org/
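
To see what the rule above lets through, here is a minimal sketch in plain Java of first-match-wins filtering in the style of regex-urlfilter.txt. The class UrlFilterSketch and its methods are hypothetical names for illustration, not Nutch's own code. It assumes that a leading + accepts, a leading - rejects, the first matching rule decides, and a URL that matches no rule is dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of first-match-wins URL filtering; not Nutch's implementation. */
class UrlFilterSketch {

    /** One rule line: a sign ('+' accept, '-' reject) followed by a regular expression. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(String line) {
            this.accept = line.charAt(0) == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and comments such as "# accept anything else".
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(trimmed));
        }
    }

    /** Returns true if the first rule whose pattern is found in the URL is an accept rule. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumption: a URL that matches no rule is dropped
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "# crawl only the nutch.apache.org domain",
                "+^http://([a-z0-9]*\\.)*nutch.apache.org/"));
        for (String url : Arrays.asList(
                "http://nutch.apache.org/",
                "http://wiki.nutch.apache.org/page",
                "http://example.com/",
                "https://nutch.apache.org/")) {
            System.out.println(url + " -> " + filter.accepts(url));
        }
    }
}
```

Note that the rule starts with ^http://, so an https:// URL on the same domain does not match it. A site served over HTTPS would need its own rule.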

Nutch architecture

Solr integration

Solr Application (FreeEed, demo)

Scaling Nutch

• HDFS – scaling storage

• MapReduce – scale crawling

• Gora – scale back end
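
One way to picture "MapReduce – scale crawling" is a fetch list split into partitions, one per parallel task, with all URLs of a host landing in the same partition. The sketch below is plain Java under that assumption. HostPartitionSketch, partitionFor and partition are hypothetical names, and this is not Nutch's own partitioner.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: split a URL list into per-task partitions keyed by host. */
class HostPartitionSketch {

    /** Picks a partition in [0, numPartitions) from the URL's host name. */
    static int partitionFor(String url, int numPartitions) {
        // URI.create throws IllegalArgumentException on malformed input.
        String host = URI.create(url).getHost();
        String key = (host == null) ? url : host.toLowerCase(Locale.ROOT);
        // Mask the sign bit so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Groups URLs by partition number; every URL of a host ends up in one group. */
    static Map<Integer, List<String>> partition(List<String> urls, int numPartitions) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (String url : urls) {
            result.computeIfAbsent(partitionFor(url, numPartitions), k -> new ArrayList<>())
                  .add(url);
        }
        return result;
    }
}
```

Keeping a host inside one partition means a single task can pace its requests to that host, which is the usual reason for partitioning by host rather than by full URL.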

Gora

• Data Persistence: persisting objects to column stores such as HBase, Cassandra and Hypertable; key-value stores such as Voldemort and Redis; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS

• Data Access: Java-friendly API for accessing the data regardless of its location

• Indexing: Solr

• Analysis: Apache Pig, Apache Hive and Cascading

• MapReduce support

Passwords? – Oops!

1. Burp + HttpClient

2. Selenium + Java

Burp (with demo)

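HttpClient

// Apache HttpClient 4.3+ (org.apache.http.*); getUrl(), getHeaders() and
// getPostBody() are helpers defined elsewhere
try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
    HttpPost httpPost = new HttpPost(getUrl());
    // put in all custom headers
    Map<String, String> headers = getHeaders();
    for (Map.Entry<String, String> header : headers.entrySet()) {
        httpPost.addHeader(header.getKey(), header.getValue());
    }
    HttpEntity entity = new ByteArrayEntity(getPostBody().getBytes(StandardCharsets.UTF_8));
    httpPost.setEntity(entity);
    try (CloseableHttpResponse response = httpclient.execute(httpPost)) {
        // inspect response.getStatusLine() and response.getEntity() here
    }
}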
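
The snippet above relies on getUrl(), getHeaders() and getPostBody(), which the slide does not show. One possible way to fill them in, sketched here in plain Java, is to parse the raw text of a login request as copied from Burp. CapturedRequest and its members are hypothetical names, and the copy-from-Burp workflow is an assumption about the demo.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical holder for a raw HTTP request copied out of an intercepting proxy. */
class CapturedRequest {
    final String method;
    final String path;
    final Map<String, String> headers;
    final String body;

    private CapturedRequest(String method, String path,
                            Map<String, String> headers, String body) {
        this.method = method;
        this.path = path;
        this.headers = headers;
        this.body = body;
    }

    /** Parses "METHOD path HTTP/1.1", header lines, a blank line, then the body. */
    static CapturedRequest parse(String raw) {
        String text = raw.replace("\r\n", "\n"); // normalize line endings
        int split = text.indexOf("\n\n");        // blank line separates headers from body
        String head = (split >= 0) ? text.substring(0, split) : text;
        String body = (split >= 0) ? text.substring(split + 2) : "";

        String[] lines = head.split("\n");
        String[] requestLine = lines[0].trim().split(" ");
        if (requestLine.length < 2) {
            throw new IllegalArgumentException("Not a request line: " + lines[0]);
        }

        // Header names are case-insensitive, so look them up that way.
        Map<String, String> headers = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon <= 0) {
                continue; // skip anything that is not "Name: value"
            }
            headers.put(lines[i].substring(0, colon).trim(),
                        lines[i].substring(colon + 1).trim());
        }
        // Assumption: the HTTP client computes Content-Length from the entity it sends.
        headers.remove("Content-Length");
        return new CapturedRequest(requestLine[0], requestLine[1], headers, body);
    }

    /** Rebuilds the absolute URL from the Host header; the scheme is supplied by the caller. */
    String url(String scheme) {
        return scheme + "://" + headers.get("Host") + path;
    }
}
```

With this in place, getUrl() could return url("https"), getHeaders() the headers map, and getPostBody() the body field.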

Browser interaction? – Oops!

Selenium

Selenium + Java

Selenium (with demo)

WebDriver driver = new FirefoxDriver();
// Go to the login page
driver.get("https://mysite.com");
// put in the username
WebElement query = driver.findElement(By.name("username-element"));
query.sendKeys("your-user-name");
// put in the password
query = driver.findElement(By.name("password-element"));
query.sendKeys("real-password");
((JavascriptExecutor) driver).executeScript("javascript:whatever-login-script();");
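
Once the browser is logged in, the session usually lives in cookies. A possible next step, not shown on the slides, is to hand those cookies to the crawler as a Cookie request header. The sketch below is plain Java; CookieHeaderSketch and toCookieHeader are hypothetical names. With Selenium, the name/value pairs could be read from driver.manage().getCookies() via getName() and getValue().

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: turn session cookies into a single Cookie request header value. */
class CookieHeaderSketch {

    /** Joins name=value pairs with "; " in the map's iteration order. */
    static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Placeholder values; real ones would come from the logged-in browser session.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("theme", "dark");
        System.out.println("Cookie: " + toCookieHeader(cookies));
    }
}
```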
