Houston Hadoop Meetup 2/12/14 Nutch + Hadoop with Selenium and Burp By Mark Kerzner, Elephant Scale

Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)

DESCRIPTION

Presented at Houston Hadoop Meetup in March '14

Nutch story

• Created by Doug Cutting to crawl the web

• Not scalable

• Enter HDFS

• Nutch on HDFS

• Nutch on Hadoop

• Nutch 1.x, Nutch 2.x

Nutch 1.x

• Local or HDFS

• Command-line

• Crawl-db

Configuring Nutch

• Edit the file conf/regex-urlfilter.txt and replace the default rule

# accept anything else
+.

• with a regular expression matching the domain you wish to crawl.

• For example, to crawl only the nutch.apache.org domain:

+^http://([a-z0-9]*\.)*nutch.apache.org/
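To make the effect of such a rule concrete, here is a minimal sketch of how a list of `+`/`-` regex rules can be evaluated: rules are tried in order, the first pattern found in the URL decides, and a URL that matches nothing is rejected. The class `UrlFilterSketch` and its methods are hypothetical names made up for this illustration. It is a plain-Java stand-in, not Nutch's own filter class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative stand-in for a regex URL filter; not Nutch's actual class. */
public class UrlFilterSketch {

    /** One rule: accept ('+') or reject ('-') URLs in which the pattern is found. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines in the regex-urlfilter.txt style, skipping blanks and '#' comments. */
    public UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first rule whose pattern is found in the URL decides; no match means reject. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(Arrays.asList(
                "# crawl only the nutch.apache.org domain",
                "+^http://([a-z0-9]*\\.)*nutch.apache.org/"));
        System.out.println(filter.accepts("http://nutch.apache.org/downloads.html")); // true
        System.out.println(filter.accepts("http://wiki.nutch.apache.org/"));          // true
        System.out.println(filter.accepts("http://example.com/"));                    // false
    }
}
```

Note that the rule as written starts with `^http://`, so an `https://` URL on the same domain is rejected by this sketch as well.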

Nutch architecture

Solr integration

Solr Application (FreeEed, demo)

Scaling Nutch

• HDFS – scaling storage

• MapReduce – scaling crawling

• Gora – scaling the back end

Gora

• Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL, HSQLDB; flat files in the local file system or Hadoop HDFS

• Data Access : Java-friendly API for accessing the data regardless of its location

• Indexing : Solr

• Analysis : Apache Pig, Apache Hive and Cascading

• MapReduce support

Passwords? – Oops!

1. Burp + HttpClient

2. Selenium + Java

Burp (with demo)

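HttpClient

// Apache HttpClient 4.3+ (org.apache.http.*); getUrl(), getHeaders(),
// getPostBody() and response are defined outside this excerpt
CloseableHttpClient httpclient = HttpClients.createDefault();
try {
    HttpPost httpPost = new HttpPost(getUrl());
    // put in all custom headers
    Map<String, String> headers = getHeaders();
    for (Map.Entry<String, String> header : headers.entrySet()) {
        httpPost.addHeader(header.getKey(), header.getValue());
    }
    // send the POST body as UTF-8 bytes
    HttpEntity entity = new ByteArrayEntity(getPostBody().getBytes("UTF-8"));
    httpPost.setEntity(entity);
    response = httpclient.execute(httpPost);
    // ... read the response here, before the client is closed (not shown on the slide)
} finally {
    httpclient.close();
}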
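The same idea, a POST that carries a fixed set of headers and a fixed body, can also be sketched with the HTTP client built into the JDK (`java.net.http`, Java 11+), which lets the example run without extra libraries. This is a swapped-in client, not the Apache HttpClient used on the slide. The class `ReplayRequestSketch`, the `/login` path, the header values and the form body are all placeholders assumed for illustration. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReplayRequestSketch {

    /** Builds a POST request carrying the given headers and body; nothing is sent. */
    public static HttpRequest buildPost(String url, Map<String, String> headers, String body) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url));
        for (Map.Entry<String, String> header : headers.entrySet()) {
            // The JDK client rejects a few restricted names such as Host and Content-Length
            builder.header(header.getKey(), header.getValue());
        }
        return builder
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder headers and body; real values would come from a captured request
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/x-www-form-urlencoded");
        headers.put("Cookie", "SESSIONID=placeholder");
        HttpRequest request = buildPost(
                "https://mysite.com/login",
                headers,
                "username=your-user-name&password=real-password");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(request.headers().map());
    }
}
```

To actually send the request, it could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.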

Browser interaction? – Oops!

Selenium

Selenium + Java

Selenium (with demo)

WebDriver driver = new FirefoxDriver();
// Go to the login page
driver.get("https://mysite.com");
// put in the username
WebElement query = driver.findElement(By.name("username-element"));
query.sendKeys("your-user-name");
// put in the password
query = driver.findElement(By.name("password-element"));
query.sendKeys("real-password");
// run the page's own login script
((JavascriptExecutor) driver).executeScript("javascript:whatever-login-script();");
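Once the browser has logged in, a crawler that does not drive a browser still needs the session. One possible way to hand it over, which is an assumption here and not something the slides show, is to copy the browser's cookies into a `Cookie` request header. With Selenium the name/value pairs would typically be read from `driver.manage().getCookies()`. The sketch below uses a plain map with made-up cookie names so that it runs without a browser, and `CookieHeaderSketch` is a hypothetical helper.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CookieHeaderSketch {

    /** Joins name/value pairs into a single Cookie request header value. */
    public static String toCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Hypothetical cookies; with Selenium each pair would come from
        // a Cookie object's getName() and getValue()
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("remember", "true");
        System.out.println(toCookieHeader(cookies)); // prints JSESSIONID=abc123; remember=true
    }
}
```

The resulting string could then be added as one of the custom headers on the requests the crawler sends.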