Presented at Houston Hadoop Meetup in March '14
Houston Hadoop Meetup, 2/12/14
Nutch + Hadoop with Selenium and Burp
By Mark Kerzner, Elephant Scale
Nutch story
• Created by Doug Cutting to crawl the web
• Not scalable
• Enter HDFS
• Nutch on HDFS
• Nutch on Hadoop
• Nutch 1.x, Nutch 2.x
Nutch 1.x
• Local or HDFS
• Command-line
• Crawl-db
Configuring Nutch
• Edit the file conf/regex-urlfilter.txt and replace
    # accept anything else
    +.
• with a regular expression matching the domain you wish to
crawl.
• For example, to crawl only the nutch.apache.org domain:
    +^http://([a-z0-9]*\.)*nutch.apache.org/
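To see which URLs the example rule would let through, the pattern can be tried out with plain java.util.regex. This is only a sketch: the class name UrlFilterDemo and the sample URLs are made up for illustration, and it assumes the rule is applied as a partial match anchored at the start of the URL. It mimics the single "+" accept rule, not Nutch's full filter chain.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class UrlFilterDemo {
    // The rule from conf/regex-urlfilter.txt without its leading '+'.
    static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Assumption: a URL is accepted when the pattern is found in it.
    static boolean accepts(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://nutch.apache.org/",
            "http://wiki.nutch.apache.org/page",
            "https://nutch.apache.org/",
            "http://example.com/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + accepts(url));
        }
    }
}
```

Note that the rule as written starts with http://, so an https:// URL for the same host would not match it.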
Nutch architecture
Solr integration
Solr Application (FreeEed, demo)
Scaling Nutch
• HDFS – scaling storage
• MapReduce – scaling crawling
• Gora – scaling the back end
Gora
• Data persistence: persisting objects to column and key-value
stores such as HBase, Cassandra, Hypertable, Voldemort,
Redis, etc.; SQL databases such as MySQL and HSQLDB; flat
files in the local file system or Hadoop HDFS
• Data access: Java-friendly API for accessing the data
regardless of its location
• Indexing: Solr
• Analysis: Apache Pig, Apache Hive and Cascading
• MapReduce support
Passwords? – Oops!
1. Burp + HttpClient
2. Selenium + Java
Burp (with demo)
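HttpClient
// Apache HttpClient 4.3+ (org.apache.http.*); getUrl(), getHeaders() and
// getPostBody() are helpers that the slide does not show
CloseableHttpClient httpclient = HttpClients.createDefault();
try {
    HttpPost httpPost = new HttpPost(getUrl());
    // put in all custom headers
    Map<String, String> headers = getHeaders();
    for (Map.Entry<String, String> header : headers.entrySet()) {
        httpPost.addHeader(header.getKey(), header.getValue());
    }
    HttpEntity entity = new ByteArrayEntity(getPostBody().getBytes("UTF-8"));
    httpPost.setEntity(entity);
    CloseableHttpResponse response = httpclient.execute(httpPost);
    try {
        // read the status line and the response entity here
    } finally {
        response.close();
    }
} finally {
    httpclient.close();
}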
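The snippet above relies on getUrl(), getHeaders() and getPostBody(), which the slides do not show. One way such values could be obtained is by splitting a raw request, as copied out of an intercepting proxy such as Burp, into its request line, headers and body. The sketch below does that with the standard library only. The class name RawRequest and the sample request are hypothetical, and the parser is deliberately simple (for instance, a repeated header name keeps only the last value).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Burp or HttpClient.
class RawRequest {
    final String method;
    final String path;
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    RawRequest(String raw) {
        // Headers and body are separated by the first blank line.
        String normalized = raw.replace("\r\n", "\n");
        int split = normalized.indexOf("\n\n");
        String head = split >= 0 ? normalized.substring(0, split) : normalized;
        body = split >= 0 ? normalized.substring(split + 2) : "";

        String[] lines = head.split("\n");
        // Request line has the form: METHOD PATH VERSION
        String[] requestLine = lines[0].split(" ");
        method = requestLine[0];
        path = requestLine.length > 1 ? requestLine[1] : "";

        // Remaining lines are "Name: value" headers.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                headers.put(lines[i].substring(0, colon).trim(),
                            lines[i].substring(colon + 1).trim());
            }
        }
    }

    public static void main(String[] args) {
        String raw = "POST /login HTTP/1.1\r\n"
                + "Host: mysite.com\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "\r\n"
                + "username=your-user-name&password=real-password";
        RawRequest request = new RawRequest(raw);
        System.out.println(request.method + " " + request.path);
        System.out.println(request.headers);
        System.out.println(request.body);
    }
}
```

When replaying such a request, headers that the client derives from the entity (Content-Length, for example) may need to be left out of the custom header map.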
Browser interaction? – Oops!
Selenium
Selenium + Java
Selenium (with demo)
WebDriver driver = new FirefoxDriver();
// Go to the login page
driver.get("https://mysite.com");
// put in the username
WebElement query = driver.findElement(By.name("username-element"));
query.sendKeys("your-user-name");
// put in the password
query = driver.findElement(By.name("password-element"));
query.sendKeys("real-password");
((JavascriptExecutor) driver).executeScript("javascript:whatever-login-script();");