Innoplexia DevTools to Crawl Webpages

Page 1: Innoplexia DevTools to Crawl Webpages

DevTools to crawl Webpages.

Page 2: Innoplexia DevTools to Crawl Webpages

DevTools

09.05.12 @chrschneider

Page 3: Innoplexia DevTools to Crawl Webpages

"… Apache … toolset of low level Java components focused on HTTP and associated protocols."

● HttpComponents Core: … is a set of low level HTTP transport components

● HttpComponents Client: … provides reusable components for client-side ... HTTP connection management.

● HttpComponents AsyncClient (DEV): … ability to handle a great number of concurrent connections ... more ... performance in terms of a raw data throughput.

● Commons HttpClient (Legacy): … All users of Commons HttpClient 3.x are strongly encouraged to upgrade to HttpClient 4.1.

Page 4: Innoplexia DevTools to Crawl Webpages

HttpComponents Client

Example Components

● Get, Post, Delete, … Request Objects

● Cookie Manager

● SSL

● Content Encoding Aware

● HTTP Authentication (Basic, Digest, ...)
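
To make two of the listed features concrete, here is a library-independent sketch that uses only the JDK, not HttpComponents itself. It shows what HTTP Basic authentication puts into the Authorization header, and what a content-encoding-aware client does with a gzip-compressed response body. The class and method names (HttpFeatureSketch, basicAuthHeader, gzip, gunzip) are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class HttpFeatureSketch {

    /** Builds the Authorization header value for HTTP Basic authentication. */
    public static String basicAuthHeader(final String user, final String password) {
        final byte[] credentials = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(credentials);
    }

    /** Compresses text the way a server would for "Content-Encoding: gzip". */
    public static byte[] gzip(final String text) {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decodes a gzip-encoded body; an encoding-aware client does this transparently. */
    public static String gunzip(final byte[] body) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(basicAuthHeader("user", "secret"));
        System.out.println(gunzip(gzip("<html>hello</html>")));
    }
}
```

With a client library that offers these features, neither step has to be written by hand: the header is built from configured credentials and the body arrives already decoded.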

Page 5: Innoplexia DevTools to Crawl Webpages

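HttpComponents Client Example

import org.apache.http.client.HttpClient;
import org.apache.http.client.ResponseHandler;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.BasicResponseHandler;
import org.apache.http.impl.client.DefaultHttpClient;

public class ClientWithResponseHandler {

    public final static void main(final String[] args) throws Exception {
        final HttpClient httpclient = new DefaultHttpClient();
        try {
            final HttpGet httpget = new HttpGet("http://www.google.com/");
            System.out.println("executing request " + httpget.getURI());

            // Create a response handler that returns the response body as a String
            final ResponseHandler<String> responseHandler = new BasicResponseHandler();
            final String responseBody = httpclient.execute(httpget, responseHandler);
            System.out.println("----------------------------------------");
            System.out.println(responseBody);
            System.out.println("----------------------------------------");
        } finally {
            // Release all connections held by the client
            httpclient.getConnectionManager().shutdown();
        }
    }
}

Source: http://hc.apache.org/httpcomponents-client-ga/examples.html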

Page 6: Innoplexia DevTools to Crawl Webpages

HttpComponents Client

Demo

Page 7: Innoplexia DevTools to Crawl Webpages

… is an asynchronous event-driven network application framework for rapid development of maintainable high performance protocol servers & clients.

See: http://netty.io/

Page 8: Innoplexia DevTools to Crawl Webpages

… is a "GUI-Less browser for Java programs"

Features (excerpt):

● Support for the HTTP and HTTPS protocols

● Support for cookies

● Ability to specify whether failing responses from the server should throw exceptions or should be returned as pages of the appropriate type (based on content type)

● Ability to customize the request headers being sent to the server

● Support for HTML responses

● Support for submitting forms

● Support for clicking links

● Support for walking the DOM model of the HTML document

● JavaScript support

Page 9: Innoplexia DevTools to Crawl Webpages

… is a "GUI-Less browser for Java programs"

@Test
public void homePage() throws Exception {
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");

    System.out.println(page.getTitleText());
    assertEquals("Welcome to HtmlUnit", page.getTitleText());

    final String pageAsXml = page.asXml();
    assertTrue(pageAsXml.contains("<body class=\"composite\">"));

    final String pageAsText = page.asText();
    assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));

    webClient.closeAllWindows();
}

http://htmlunit.sourceforge.net/gettingStarted.html

Page 10: Innoplexia DevTools to Crawl Webpages

… is a "GUI-Less browser for Java programs"

@Test
public void getElements() throws Exception {
    final WebClient webClient = new WebClient();
    final HtmlPage page = webClient.getPage("http://some_url");
    final HtmlDivision div = page.getHtmlElementById("some_div_id");
    final HtmlAnchor anchor = page.getAnchorByName("anchor_name");

    webClient.closeAllWindows();
}

Luxury :)

http://htmlunit.sourceforge.net/gettingStarted.html

Note: HTML tables are supported as well. HtmlUnit provides simple wrapper classes for walking through them. … Handy!
http://htmlunit.sourceforge.net/table-howto.html

Page 11: Innoplexia DevTools to Crawl Webpages

… automates browsers. That's it.

Selenium-WebDriver supports the following browsers along with the operating systems these browsers are compatible with.

● Google Chrome 12.0.712.0+

● Internet Explorer 6, 7, 8, 9 - 32 and 64-bit where applicable

● Firefox 3.0, 3.5, 3.6, 4.0, 5.0, 6, 7

● Opera 11.5+

● HtmlUnit 2.9

● Android – 2.3+ for phones and tablets (devices & emulators)

● iOS 3+ for phones (devices & emulators) and 3.2+ for tablets (devices & emulators)

Page 12: Innoplexia DevTools to Crawl Webpages

… automates browsers. That's it.

Selenium IDE

Selenium WebDriver

Selenium Grid

The Selenium Family

Also C#, Python, Ruby, ...

Also on Windows and Mac

Page 13: Innoplexia DevTools to Crawl Webpages

… automates browsers. That's it.

Selenium IDE

Selenium WebDriver

Selenium Grid

The Selenium Family

… create quick bug reproduction scripts

… create scripts to aid in automation-aided exploratory testing

… create robust, browser-based regression automation

… scale and distribute scripts across many environments

http://seleniumhq.org/

Page 14: Innoplexia DevTools to Crawl Webpages

Requirements for Selenium WebDriver with Firefox (and HtmlUnit)

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.21.0</version>
</dependency>

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-htmlunit-driver</artifactId>
    <version>2.21.0</version>
</dependency>

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-firefox-driver</artifactId>
    <version>2.21.0</version>
</dependency>

Dependencies and browser binaries: that's it.

Page 15: Innoplexia DevTools to Crawl Webpages

Basic Selenium example

@Test
public void testSeleniumWithFirefox() throws InterruptedException {
    final WebDriver webDriver = new FirefoxDriver();

    webDriver.get("http://www.majug.de");

    // "Veranstaltungen" is the German link text for "Events"
    final WebElement veranstaltungenLink = webDriver.findElement(By.linkText("Veranstaltungen"));
    veranstaltungenLink.click();

    // Keep the browser visible for five seconds, then close it
    Thread.sleep(5000);
    webDriver.quit();

}
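
The fixed Thread.sleep(5000) in this example only keeps the browser visible before it closes. When a crawl has to wait for content to appear, polling a condition with a timeout is usually more robust than a fixed sleep. The sketch below shows that idea using only the JDK. PollingWait and until are hypothetical names chosen for this illustration; Selenium ships its own wait utilities, which should be preferred in real WebDriver code.

```java
import java.time.Duration;
import java.util.function.Supplier;

public class PollingWait {

    /**
     * Calls the supplier repeatedly until it returns a non-null value
     * or the timeout elapses. Throws IllegalStateException on timeout.
     */
    public static <T> T until(final Supplier<T> condition,
                              final Duration timeout,
                              final Duration interval) {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            final T result = condition.get();
            if (result != null) {
                return result;
            }
            if (System.nanoTime() >= deadline) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
    }

    public static void main(final String[] args) {
        final long start = System.currentTimeMillis();
        // Stand-in for "the element is present": non-null after roughly 300 ms
        final String value = until(
                () -> System.currentTimeMillis() - start >= 300 ? "loaded" : null,
                Duration.ofSeconds(2),
                Duration.ofMillis(50));
        System.out.println(value);
    }
}
```

In WebDriver code the supplier could, for example, call findElements(...) and return null while the resulting list is still empty.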

Page 16: Innoplexia DevTools to Crawl Webpages

Selenium WebDriver Locator Strategies

It is also possible to call findElements(...) to get a List<WebElement>:

List<WebElement> hits = webDriver.findElements(By.tagName("a"));

Page 17: Innoplexia DevTools to Crawl Webpages

Selenium WebDriver Interactions

Once you have a WebElement, you can call:

● webElement.click() to click it

● webElement.sendKeys(...) to type into it

● webElement.submit() to submit the form it belongs to

It is also possible to perform more complex interactions like DoubleClick, DragAndDrop, ClickAndHold, … with the "Actions" class.

Page 18: Innoplexia DevTools to Crawl Webpages

Selenium WebDriver

Demo

Page 19: Innoplexia DevTools to Crawl Webpages

Selenium WebDriver Pitfalls

Newbie Pitfalls:

● Selenium doesn't wait until the whole page is loaded (keyword: implicit wait).

● webElement.findElement(By.xpath("//...")) starts from the root of the DOM (use ".//..." instead).

● Google brings up "Selenium RC" solutions. This is the old Selenium project.

● A reference to a WebElement becomes invalid if the driver "moves" to another page.

● Firefox doesn't run on our CI because it is a headless system (try Xvfb).

● New XPath 2.0 functions (like ends-with(...)) fail. This is because Selenium uses the driver's native XPath engine. For Firefox this means XPath 1.0 today.
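
The two XPath pitfalls can be reproduced without a browser. The sketch below uses the JDK's built-in XPath 1.0 engine (javax.xml.xpath) rather than Selenium, so it is an analogy under that assumption; the sample markup and the names XPathPitfalls, PAGE and ENDS_WITH_PDF are invented for the illustration. It shows that "//a" evaluated against an element still searches the whole document while ".//a" stays inside that element, and how ends-with(...) can be emulated with substring(...) in XPath 1.0.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathPitfalls {

    /** Invented sample page: one link in the navigation, two in the content area. */
    public static final String PAGE =
            "<html><body>"
            + "<div id='nav'><a href='/home'>Home</a></div>"
            + "<div id='content'><a href='/a.pdf'>A</a><a href='/b.html'>B</a></div>"
            + "</body></html>";

    /** XPath 1.0 replacement for ends-with(@href, '.pdf'), relative to the context node. */
    public static final String ENDS_WITH_PDF =
            ".//a[substring(@href, string-length(@href) - string-length('.pdf') + 1) = '.pdf']";

    public static Document parse(final String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the first node the expression selects, evaluated against the context node. */
    public static Node first(final String expression, final Node context) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the nodes the expression selects, evaluated against the context node. */
    public static int count(final String expression, final Node context) {
        try {
            final NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(final String[] args) {
        final Document page = parse(PAGE);
        final Node content = first("//div[@id='content']", page);

        System.out.println(count("//a", content));         // 3: searched from the document root
        System.out.println(count(".//a", content));        // 2: only links inside the content div
        System.out.println(count(ENDS_WITH_PDF, content)); // 1: only the link ending in .pdf
    }
}
```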

Page 20: Innoplexia DevTools to Crawl Webpages

Any questions? Thank you for your attention!