PHP Tutorial 2: Advanced Data Scraping Using cURL And XPATH
Matthew Watts | Tutorials | 2010-12-18

Ever wanted to get a list of information such as URLs, Articles, tabular data, or whatever else that you know is on one website or across multiple websites, then manipulate it to reuse elsewhere? Stop wondering, because we are about to get down to business!

There’re many ways to scrape / mine data, but I’ve found that the easiest and most efficient way is to use a combination of cURL and XPATH. cURL is neat because it will easily let you use proxies, manipulate browser information, catch errors, etc. XPATH is great because you don’t need to write a bunch of regular expressions or other functions to manage the data – you just manipulate the DOM tree with a single string and you can get the group of elements that you want in an array. Both of these modules are available in nearly every programming language, so your code, if written correctly, can easily be ported from one language to another.

What We’ll Touch On

• Helpful Scraping Tools
• Pre Coding
• Writing the Script
• Final Thoughts
• Download the Script

Before We Begin

Our Goal
I called this post “advanced data scraping” because we are going to be doing more than just getting a small bit of information. We are going to be crawling multiple pages, gathering URLs, and then parsing those URLs and scraping data on those pages. The techniques taught here are used by many programmers for a variety of reasons, and will even let you scrape pages rendered entirely in JavaScript (such as Google)!

So, we’ll start with a generic site meeting that criteria, such as EzineArticles.

Our Tools
If you don’t have Firebug yet, you need it. It’s amazing for quickly checking the elements of a page, especially when setting up XPATH queries. You’ll also want to grab XPath Checker to quickly see if your XPATH queries are selecting the correct elements. Both are Firefox extensions.

Pre Coding
Before we start coding, we want to gather our XPATH queries to make sure we can select the data we want and get a feel for how we’re going to program the scraper.

First, we need to figure out what section of EzineArticles we’re going to scrape. I think Business >> Ethics is a suitable section. We have two options here: scrape only the RSS feed, or scrape the pages directly. For this tutorial, we are scraping the pages directly so that we can get all the articles, not just the 100 or so most recent. It also lets you see how you would scrape sites with pagination (a section broken up into multiple pages of 10, 20, 30, etc. items).

[Screenshot: Category section page]

[Screenshot: EzineArticles article snippets in the DOM]

In order to see how you want to go about selecting sections of the DOM with XPATH, you need to look at the code. Firebug is great for this.

On the section page, the XPATH query to get the information we want is this:

//div[@class='ea-category-list']/ol/li

That query will get an array of every list element within a div that contains the “ea-category-list” class. The list items contain a link to the article, a short description, and the author; so, the only thing we lack now to complete all the article information is the article itself.

If you want a more in-depth explanation, the plugin page for XPath Checker has one, as does this page about XPath from W3Schools.
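A quick way to sanity-check a query before wiring it into the scraper is to run it against a small HTML snippet from the command line. The markup below is a made-up stand-in for the real category page, not EzineArticles’ actual HTML:

```php
<?php
// Stand-in markup mimicking the structure the XPATH query targets.
$html = <<<HTML
<div class="ea-category-list">
  <ol>
    <li><a href="/article-1">First Article</a></li>
    <li><a href="/article-2">Second Article</a></li>
  </ol>
</div>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);   // @ silences warnings about imperfect markup
$xpath = new DOMXPath($dom);

// The same query used above: every <li> under the category list div.
$items = $xpath->query("//div[@class='ea-category-list']/ol/li");

echo $items->length;      // prints 2 for the snippet above
```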

While we are still on the section page, we might as well get the XPATH query that will check if there are still more pages to the section so we can continue grabbing URLs:

//div[@class='ea-category-list']/p[@class='title']/a[text()='Next 30']/@href


This XPATH query is a little different, because we need to check if there is a link for “Next 30”. This type of situation is what makes XPATH such an excellent choice for scraping. Now checking whether there are more pages is as simple as popping in an “if” statement to see if the query returned any elements. In addition, I added “@href” at the end, which returns only the URL of the link, to save some coding.
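One gotcha that bites people here: DOMXPath::query() returns a DOMNodeList object, which is truthy even when it matches nothing, so test its length property rather than the object itself. A standalone sketch of the “Next 30” check, again with stand-in markup and a made-up href:

```php
<?php
$html = <<<HTML
<div class="ea-category-list">
  <p class="title"><a href="/page/2">Next 30</a></p>
</div>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Ending the query with @href makes each result the attribute node itself.
$next = $xpath->query("//div[@class='ea-category-list']/p[@class='title']/a[text()='Next 30']/@href");

// A DOMNodeList is always truthy - check length to see if anything matched.
if ($next->length > 0) {
    echo $next->item(0)->nodeValue;   // the raw href string
}
```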

On to the Article Page
On the article page, the XPATH query to get the information we want is this:

id('body')

This one was simple because EzineArticles puts all their articles into a div with an id.

Writing the Script
Now that we have the XPATH queries, we can begin coding. We’ll want to make this into a class, because you never know if you will want to reuse the script later. We might as well go ahead and take a few additional minutes so that we can use it for “any” section on EzineArticles.

The Constructor
PHP classes, like those in most object-oriented programming languages, let you define a constructor that fires certain actions once the class has been instantiated. This is especially useful for our needs, since we can have it do everything as soon as we pass it a URL at instantiation.

class Scraper {
    protected $articles = array();
    protected $domain;

    // Set actions to run when the class is instantiated
    function __construct($url){
        // Set the maximum execution time of the script to unlimited so that it
        // can grab all the articles if there are a lot of them to scrape
        set_time_limit(0);

        // Set the root domain of the URL to concatenate with URLs later
        $this->domain = explode("/", $url);
        $this->domain = 'http://' . $this->domain[2];

        // Pass the page URL you want to start scraping and start scraping
        // through the section pages
        $this->getArticleUrls($url);

        echo count($this->articles) . ' - Done counting article items, now adding articles.<br>';

        // Loop through the article pages and grab the full article to finish
        // populating the articles array with data
        foreach ($this->articles as &$item){
            $item['article'] = $this->getArticles($item['url']);
        }
        unset($item); // break the reference left by the foreach

Note the & on $item: without it, the loop variable is a copy and the fetched article would never actually be saved back into $this->articles.

echo count($this->articles) . ' - Done adding articles.';

        // Add function here to start adding items in the article array with articles to a database
    }

The above block of code creates a constructor that does three things (aside from the printouts to the screen, which can be eliminated): it takes the given URL and extracts the root domain name (since EzineArticles uses relative URLs instead of full URLs for its article pages), calls the getArticleUrls function, and calls the getArticles function for each article. So, when we instantiate the Scraper class, those three things happen without any further interaction needed.
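The root-domain trick works because splitting an absolute URL on "/" leaves the host at index 2 ("http:", "", "ezinearticles.com", ...). Here is that step in isolation, alongside PHP’s built-in parse_url(), which does the same job and also copes with edge cases like ports:

```php
<?php
$url = 'http://ezinearticles.com/?cat=Business:Ethics';

// The constructor's approach: index 2 of the split is the host name.
$parts  = explode('/', $url);
$domain = 'http://' . $parts[2];
echo $domain . "\n";   // http://ezinearticles.com

// Equivalent result using parse_url().
$domain2 = 'http://' . parse_url($url, PHP_URL_HOST);
```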

Get Article URLs Function

    // Start Get Article Urls
    private function getArticleUrls($url){
        // Instantiate next page variable to check at the end
        $nextPageUrl = NULL;

        // Instantiate cURL to grab the HTML page.
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_HEADER, false);
        curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
        curl_setopt($c, CURLOPT_FAILONERROR, true);
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($c, CURLOPT_AUTOREFERER, true);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_TIMEOUT, 10);
        // Add curl_setopt here to grab a proxy from your proxy list so that you
        // don't get 403 errors from your IP being banned by the site

// Grab the data. $html = curl_exec($c);


This part of the function sets up cURL. It’s basically being told to not return header information, to call the function I set up to randomly return a common browser’s user agent, to fail if it hits an error, to manage the referer itself, to return the page instead of displaying it once it fetches the URL, and to time out after 10 seconds so we won’t be waiting forever if the site is slow.

I’m not going to get into how to setup and use proxies with cURL, but it’d be wise if you did that since EzineArticles, and many high-traffic websites, will limit the number of requests per second or ban IP addresses if they get a lot of traffic from a single IP. It’s fairly simple to have a text file of proxies and randomly choose one to use with cURL.
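As a rough sketch of that idea (the proxies.txt name and the host:port entries here are my assumptions, not part of the tutorial’s download): load the file into an array, pick an entry at random, and hand it to cURL via CURLOPT_PROXY before calling curl_exec():

```php
<?php
// Pick a random proxy from a list. In practice, load the list with:
// $proxies = file('proxies.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
function pickProxy(array $proxies) {
    return $proxies[rand(0, count($proxies) - 1)];
}

$proxies = array('10.0.0.1:8080', '10.0.0.2:3128');   // placeholder addresses
$proxy   = pickProxy($proxies);

$c = curl_init('http://ezinearticles.com/?cat=Business:Ethics');
curl_setopt($c, CURLOPT_PROXY, $proxy);         // route this request through the proxy
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
// ...set the remaining options as in getArticleUrls(), then curl_exec($c).
curl_close($c);
```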

The last line executes cURL with the above-mentioned options and stores the result in a variable named $html.

        // Check if the HTML didn't load right; if it didn't, report an error
        if (!$html) {
            echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
                 "<p>cURL error: " . curl_error($c) . "</p>";
        }

        // Close connection.
        curl_close($c);

        // Parse the HTML information and return the results.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

        // Get a list of articles from the section page
        $articleList = $xpath->query("//div[@class='ea-category-list']/ol/li");

        // Add each article to the Articles array
        foreach ($articleList as $item){
            $this->articles[] = array(
                'url'         => $this->domain . $item->getElementsByTagName('a')->item(0)->getAttribute('href'),
                'title'       => $item->getElementsByTagName('a')->item(0)->nodeValue,
                'author'      => $item->getElementsByTagName('em')->item(0)->getElementsByTagName('a')->item(0)->nodeValue,
                'description' => $item->getElementsByTagName('div')->item(0)->nodeValue,
                'article'     => ''
            );
        }

The above block of code will output any errors cURL hits while fetching the pages and then close the connection. It’s good practice to close a connection as soon as you don’t need it anymore; it frees up memory and CPU on your server.

Next, a DOM object is instantiated and the page we fetched through cURL is parsed into DOM elements; a DOMXPath object is then created so we can traverse them with XPATH.

The XPATH query we set up earlier is plugged into the code, and that little bit of the page is stored in $articleList, which we loop over with foreach to grab the bits of information we need and store them in our $articles array.

In case you hadn’t noticed, I’m storing an associative array for each article. This makes it super easy to grab the specific information I want about an article later in the code, and it keeps things simple if I decide to store the array information in a database later.
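Since each article is already an associative array, persisting one maps directly onto table columns. A hedged sketch using PDO with an in-memory SQLite database; the articles table and its schema here are mine, not something the tutorial defines:

```php
<?php
// In-memory database for illustration; swap the DSN for your real server.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE articles (url TEXT, title TEXT, author TEXT, description TEXT, article TEXT)");

// The same shape getArticleUrls() builds for each article.
$item = array(
    'url'         => 'http://ezinearticles.com/example-article',
    'title'       => 'Example Title',
    'author'      => 'Jane Doe',
    'description' => 'A short description.',
    'article'     => 'Full article body.',
);

// Named placeholders line up one-to-one with the array's keys.
$stmt = $db->prepare("INSERT INTO articles (url, title, author, description, article)
                      VALUES (:url, :title, :author, :description, :article)");
$stmt->execute($item);
```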

Also, take note of how I’m setting the information for each array element. Each DOM element that I call $item in the foreach loop can be accessed just as you would if you’re familiar with JavaScript. It shares the same DOM model for accessing each element, tag, and attribute – which is another reason that XPATH is so powerful and easy to use.

        // Check to see if the Next 30 link is active
        $nextPageUrl = $xpath->query("//div[@class='ea-category-list']/p[@class='title']/a[text()='Next 30']/@href");

        if ($nextPageUrl && $nextPageUrl->length > 0){
            // query() returns a DOMNodeList, which is always truthy - check its length
            $nextPageUrl = $nextPageUrl->item(0)->nodeValue;

            // If there is a next page, go to it.
            if ($nextPageUrl != ""){
                $this->getArticleUrls($nextPageUrl);
            }
        }
    } // End Get Article Urls

This last bit of the function checks to see if there’s a link for “Next 30”. If there is, it calls the getArticleUrls function again. This is called recursion. Recursive programming is very powerful because it lets you keep using the same code again and again until the job is done. The function is its own loop!
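Boiled down and detached from cURL, the recursive pagination pattern looks like this; the $pages array is a stand-in for fetched section pages, where each entry lists its article URLs and names the next page (or null on the last one):

```php
<?php
// Fake paginated data standing in for the real section pages.
$pages = array(
    'page1' => array('articles' => array('/a', '/b'), 'next' => 'page2'),
    'page2' => array('articles' => array('/c'),       'next' => null),
);

function collectUrls($pageKey, array $pages, array &$urls) {
    // Gather this page's article URLs...
    foreach ($pages[$pageKey]['articles'] as $u) {
        $urls[] = $u;
    }
    // ...then recurse into the next page, if there is one.
    // The function is its own loop: it stops when 'next' is empty.
    if (!empty($pages[$pageKey]['next'])) {
        collectUrls($pages[$pageKey]['next'], $pages, $urls);
    }
}

$urls = array();
collectUrls('page1', $pages, $urls);
echo count($urls);   // prints 3
```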

Get Article Function
I’m not going to display this one, since it contains code that’s been touched on earlier: setting up and executing cURL and XPATH.

Final Thoughts
Scraping data is a very powerful skill to have. There are a ton of ways to use scraped data: selling databases, spinning articles gathered from sites to use on your Web 2.0 or blog properties, etc. The possibilities are endless and limited only by your imagination.

Keep in mind, not all scraping is the same. Manipulating data to reuse and build on (such as grabbing a list of sports stats from ESPN) is not the same as grabbing an article off EzineArticles or someone’s site and showing it on your own site. The latter is copyright infringement, so be smart about what you do with scraped data.

I used EzineArticles as an example; I’m not advocating ripping their articles without giving credit to the author, linking back to the main article, and leaving their links intact.

Download the Script
You can get the scraping script in a text file. Have fun and go make some money!

<?php
/*===================================================
 * Title: EzineArticles Scraping Class
 * For: Scraping EzineArticles and managing the data
 * Author: Matthew Watts - http://www.matthewwatts.net
 * Date Created: 2010-12-18
 * Last Modified by: Matthew Watts
 * Last Modified: 2010-12-18
===================================================*/

$scrape = new Scraper('http://ezinearticles.com/?cat=Business:Ethics');

class Scraper {
    protected $articles = array();
    protected $domain;

    // Set actions to run when the class is instantiated
    function __construct($url){
        // Set the maximum execution time of the script to unlimited so that it
        // can grab all the articles if there are a lot of them to scrape
        set_time_limit(0);

        // Set the root domain of the URL to concatenate with URLs later
        $this->domain = explode("/", $url);
        $this->domain = 'http://' . $this->domain[2];

        // Pass the page URL you want to start scraping and start scraping
        // through the section pages
        $this->getArticleUrls($url);

        echo count($this->articles) . ' - Done counting article items, now adding articles.<br>';

        // Loop through the article pages and grab the full article to finish
        // populating the articles array with data (note the &: without it,
        // $item would be a copy and the article would never be saved back)
        foreach ($this->articles as &$item){
            $item['article'] = $this->getArticles($item['url']);
        }
        unset($item);

        echo count($this->articles) . ' - Done adding articles.';

        // Add function here to start adding items in the article array with articles to a database
    }

    // Start Get Article Urls
    private function getArticleUrls($url){
        // Instantiate next page variable to check at the end
        $nextPageUrl = NULL;

        // Instantiate cURL to grab the HTML page.
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_HEADER, false);
        curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
        curl_setopt($c, CURLOPT_FAILONERROR, true);
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($c, CURLOPT_AUTOREFERER, true);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_TIMEOUT, 10);
        // Add curl_setopt here to grab a proxy from your proxy list so that you
        // don't get 403 errors from your IP being banned by the site

        // Grab the data.
        $html = curl_exec($c);

        // Check if the HTML didn't load right; if it didn't, report an error
        if (!$html) {
            echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
                 "<p>cURL error: " . curl_error($c) . "</p>";
        }

        // Close connection.
        curl_close($c);

        // Parse the HTML information and return the results.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        $xpath = new DOMXPath($dom);

        // Get a list of articles from the section page
        $articleList = $xpath->query("//div[@class='ea-category-list']/ol/li");

        // Add each article to the Articles array
        foreach ($articleList as $item){
            $this->articles[] = array(
                'url'         => $this->domain . $item->getElementsByTagName('a')->item(0)->getAttribute('href'),
                'title'       => $item->getElementsByTagName('a')->item(0)->nodeValue,
                'author'      => $item->getElementsByTagName('em')->item(0)->getElementsByTagName('a')->item(0)->nodeValue,
                'description' => $item->getElementsByTagName('div')->item(0)->nodeValue,
                'article'     => ''
            );
        }

        // Check to see if the Next 30 link is active (check the node list's
        // length - the list object itself is always truthy)
        $nextPageUrl = $xpath->query("//div[@class='ea-category-list']/p[@class='title']/a[text()='Next 30']/@href");

        if ($nextPageUrl && $nextPageUrl->length > 0){
            $nextPageUrl = $nextPageUrl->item(0)->nodeValue;

            // If there is a next page, go to it.
            if ($nextPageUrl != ""){
                $this->getArticleUrls($nextPageUrl);
            }
        }
    } // End Get Article Urls

    // Start Get Articles
    private function getArticles($url){
        // Instantiate cURL to grab the HTML page.
        $c = curl_init($url);
        curl_setopt($c, CURLOPT_HEADER, false);
        curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
        curl_setopt($c, CURLOPT_FAILONERROR, true);
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($c, CURLOPT_AUTOREFERER, true);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($c, CURLOPT_TIMEOUT, 10);

        // Grab the data.
        $html = curl_exec($c);

        // Check if the HTML didn't load right; if it didn't, report an error
        if (!$html) {
            echo "<p>cURL error number: " . curl_errno($c) . " on URL: " . $url . "</p>" .
                 "<p>cURL error: " . curl_error($c) . "</p>";
        }

        // Close connection.
        curl_close($c);

        // Parse the HTML information and return the results.
        $dom = new DOMDocument();
        @$dom->loadHTML($html);

        $xpath = new DOMXPath($dom);

        // Get the article body from the page
        $article = $xpath->query("id('body')");


        return $article->item(0)->nodeValue;
    } // End Get Articles

    // Start Get Browser User Agent
    private function getUserAgent(){
        // Set an array with different browser user agents
        $agents = array(
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; bgft)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; GTB5; User-agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; http://bsalsa.com) ; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Tablet PC 2.0)",
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)",
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Orange 8.0; GTB6.3; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Embedded Web Browser from: http://bsalsa.com/; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3; OfficeLivePatch.1.3)",
            "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 1.1.4322; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)",
            "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.3) Gecko/20100401 Firefox/4.0 (.NET CLR 3.5.30729)",
            "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.8) Gecko/20100722 BTRS86393 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0C)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; WIndows NT 9.0; en-US)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; Media Center PC 6.0; InfoPath.3; MS-RTC LM 8; Zune 4.7)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Zune 4.0; InfoPath.3; MS-RTC LM 8; .NET4.0C; .NET4.0E)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 
(compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)", "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Zune 4.0; Tablet PC 2.0; InfoPath.3; .NET4.0C; .NET4.0E)", "Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 5.1; Trident/5.0)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0;


Media Center PC 4.0; SLCC1; .NET CLR 3.0.04320)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 1.1.4322)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 2.0.50727)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727)", "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.0; Trident/4.0; InfoPath.1; SV1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 3.0.04506.30)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.2; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; InfoPath.2)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; Zune 3.0)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; msn OptimizedIE8;ZHCN)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MS-RTC LM 8; InfoPath.3; .NET4.0C; .NET4.0E)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MS-RTC LM 8; .NET4.0C; .NET4.0E; Zune 4.7; InfoPath.3)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; MS-RTC LM 8)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; 
WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; Zune 4.0)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; OfficeLiveConnector.1.4; OfficeLivePatch.1.3; yie8)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; Zune 3.0; MS-RTC LM 8)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; MS-RTC LM 8; Zune 4.0)",


"Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; MS-RTC LM 8)", "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.2; FDM; OfficeLiveConnector.1.4; OfficeLivePatch.1.3; .NET CLR 1.1.4322)", "Opera/9.99 (Windows NT 5.1; U; pl) Presto/9.9.9", "Opera/9.80 (J2ME/MIDP; Opera Mini/5.0 (Windows; U; Windows NT 5.1; en) AppleWebKit/886; U; en) Presto/2.4.15", "Opera/9.70 (Linux ppc64 ; U; en) Presto/2.2.1", "Opera/9.70 (Linux i686 ; U; zh-cn) Presto/2.2.0", "Opera/9.70 (Linux i686 ; U; en-us) Presto/2.2.0", "Opera/9.70 (Linux i686 ; U; en) Presto/2.2.1", "Opera/9.70 (Linux i686 ; U; en) Presto/2.2.0", "Opera/9.70 (Linux i686 ; U; ; en) Presto/2.2.1", "Opera/9.70 (Linux i686 ; U; ; en) Presto/2.2.1", "Mozilla/5.0 (Linux i686 ; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.70", "Mozilla/4.0 (compatible; MSIE 6.0; Linux i686 ; en) Opera 9.70", "Opera/9.64(Windows NT 5.1; U; en) Presto/2.1.1", "Opera/9.64 (X11; Linux x86_64; U; pl) Presto/2.1.1", "Opera/9.64 (X11; Linux x86_64; U; hr) Presto/2.1.1", "Opera/9.64 (X11; Linux x86_64; U; en-GB) Presto/2.1.1", "Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1", "Opera/9.64 (X11; Linux x86_64; U; de) Presto/2.1.1", "Opera/9.64 (X11; Linux x86_64; U; cs) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; tr) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; sv) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; pl) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; nb) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; Linux Mint; nb) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; Linux Mint; it) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; en) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; de) Presto/2.1.1", "Opera/9.64 (X11; Linux i686; U; da) Presto/2.1.1", "Opera/9.64 (Windows NT 6.1; U; MRA 
5.5 (build 02842); ru) Presto/2.1.1", "Opera/9.64 (Windows NT 6.1; U; de) Presto/2.1.1", "Opera/9.64 (Windows NT 6.0; U; zh-cn) Presto/2.1.1", "Opera/9.64 (Windows NT 6.0; U; pl) Presto/2.1.1", "Opera 9.7 (Windows NT 5.2; U; en)", "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-HK)


AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (Windows; U; Windows NT 6.0; tr-TR) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (Windows; U; Windows NT 6.0; nb-NO) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (Windows; U; Windows NT 6.0; fr-FR) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (Windows; U; Windows NT 5.1; ru-RU) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; zh-cn) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5", "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; de-de) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5", "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_2_1 like Mac OS X; da-dk) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5", "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; ja-jp) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5", "Mozilla/5.0 (X11; U; Linux x86_64; en-ca) AppleWebKit/531.2+ (KHTML, like Gecko) Version/5.0 Safari/531.2+", "Mozilla/5.0 (Windows; U; Windows NT 6.1; ja-JP) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Windows; U; Windows NT 6.1; es-ES) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Windows; U; Windows NT 6.0; ja-JP) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_5_8; ja-jp) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; PPC 
Mac OS X 10_4_11; fr) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; zh-cn) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; ru-ru) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; ko-kr) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; it-it) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-us)


AppleWebKit/534.1+ (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; en-au) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; el-gr) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_3; ca-es) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; zh-tw) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; ja-jp) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; it-it) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; fr-fr) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16",
            "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; es-es) AppleWebKit/533.16 (KHTML, like Gecko) Version/5.0 Safari/533.16"
        );
        return $agents[rand(0, (count($agents)-1))];
    } // End Get Browser User Agent
} // End Scraper Class
?>


Tags: data mining, programming, scraping data, web content, web scraping


Address: http://www.matthewwatts.net/tutorials/php-tutorial-2-advanced-data-scraping-using-curl-and-xpath/

3 comments until now

1. Rifanoz @ 2011-07-13 00:06

Thanks for great codes.

There’s an error on

$articleList = $xpath->query("//div[@class='ea-category-list']/ol/li");

and

$article = $xpath->query("id('body')");

Ezine has changed it.

I’ve fixed it:

$articleList = $xpath->query("//div[@class='category-list']/div[@class='article']");

$article = $xpath->query("id('article-content')");

I hope it useful.

2. nonyck @ 2011-07-19 18:12

You didn’t show how to parse JavaScript-generated content with PHP…

3. Matthew Watts @ 2011-08-11 16:44

I didn’t show it because it’s not relevant to the tutorial. The only decent way to parse JavaScript with any consistency is using a browser emulator such as Watir for Ruby or Selenium for Python and Java. You can do it with PHP, but it’s not consistent. In any event, it still involves parsing the DOM the same way as here, using something such as XPATH; you just have to refresh the DOM you’ve gathered every time you make an action to see changes made through the JS on the page.