Webmaster Blog

Easy Screen Scraping in PHP with the Simple HTML DOM Library

by Akash Mehta


Client-side developers have always had it easy - libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get rather messy, DOM calls can be confusing and verbose, and often the string functions just aren’t enough. In this tutorial, I’ll show you the middle ground - the open source PHP Simple HTML DOM Parser library, which provides jQuery-grade awesomeness for easy screen scraping without messy regular expressions.

The Simple HTML DOM Parser is implemented as a simple PHP class and a few helper functions. It supports CSS selector style screen scraping (such as in jQuery), can handle invalid HTML, and even provides a familiar interface to manipulate a DOM.

Here’s a sample of simplehtmldom in action:

// Adjust the path to wherever you extracted the library
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');
// Print the href of every anchor on the page
foreach ($html->find('a') as $element)
    echo $element->href . "\n";

This snippet is fairly self-explanatory - file_get_html() is a simple helper function in the library that fetches the page and constructs a new simplehtmldom object around it. Once the object is available, we can use simple CSS selectors to find our elements - in this case, anchors - and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is roughly twice as long.)
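
For comparison, here is a rough sketch of what the same link listing might look like with PHP 5’s built-in DOM classes. It is only an illustration: the inline HTML string and the variable names are made up so the example runs offline, and it uses nothing beyond the standard DOMDocument API.

```php
<?php
// Standard DOM version of the "print every link" example.
// The HTML below is a made-up stand-in for a fetched page.
$source = '<html><body>
<a href="http://example.com/one">One</a>
<p>Unclosed paragraph with <a href="http://example.com/two">Two</a>
</body></html>';

// Collect parser warnings internally instead of printing them for sloppy markup.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

// Gather the href of every anchor, in document order.
$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // getAttribute() returns an empty string when the attribute is absent.
    $links[] = $anchor->getAttribute('href');
}

foreach ($links as $href) {
    echo $href, "\n";
}
```

To try this against a live page, `$doc->loadHTMLFile($url)` could replace the `loadHTML()` call, assuming `allow_url_fopen` is enabled on your server. Note that getElementsByTagName() only matches by tag name; anything closer to a CSS selector needs DOMXPath, which is where the extra verbosity creeps in.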

But the library doesn’t stop there - as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:

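$html = str_get_html('
<div id="hello">Hello</div>
<div id="world">World</div>
');

// Add a class to the second div (the index is zero-based)
$html->find('div', 1)->class = 'bar';

// Replace the contents of the div with id="hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';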

The library supports many DOM-style approaches for manipulation, from exposing real attributes as shown here, to a few helper methods. It also includes methods for moving around the tree from the current node - children(), parent(), first_child() and so on.
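
As a small sketch of those traversal methods, the snippet below walks a tiny list. The markup and variable names are invented for illustration, and it assumes simple_html_dom.php from the library download sits in the same directory as the script.

```php
<?php
// Assumption: simple_html_dom.php (from the library download) is next to this script.
require_once 'simple_html_dom.php';

// Made-up markup, just for illustration.
$html = str_get_html('<ul id="menu"><li>Home</li><li>Blog</li><li>Forum</li></ul>');

$menu = $html->find('ul[id=menu]', 0);

// first_child() returns the first child element of the node.
$first = $menu->first_child();
echo $first->plaintext, "\n";

// children() with no argument returns an array of all child elements.
$labels = array();
foreach ($menu->children() as $child) {
    $labels[] = $child->plaintext;
}
echo implode(', ', $labels), "\n";

// parent() walks back up: the parent of the first <li> is the <ul>.
$parentTag = $first->parent()->tag;
echo $parentTag, "\n";
```

These calls can be chained with find(), so it is often convenient to select a container with a CSS selector and then step through its children by hand.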

Real scraping? Easy. Here’s their Slashdot sample:

$html = file_get_html('http://slashdot.org/');

$articles = array();
foreach ($html->find('div.article') as $article) {
    // Start from a fresh array so values never carry over between articles
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

And finally, there’s always a simple save mechanism:

$html->save('altered-dom.html');

Ready to get started? Head over to the project website, online documentation or the project page on SourceForge.




 


This post has 9 Responses so far.
  1. David Mobley Says:
    August 6th, 2008 at 11:00 am

    How do I set this up with a server running locally? I don’t understand how to install the simplehtmldom library.

  2. WC Says:
    August 6th, 2008 at 11:03 am

    Neat library.

    BTW…
    foreach($html->find('a' as $element))
    should be
    foreach($html->find('a') as $element)
    no?

  3. Matthew Turland Says:
    August 6th, 2008 at 9:07 pm

    @WC Yes.

    While I like several things about jQuery, I have to say that the syntax offers somewhat of a sharp learning curve for those not already familiar with CSS selectors. I have my own library similar to the Simple HTML DOM Parser project that offers similar functionality programmatically through the API rather than having the overhead of an expression parser in PHP userland for the sake of extreme concision in API calls. You can find my library here: http://svn.assembla.com/svn/php_domquery/trunk. The vast majority of the code is covered by unit tests and all API methods are documented using phpDoc-style docblocks.

  4. Mohsen Says:
    August 7th, 2008 at 12:42 am

    Thanks! I was looking for an HTML scraper.

  5. Akash Mehta Says:
    August 7th, 2008 at 1:39 am

    Thanks WC.

    @David: Just download the library .zip file, extract the files, locate the library .php file somewhere readable on your server and include it. Also check out the manual included in the download.

    @Matthew: Interesting point, and thanks for mentioning DOM Parser. As far as I can see, simplehtmldom is just a wrapper around the DOM APIs in PHP 5, but it’s a very high level of abstraction and it does a lot in between (such as all these complex selectors). There are some regular expressions with considerable overhead, but it’s still reasonably efficient.

  6. drew Says:
    August 8th, 2008 at 1:20 am

    mmm, tasty. i’ve been looking for something like this.

  7. Mohammad Says:
    August 8th, 2008 at 6:47 am

    Nice, I should try it.
    I used htmlsql [http://www.jonasjohn.de/lab/htmlsql.htm] and it was good; very fast, and easy to learn and write code for, since it reads like SQL: "SELECT href FROM a WHERE $class='newsLinks'".
    the only problem was that it could not parse the invalid html; for example ()

  8. Ian Says:
    August 11th, 2008 at 12:39 pm

    This looks promising. Why not try to get this into PECL or PEAR?

  9. seo Says:
    September 5th, 2008 at 4:56 am

    simple and brief intro, nice work there!
