
Easy Screen Scraping in PHP with the Simple HTML DOM Library

by Akash Mehta


Client-side developers have always had it easy: libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get rather messy, DOM calls can be confusing and verbose, and the string functions often just aren’t enough. In this tutorial, I’ll show you the middle ground: the open source PHP Simple HTML DOM Parser library, which provides jQuery-grade awesomeness for easy screen scraping without messy regular expressions.

The Simple HTML DOM Parser is implemented as a simple PHP class and a few helper functions. It supports CSS-selector-style queries (as in jQuery), can handle invalid HTML, and even provides a familiar interface for manipulating the DOM.

Here’s a sample of simplehtmldom in action:

// Adjust the path to wherever you extracted the library
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');
 
// Print the target of every link on the page, one per line
foreach ($html->find('a') as $element) {
    echo $element->href . "\n";
}

This snippet is fairly self-explanatory: file_get_html() is a simple helper function in the library that fetches the page and constructs a new simplehtmldom object around it. Once the object is available, we can use simple CSS selectors to find our elements (in this case, anchors) and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is roughly twice as long.)
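
For comparison, here is a rough sketch of the same link listing written against PHP’s built-in DOM extension. The inline sample markup is an assumption standing in for a fetched page, so the sketch runs without network access; it is not taken from the library or its documentation.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$markup = '<p><a href="/one">One</a> <a href="/two">Two</a> <a name="x">No href</a></p>';

$doc = new DOMDocument();
// Real pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

// Gather the href of every anchor that actually has one.
$links = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $links[] = $anchor->getAttribute('href');
    }
}

echo implode("\n", $links), "\n";
```

Even for this small task, the document setup and error handling take more lines than the selector-based version.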

But the library doesn’t stop there – as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:

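$html = str_get_html('
<div id="hello">Hello</div>
<div id="world">World</div>
');
 
// Add a class attribute to the second div (indexes are zero-based)
$html->find('div', 1)->class = 'bar';
 
// Replace the contents of the div whose id is "hello"
$html->find('div[id=hello]', 0)->innertext = 'foo';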

The library supports many DOM-style approaches for manipulation, from exposing element attributes as object properties, as shown here, to a few helper methods. It also includes methods for traversing from the current node: children(), parent(), first_child() and so on.
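
To make the manipulation and traversal ideas concrete without depending on the library, the following sketch performs the same two edits with PHP’s built-in DOM classes and then walks to neighbouring nodes. The mapping to parent(), first_child() and friends in the comments is an analogy, not a statement about how simplehtmldom is implemented.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div id="hello">Hello</div><div id="world">World</div>');
$xpath = new DOMXPath($doc);

// Add a class attribute to the second div (index 1).
$divs = $doc->getElementsByTagName('div');
$divs->item(1)->setAttribute('class', 'bar');

// Replace the text of the div whose id is "hello".
// Note: nodeValue sets plain text, not markup.
$hello = $xpath->query('//div[@id="hello"]')->item(0);
$hello->nodeValue = 'foo';

// Traversal from the current node.
$parent     = $hello->parentNode;    // comparable to parent()
$firstChild = $parent->firstChild;   // comparable to first_child()
$sibling    = $hello->nextSibling;   // the "world" div

// Print the altered body markup.
echo $doc->saveHTML($parent), "\n";
```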

Real scraping? Easy. Here’s the project’s own Slashdot sample:

$html = file_get_html('http://slashdot.org/');
 
// Collect the title, intro and details of each article block
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
 
print_r($articles);
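
One thing a loop like this has to cope with is a block that lacks one of the expected children. The sketch below shows the same collect-into-an-array pattern using PHP’s built-in DOMXPath, with a guard for missing elements. The sample markup, the class names and the $textOf helper are all hypothetical, chosen only to mirror the structure the Slashdot sample assumes.

```php
<?php
// Hypothetical markup; the second article deliberately has no intro.
$markup = '
<div class="article">
  <div class="title">First story</div>
  <div class="intro">Intro one</div>
</div>
<div class="article">
  <div class="title">Second story</div>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Return the trimmed text of the first matching child div, or null if absent.
$textOf = function (DOMXPath $xpath, DOMNode $context, $class) {
    $nodes = $xpath->query('.//div[@class="' . $class . '"]', $context);
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
};

$articles = array();
foreach ($xpath->query('//div[@class="article"]') as $article) {
    $articles[] = array(
        'title' => $textOf($xpath, $article, 'title'),
        'intro' => $textOf($xpath, $article, 'intro'),
    );
}

print_r($articles);
```

Returning null for a missing element keeps one malformed block from aborting the whole scrape.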

And finally, there’s always a simple save mechanism:

$html->save('altered-dom.html');

Ready to get started? Head over to the project website, online documentation or the project page on SourceForge.






This post has 14 Responses so far.
  1. David Mobley Says:
    August 6th, 2008 at 11:00 am

    How do I set this up with a server running locally? I don’t understand how to install the simplehtmldom library.

  2. WC Says:
    August 6th, 2008 at 11:03 am

    Neat library.

    BTW…
    foreach($html->find('a' as $element))
    should be
    foreach($html->find('a') as $element)
    no?

  3. Matthew Turland Says:
    August 6th, 2008 at 9:07 pm

    @WC Yes.

    While I like several things about jQuery, I have to say that the syntax offers somewhat of a sharp learning curve for those not already familiar with CSS selectors. I have my own library similar to the Simple HTML DOM Parser project that offers similar functionality programmatically through the API rather than having the overhead of an expression parser in PHP userland for the sake of extreme concision in API calls. You can find my library here: http://svn.assembla.com/svn/php_domquery/trunk. The vast majority of the code is covered by unit tests and all API methods are documented using phpDoc-style docblocks.

  4. Mohsen Says:
    August 7th, 2008 at 12:42 am

    Thanks! I was looking for an HTML scraper.

  5. Akash Mehta Says:
    August 7th, 2008 at 1:39 am

    Thanks WC.

    @David: Just download the library .zip file, extract the files, locate the library .php file somewhere readable on your server and include it. Also check out the manual included in the download.

    @Matthew: Interesting point, and thanks for mentioning DOM Parser. As far as I can see, simplehtmldom is just a wrapper around the DOM APIs in PHP 5, but it’s a very high level of abstraction and it does a lot in between (such as all these complex selectors). There are some regular expressions with considerable overhead, but it’s still reasonably efficient.

  6. drew Says:
    August 8th, 2008 at 1:20 am

    mmm, tasty. i’ve been looking for something like this.

  7. Mohammad Says:
    August 8th, 2008 at 6:47 am

    Nice, I should try it.
    I used htmlsql [http://www.jonasjohn.de/lab/htmlsql.htm] and it was good: very fast, and easy to learn and write code for, since it reads like SQL: "SELECT href FROM a WHERE $class='newsLinks'".
    the only problem was that it could not parse the invalid html; for example ()

  8. Ian Says:
    August 11th, 2008 at 12:39 pm

    This looks promising. Why not try to get this into PECL or PEAR?

  9. seo Says:
    September 5th, 2008 at 4:56 am

    simple and brief intro, nice work there!

  10. Daniel Says:
    February 25th, 2009 at 10:34 am

    This is just what I need, thanks!

  11. Anar Says:
    March 21st, 2009 at 4:25 am

    Very useful article, but I have a problem.
    The webpage that I’m trying to parse has something in its script to prevent parsing. If I open it in a browser (e.g. Firefox) it shows me the page, but when I try to get the HTML with a PHP script, I get HTML that says something like "your IP is bla-bla-bla and UserAgent bla-bla-bla blocked".
    Does anybody know how to simulate a normal browser request with PHP? How can they detect a PHP request?
    Thanks in advance,

  12. TEST Says:
    March 25th, 2009 at 1:16 pm

    nice tutorial, Thanks

  13. TEST Says:
    March 25th, 2009 at 1:17 pm

    Can it be treated like some sort of re-usable component?

    Thanks for this great tip

  14. Everett Says:
    April 7th, 2009 at 8:49 am

    I can read PHP and modify it from working with a CMS like Joomla, but I cannot make this run. Can anyone point me to a tutorial on how to get this installed and running on my web host? Thanks! (running PHP 5.X & MySQL)
