Easy Screen Scraping in PHP with the Simple HTML DOM Library

Client-side developers have always had it easy: libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get messy, DOM calls can be confusing and verbose, and the string functions often just aren’t enough. In this tutorial, I’ll show you how to use the middle ground: the open source PHP Simple HTML DOM Parser library, which provides jQuery-grade awesomeness for easy screen scraping without messy regular expressions.

The Simple HTML DOM Parser is implemented as a single PHP class and a few helper functions. It supports finding elements with CSS-style selectors (as in jQuery), can handle invalid HTML, and even provides a familiar interface for manipulating a DOM.

Here’s a sample of simplehtmldom in action:

$html = file_get_html('http://www.google.com/');

foreach ($html->find('a') as $element) {
    echo $element->href . "\n";
}

This snippet is fairly self-explanatory: file_get_html() is a simple helper function in the library that fetches the page and constructs a new simplehtmldom object around it. Once the object is available, we can use simple CSS selectors to find our elements (in this case, anchors) and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is twice as long.)
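
For comparison, here is a rough sketch of the same anchor listing written against PHP’s bundled DOM extension. The sample markup and the variable names are made up for illustration, and the sketch parses an in-memory string instead of fetching a live page; for a real page you would first fetch the markup yourself, for example with file_get_contents().

```php
<?php
// Sketch: list link targets using only PHP's bundled DOM extension.
// $markup is made-up sample input standing in for a fetched page.
$markup = '<p><a href="/one">One</a> <a href="/two">Two</a> <a>No href</a></p>';

// Collect parser warnings internally instead of printing them;
// real-world pages often contain markup that libxml complains about.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // getAttribute() returns an empty string when the attribute is absent.
    $hrefs[] = $anchor->getAttribute('href');
}

echo implode("\n", $hrefs), "\n";
```

Even for this small task you have to create the document, load the markup, manage parser warnings and read attributes through method calls, which is the overhead the library hides behind find().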

But the library doesn’t stop there – as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

The library supports many DOM-style approaches for manipulation, from exposing real attributes as shown here, to a few helper methods. It also includes other methods to traverse the current node – children(), parent(), first_child() and so on.
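
To see what the library is saving you, here is a hedged sketch of the same two edits made with PHP’s bundled DOM classes instead. The markup is assumed sample input (two div elements, the first with id "hello"), and the variable names are only illustrative. It also shows parentNode, the standard DOM counterpart of the library’s parent().

```php
<?php
// Sketch: set a class and replace text using PHP's bundled DOM extension.
// $markup is assumed sample input, not taken from a real page.
$markup = '<div id="hello">Hello</div><div id="world">World</div>';

$doc = new DOMDocument();
$doc->loadHTML($markup);

$divs = $doc->getElementsByTagName('div');

// item(1) is the second <div>: give it a class attribute.
$divs->item(1)->setAttribute('class', 'bar');

// item(0) is the first <div>: replace its text content.
$divs->item(0)->nodeValue = 'foo';

// Traversal uses properties such as parentNode, firstChild and childNodes.
// loadHTML() wraps fragments in <html><body>, so the parent here is <body>.
$parentTag = $divs->item(0)->parentNode->nodeName;

// Serialize only the two <div> elements, not the wrapper added by loadHTML().
$result = '';
foreach ($divs as $div) {
    $result .= $doc->saveHTML($div);
}
echo $result, "\n";
```

The result is the same, but attributes go through setAttribute(), elements are picked by index rather than by selector, and the html and body wrapper has to be worked around when writing the markup back out.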

Real scraping? Easy. Here’s the project’s Slashdot sample:

$html = file_get_html('http://slashdot.org/');

$articles = array();
foreach ($html->find('div.article') as $article) {
    // Start each article with a fresh array so fields never leak between iterations.
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

And finally, there’s always a simple save mechanism:

$html->save('altered-dom.html');

Ready to get started? Head over to the project website, online documentation or the project page on SourceForge.

  • David Mobley

    How do I set this up with a server running locally? I don’t understand how to install the simplehtmldom library.

  • WC

    Neat library.

    BTW…
    foreach($html->find('a' as $element))
    should be
    foreach($html->find('a') as $element)
    no?

  • http://ishouldbecoding.com Matthew Turland

    @WC Yes.

    While I like several things about jQuery, I have to say that the syntax offers somewhat of a sharp learning curve for those not already familiar with CSS selectors. I have my own library similar to the Simple HTML DOM Parser project that offers similar functionality programmatically through the API rather than having the overhead of an expression parser in PHP userland for the sake of extreme concision in API calls. You can find my library here: http://svn.assembla.com/svn/php_domquery/trunk. The vast majority of the code is covered by unit tests and all API methods are documented using phpDoc-style docblocks.

  • http://acomment.net Mohsen

    Thanks! I was looking for an HTML scraper.

  • http://bitmeta.org/ Akash Mehta

    Thanks WC.

    @David: Just download the library .zip file, extract the files, locate the library .php file somewhere readable on your server and include it. Also check out the manual included in the download.

    @Matthew: Interesting point, and thanks for mentioning DOM Parser. As far as I can see, simplehtmldom is just a wrapper around the DOM APIs in PHP 5, but it’s a very high level of abstraction and it does a lot in between (such as all these complex selectors). There are some regular expressions with considerable overhead, but it’s still reasonably efficient.

  • drew

    mmm, tasty. i’ve been looking for something like this.

  • Mohammad

    Nice, I should try it.
    I used htmlsql [http://www.jonasjohn.de/lab/htmlsql.htm] and it was good: very fast, and easy to learn and write code for. It reads like SQL: "SELECT href FROM a WHERE $class='newsLinks'".
    The only problem was that it could not parse invalid HTML; for example ()

  • http://isnoop.net Ian

    This looks promising. Why not try to get this into PECL or PEAR?

  • http://www.davidtan.org seo

    simple and brief intro, nice work there!

  • http://www.kiboke-studio.hr/ Daniel

    This is just what I need, thanks!

  • Anar

    Very useful article, but I have a problem.
    The webpage that I’m trying to parse has something in its script to prevent parsing. If I open it in a browser (e.g. Firefox) it shows me the page, but when I try to get the HTML with a PHP script, I get HTML that says something like “your IP is bla-bla-bla and UserAgent bla-bla-bla blocked”.
    Does anybody know how to simulate a normal browser request with PHP? How can they detect a PHP request?
    Thanks in advance,

  • http://www.interview-questions-tips-forum.net TEST

    nice tutorial, Thanks

  • http://www.interview-questions-tips-forum.net TEST

    Can it be treated like some sort of re-usable component?

    Thanks for this great tip

  • Everett

    I can read PHP and modify it from working with CMS like Joomla. But I cannot make this run. Can anyone point me to a tutorial on how to get this to run (get installed) on my web host? Thanks! (running PHP 5.X & MySQL)

  • Brian

    What’s the advantage of using this library over the DOM built into PHP?

  • Bruce

    Although a little late, hopefully Anar might make it back and see this post. If you’re having problems with being blocked because of the UserAgent, you could use curl if your server has it to pull the HTML. curl offers params that allow you to set the UserAgent, set just about any header you want, pass cookies back and forth, follow redirects, etc.

    This class is fantastic. It helped me out instantly on my test and from the first pass, it looks like it’s very easy to use.

  • phpfan

    Is there any way to get the HTML source of a DOM that is generated by Ajax?
    {ajax function goes here
    document.getElementById('div_id').innerHTML = something;
    }

    html here

    div id="div_id"(HERE IS WHAT I WANT TO GET)

    php goes here
    need to do something here to get (HERE IS WHAT I WANT TO GET)….

  • http://webdlabs.com Akshay

    I have created a WordPress plugin using phpQuery and cURL that places scraped web content inside your WordPress posts, pages or sidebar. Check http://webdlabs.com/projects/wp-web-scraper/ for a demo or download the plugin from http://wordpress.org/extend/plugins/wp-web-scrapper

  • streetparade

    Is there also a function to get the largest block of text from the HTML body?

  • http://657484 Chea leb

    /**
    * An example of using remote scraping sessions.
    *
    * @author Brent Wenerstrom1
    */

    using System;
    using System.Collections;
    using Screenscraper;

    public class DataSetRemoteScrapingSessionExample
    {
        /**
        * The entry point.
        */
        public static void Main( string[] args )
        {
            try
            {
                // Create a remote session to communicate with the server.
                RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Slashdot" );

                // Scrape.
                remoteSession.Scrape();

                // Get the data.
                // If the returned value is null an exception will be thrown later.
                DataSet dataSet = ( DataSet )remoteSession.GetVariable( "DATA SET" );

                // Temporarily holds data records.
                DataRecord templateDataRecord = null;

                Console.WriteLine( "============================================================================" );

                // Iterate through the data records.
                for( IEnumerator iterator = dataSet.AllDataRecords.GetEnumerator(); iterator.MoveNext(); )
                {
                    templateDataRecord = ( DataRecord )iterator.Current;

                    // Enumerate through the data record, outputting each of the fields.
                    for( IDictionaryEnumerator en = templateDataRecord.GetEnumerator(); en.MoveNext(); )
                    {
                        Console.WriteLine( en.Key + "=" + en.Value );
                    }
                    Console.WriteLine( "============================================================================" );
                }

                // Very important! Be sure to disconnect from the server.
                remoteSession.Disconnect();
            }
            catch( Exception e )
            {
                Console.WriteLine( e );
            }
        }
    }