Easy Screen Scraping in PHP with the Simple HTML DOM Library

By Akash Mehta | on Aug 6, 2008 | 20 Comments
PHP Tutorials

Client-side developers have always had it easy: libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get messy, DOM calls can be confusing and verbose, and the string functions often just aren’t enough. In this tutorial, I’ll show you a middle ground: the open source PHP Simple HTML DOM Parser library, which provides jQuery-grade awesomeness for easy screen scraping without messy regular expressions.

The Simple HTML DOM Parser is implemented as a simple PHP class and a few helper functions. It supports CSS selector style screen scraping (such as in jQuery), can handle invalid HTML, and even provides a familiar interface to manipulate a DOM.

Here’s a sample of simplehtmldom in action:

// Assumes the library file (simple_html_dom.php) sits next to this script.
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');

foreach ($html->find('a') as $element) {
    echo $element->href . "\n";
}

This snippet is fairly self-explanatory: file_get_html() is a helper function in the library that fetches the page and constructs a new simplehtmldom object around it. Once the object is available, we can use simple CSS selectors to find our elements (in this case, anchors) and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is twice as long.)
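
For comparison, here is roughly what the same link extraction looks like with PHP’s bundled DOM extension. This is only a sketch and does not come from the library: the extract_links() helper and the inline sample markup are made up for illustration, and a real script would fetch the page first.

```php
<?php
// Hypothetical helper: collect the href of every anchor using only
// PHP's built-in DOM extension (no Simple HTML DOM involved).
function extract_links(string $markup): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip anchors that have no href attribute at all.
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Made-up sample markup standing in for a fetched page.
$markup = '<p><a href="/one">One</a> <a href="/two">Two</a> <a name="x">No link</a></p>';

foreach (extract_links($markup) as $href) {
    echo $href, "\n";
}
```

Run as-is, this prints /one and /two on separate lines. The extra setup around loadHTML() is the kind of ceremony the library’s one-line find('a') call hides.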

But the library doesn’t stop there – as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

The library supports many DOM-style approaches to manipulation, from exposing element attributes as object properties, as shown here, to a few helper methods. It also includes methods for moving around from the current node: children(), parent(), first_child() and so on.
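
To see what those shortcuts stand in for, the sketch below performs the same kind of edit and a little traversal with PHP’s bundled DOM extension rather than with Simple HTML DOM. The markup is made up, and the built-in parentNode, firstChild and childNodes properties are only rough counterparts of the library’s parent(), first_child() and children() methods.

```php
<?php
// Sketch using PHP's built-in DOM extension; the markup is made up.
$doc = new DOMDocument();
$doc->loadHTML('<div id="hello">Hello</div><div id="world">World</div>');

$divs = $doc->getElementsByTagName('div');

// Change an attribute on the second div and the text of the first.
$divs->item(1)->setAttribute('class', 'bar');
$divs->item(0)->nodeValue = 'foo';

// Traversal: loadHTML() wraps the fragment in <html><body>,
// so the parent of each div is the <body> element.
$first  = $divs->item(0);
$parent = $first->parentNode;
echo $parent->nodeName, "\n";                        // body
echo $parent->firstChild->getAttribute('id'), "\n";  // hello

// Print each child of <body> as an HTML string.
foreach ($parent->childNodes as $child) {
    echo $doc->saveHTML($child), "\n";
}
```

The loop prints the two div elements with the changes applied. Each step is explicit here, whereas the library lets you assign to ->class and ->innertext directly on the result of find().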

Real scraping? Easy. Here’s their Slashdot sample:

$html = file_get_html('http://slashdot.org/');

// Collect one entry per article block on the page.
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

And finally, there’s always a simple save mechanism:

$html->save('altered-dom.html');

Ready to get started? Head over to the project website, online documentation or the project page on SourceForge.

20 Responses to “Easy Screen Scraping in PHP with the Simple HTML DOM Library”

  1. August 6, 2008

    David Mobley

    How do I set this up with a server running locally? I don’t understand how to install the simplehtmldom library.

  2. August 6, 2008

    WC

    Neat library.

    BTW…
    foreach($html->find('a' as $element))
    should be
    foreach($html->find('a') as $element)
    no?

  3. August 6, 2008

    Matthew Turland

    @WC Yes.

    While I like several things about jQuery, I have to say that the syntax offers somewhat of a sharp learning curve for those not already familiar with CSS selectors. I have my own library similar to the Simple HTML DOM Parser project that offers similar functionality programmatically through the API rather than having the overhead of an expression parser in PHP userland for the sake of extreme concision in API calls. You can find my library here: http://svn.assembla.com/svn/php_domquery/trunk. The vast majority of the code is covered by unit tests and all API methods are documented using phpDoc-style docblocks.

  4. August 7, 2008

    Mohsen

    Thanks! I was looking for an HTML scraper.

  5. August 7, 2008

    Akash Mehta

    Thanks WC.

    @David: Just download the library .zip file, extract the files, place the library .php file somewhere readable on your server and include it. Also check out the manual included in the download.

    @Matthew: Interesting point, and thanks for mentioning DOM Parser. As far as I can see, simplehtmldom is just a wrapper around the DOM APIs in PHP 5, but it’s a very high level of abstraction and it does a lot in between (such as all these complex selectors). There are some regular expressions with considerable overhead, but it’s still reasonably efficient.

  6. August 8, 2008

    drew

    mmm, tasty. i’ve been looking for something like this.

  7. August 8, 2008

    Mohammad

    Nice, I should try it.
    I used htmlsql [http://www.jonasjohn.de/lab/htmlsql.htm] and it was good: very fast, and easy to learn and write code with, because it reads like SQL: "SELECT href FROM a WHERE $class='newsLinks'".
    the only problem was that it could not parse the invalid html; for example ()

  8. August 11, 2008

    Ian

    This looks promising. Why not try to get this into PECL or PEAR?

  9. September 5, 2008

    seo

    simple and brief intro, nice work there!

  10. February 25, 2009

    Daniel

    This is just what I need, thanks!

  11. March 21, 2009

    Anar

    Very useful article, but I have a problem.
    The webpage that I’m trying to parse has something in its script to prevent parsing. If I open it in a browser (e.g. Firefox) it shows me the page, but when I try to get the HTML with a PHP script, I get HTML that says something like (“your IP is bla-bla-bla and UserAgent bla-bla-bla blocked”)
    Does anybody know how to simulate a normal browser request with PHP? How can they detect a PHP request?
    Thanks in advance,

  12. March 25, 2009

    TEST

    nice tutorial, Thanks

  13. March 25, 2009

    TEST

    Can it be treated like some sort of re-usable component?

    Thanks for this great tip

  14. April 7, 2009

    Everett

    I can read PHP and modify it from working with CMS like Joomla. But I cannot make this run. Can anyone point me to a tutorial on how to get this to run (get installed) on my web host? Thanks! (running PHP 5.X & MySQL)

  15. April 17, 2009

    Brian

    What’s the advantage of using this library over the DOM built into PHP?

  16. May 1, 2009

    Bruce

    Although a little late, hopefully Anar might make it back and see this post. If you’re having problems with being blocked because of the UserAgent, you could use curl if your server has it to pull the HTML. curl offers params that allow you to set the UserAgent, set just about any header you want, pass cookies back and forth, follow redirects, etc.

    This class is fantastic. It helped me out instantly on my test and from the first pass, it looks like it’s very easy to use.

  17. May 2, 2009

    phpfan

    Is there any way to get the HTML source of a DOM that is generated by AJAX?
    {ajax function goes here
    document.getElementById('div_id').innerHTML = something;
    }

    html here

    div id="div_id"(HERE IS WHAT I WANT TO GET)

    php goes here
    need to do something here to get (HERE IS WHAT I WANT TO GET)….

  18. August 4, 2009

    Akshay

    I have created a WordPress plugin using phpQuery and cURL to place web scrapes inside your WordPress posts, pages or sidebar. Check http://webdlabs.com/projects/wp-web-scraper/ for a demo or download the plugin from http://wordpress.org/extend/plugins/wp-web-scrapper

  19. August 17, 2009

    streetparade

    Is there also a function to get the largest block of text from the HTML body?

  20. January 31, 2010

    Chea leb

    /**
    * An example of using remote scraping sessions.
    *
    * @author Brent Wenerstrom1
    */

    using System;
    using System.Collections;
    using Screenscraper;

    public class DataSetRemoteScrapingSessionExample
    {
        /**
        * The entry point.
        */
        public static void Main( string[] args )
        {
            try
            {
                // Create a remoteSession to communicate with the server.
                RemoteScrapingSession remoteSession = new RemoteScrapingSession( "Slashdot" );

                // Scrape.
                remoteSession.Scrape();

                // Get the data.
                // If the returned value is null an exception will be thrown later.
                DataSet dataSet = ( DataSet )remoteSession.GetVariable( "DATA SET" );

                // Temporarily holds data records.
                DataRecord templateDataRecord = null;

                Console.WriteLine( "============================================================================" );

                // Iterate through the data records.
                for( IEnumerator iterator = dataSet.AllDataRecords.GetEnumerator(); iterator.MoveNext(); )
                {
                    templateDataRecord = ( DataRecord )iterator.Current;

                    // Enumerate through the data record, outputting each of the fields.
                    for( IDictionaryEnumerator en = templateDataRecord.GetEnumerator(); en.MoveNext(); )
                    {
                        Console.WriteLine( en.Key + "=" + en.Value );
                    }
                    Console.WriteLine( "============================================================================" );
                }

                // Very important! Be sure to disconnect from the server.
                remoteSession.Disconnect();
            }
            catch( Exception e )
            {
                Console.WriteLine( e );
            }
        }
    }
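
Comment 16 above suggests fetching the page with curl so that the User-Agent header can be set. A minimal sketch of that idea follows, assuming PHP’s curl extension is installed. The fetch_html_with_user_agent() helper, the 15-second timeout, and the URL and User-Agent string in the usage lines are all made up for illustration.

```php
<?php
// Hypothetical helper: fetch a page with cURL while sending a chosen
// User-Agent header. Requires PHP's curl extension.
function fetch_html_with_user_agent(string $url, string $userAgent): string
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_USERAGENT      => $userAgent, // header the remote site inspects
        CURLOPT_TIMEOUT        => 15,         // seconds; an arbitrary choice
    ]);

    $body = curl_exec($handle);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    return $body;
}

// Usage (placeholder URL and User-Agent string):
// $markup = fetch_html_with_user_agent('http://example.com/', 'Mozilla/5.0 (compatible; ExampleBot/1.0)');
// $html   = str_get_html($markup); // then query it with Simple HTML DOM
```

Because the helper returns the markup as a string, it pairs with str_get_html() rather than file_get_html(), which does its own fetching.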

    © 2003 - 2013 DeveloperTutorials.com. All Rights Reserved. Privacy Policy.