Easy Screen Scraping in PHP with the Simple HTML DOM Library
Client-side developers have always had it easy – libraries such as jQuery and Prototype make finding elements on the page reliable and efficient. In PHP, regular expressions tend to get messy, DOM calls can be confusing and verbose, and the string functions often just aren’t enough. In this tutorial, I’ll show you the middle ground – the open source PHP Simple HTML DOM Parser library, which provides jQuery-grade awesomeness for easy screen scraping without messy regular expressions.
The Simple HTML DOM Parser is implemented as a single PHP class and a few helper functions. It supports CSS-selector-style element lookup (as in jQuery), can handle invalid HTML, and even provides a familiar interface for manipulating a DOM.
Here’s a sample of simplehtmldom in action:
$html = file_get_html('http://www.google.com/');
foreach ($html->find('a') as $element)
    echo $element->href;
This snippet is fairly self-explanatory – file_get_html() is a simple helper function in the library that fetches the page and constructs a new simplehtmldom object around it. Once the object is available, we can use simple CSS selectors to find our elements – in this case, anchors – and iterate over them just as we would with PHP 5’s standard DOM classes. (The equivalent code with the standard DOM classes is twice as long.)
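To put that parenthetical in perspective, here is a rough sketch of the same link extraction using PHP’s built-in DOMDocument class. The inline $markup string is a stand-in I’ve assumed so the example runs without a network call; in practice you would fetch the page first.

```php
<?php
// Sample markup standing in for a fetched page (an assumption for illustration).
$markup = '<html><body><p><a href="/about">About</a> '
        . '<a href="https://example.com/">Example</a></p></body></html>';

// Collect libxml parse warnings internally instead of printing them,
// since real-world HTML is rarely valid.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);

// Discard any collected warnings and restore the previous error mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Gather the href attribute of every anchor element.
$hrefs = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

echo implode("\n", $hrefs), "\n";
```

Note that there is no CSS selector support here: getElementsByTagName() only matches by tag name, and anything fancier means reaching for DOMXPath and writing XPath queries.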
But the library doesn’t stop there – as well as traversing the DOM and extracting information, you can also alter it. Consider this snippet:
$html = str_get_html('
<div id="hello">Hello</div>
<div id="world">World</div>
');
$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';
The library supports many DOM-style approaches for manipulation, from exposing real attributes as shown here, to a few helper methods. It also includes other methods to traverse the current node – children(), parent(), first_child() and so on.
Real scraping? Easy. Here’s their Slashdot sample:
$html = file_get_html('http://slashdot.org/');
$articles = array();
foreach ($html->find('div.article') as $article) {
    // Pull the plain text of each part of the article block
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
And finally, there’s always a simple save mechanism:
$html->save('altered-dom.html');
Ready to get started? Head over to the project website, online documentation or the project page on SourceForge.






August 6th, 2008 at 11:00 am
How do I set this up on a server running locally? I don’t understand how to install the simplehtmldom library.
August 6th, 2008 at 11:03 am
Neat library.
BTW…
foreach($html->find('a' as $element))
should be
foreach($html->find('a') as $element)
no?
August 6th, 2008 at 9:07 pm
@WC Yes.
While I like several things about jQuery, I have to say that the syntax has something of a steep learning curve for those not already familiar with CSS selectors. I have my own library, similar to the Simple HTML DOM Parser project, that offers comparable functionality programmatically through the API rather than incurring the overhead of an expression parser in PHP userland for the sake of extreme concision in API calls. You can find my library here: http://svn.assembla.com/svn/php_domquery/trunk. The vast majority of the code is covered by unit tests and all API methods are documented using phpDoc-style docblocks.
August 7th, 2008 at 12:42 am
Thanks! I was looking for an HTML scraper.
August 7th, 2008 at 1:39 am
Thanks WC.
@David: Just download the library .zip file, extract the files, place the library .php file somewhere readable on your server and include it. Also check out the manual included in the download.
@Matthew: Interesting point, and thanks for mentioning DOM Parser. As far as I can see, simplehtmldom is just a wrapper around the DOM APIs in PHP 5, but it’s a very high level of abstraction and it does a lot in between (such as all these complex selectors). There are some regular expressions with considerable overhead, but it’s still reasonably efficient.
August 8th, 2008 at 1:20 am
mmm, tasty. i’ve been looking for something like this.
August 8th, 2008 at 6:47 am
Nice, I should try it.
I used htmlsql [http://www.jonasjohn.de/lab/htmlsql.htm] and it was good: very fast, and easy to learn and write code for, with SQL-like queries such as "SELECT href FROM a WHERE $class='newsLinks'".
The only problem was that it could not parse invalid HTML; for example ()
August 11th, 2008 at 12:39 pm
This looks promising. Why not try to get this into PECL or PEAR?
September 5th, 2008 at 4:56 am
simple and brief intro, nice work there!
February 25th, 2009 at 10:34 am
This is just what I need, thanks!
March 21st, 2009 at 4:25 am
Very useful article, but I have a problem.
The webpage that I’m trying to parse has something in its script to prevent parsing. If I open it in a browser (e.g. Firefox) it shows me the page, but when I try to get the HTML with a PHP script, I get HTML that says something like “your IP is bla-bla-bla and UserAgent bla-bla-bla blocked”.
Does anybody know how to simulate a normal browser request with PHP? How can they detect a PHP request?
Thanks in advance,
March 25th, 2009 at 1:16 pm
nice tutorial, Thanks
March 25th, 2009 at 1:17 pm
Can it be treated like some sort of re-usable component?
Thanks for this great tip
April 7th, 2009 at 8:49 am
I can read PHP and modify it from working with CMSs like Joomla, but I cannot make this run. Can anyone point me to a tutorial on how to get this installed and running on my web host? Thanks! (running PHP 5.X & MySQL)
April 17th, 2009 at 11:55 pm
What’s the advantage of using this library over the DOM built into PHP?
August 4th, 2009 at 2:23 pm
I have created a WordPress plugin using phpQuery and cURL to place web scraps inside your WordPress posts, pages or sidebar. Check http://webdlabs.com/projects/wp-web-scraper/ for a demo or download the plugin from http://wordpress.org/extend/plugins/wp-web-scrapper
August 17th, 2009 at 2:33 pm
Is there also a function to get the largest block of text from the HTML body?