
Parallel web scraping in PHP: cURL multi functions

by Akash Mehta


Anyone who has tried to fetch multiple resources over HTTP in PHP knows that the logic is trivial but one challenge is ever-present: latency. Web servers usually have plenty of downstream bandwidth, yet the round-trip latency of downloading even a few external URLs can increase script execution time tenfold. There’s a simple solution: parallel cURL operations. In this tutorial, I’ll show you how to use the “multi” functions in PHP’s cURL library to get around the problem quickly and easily.

Caching alleviates the latency issue to some extent, but retrieving more than a few files is always going to be a problem, and sometimes users just can’t wait. cURL’s parallel processing lets you fire off multiple requests at once and handle responses as they arrive, instead of working linearly and waiting for each request to complete (or worse, time out) before starting the next.

Consider this basic cURL example:

<?php
$ch = curl_init();
 
curl_setopt($ch, CURLOPT_URL, "http://example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
 
$data = curl_exec($ch);
 
curl_close($ch);
?>

This fetches http://example.com/ and puts the response body (the HTML) in $data. To do this for several URLs, we could wrap the code block in a simple loop. With that approach, however, script execution time grows linearly: the total is roughly the sum of the latencies of every request, and 10 requests at 50-100ms each add 0.5-1 second to a script that spends under 10ms executing its own PHP code.
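
For reference, the loop-based approach might look like the sketch below. The function name fetch_sequential() is made up for illustration, and the code assumes PHP 7 or later; it is a baseline to compare against, not a recommendation.

```php
<?php
// Hypothetical helper (not from the article): fetch each URL one after
// another and return the bodies, keyed the same way as the input array.
function fetch_sequential(array $urls): array
{
    $results = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

        // curl_exec() blocks here until this one transfer finishes or fails,
        // so the waiting time of every request adds up.
        $results[$key] = curl_exec($ch);

        // The handle is released when $ch is reassigned or goes out of scope.
    }
    return $results;
}
```

Calling fetch_sequential(array("http://example.com/", "http://example2.com/")) returns an array of response bodies, with false in place of any request that failed.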

Instead, we’ll use cURL’s parallel processing system. This requires a bit of a context shift: instead of running each operation yourself, you tell cURL about all the operations to run, let it do its stuff, and continue once it has finished. The difference is that it doesn’t wait for each request; it runs them all simultaneously (network permitting). Here’s a basic example:

<?php
// Create two cURL handles
$ch1 = curl_init();
$ch2 = curl_init();
 
// Set options on both
curl_setopt($ch1, CURLOPT_URL, "http://example.com/");
curl_setopt($ch2, CURLOPT_URL, "http://example2.com/");
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, 1);
 
$mh = curl_multi_init();
curl_multi_add_handle($mh,$ch1);
curl_multi_add_handle($mh,$ch2);
 
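$running = null;
do {
    curl_multi_exec($mh, $running);
    // Wait for network activity instead of spinning the CPU in a tight loop
    if ($running > 0) {
        curl_multi_select($mh);
    }
} while ($running > 0);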
 
$data1 = curl_multi_getcontent($ch1);
$data2 = curl_multi_getcontent($ch2);
 
// curl_multi_remove_handle() takes the multi handle first, then the easy handle
curl_multi_remove_handle($mh, $ch1);
curl_multi_remove_handle($mh, $ch2);
curl_multi_close($mh);
?>

Here, we first create a cURL handle for each request we want to make (an array of these is perfectly acceptable) and set the options on each. Instead of calling curl_exec() on each handle, we then call curl_multi_init() to create a multi handle and register each of our handles with it using curl_multi_add_handle(). At this point we cede control to cURL: curl_multi_exec() runs all the handles attached to the multi handle, that is, $ch1 and $ch2.
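
Since an array of handles works just as well as numbered variables, the same routine can be driven from a list of URLs. The following is a minimal sketch under that assumption; fetch_parallel() is a hypothetical name, and the one-second wait passed to curl_multi_select() is an arbitrary choice.

```php
<?php
// Hypothetical helper (not from the article): fetch all URLs in parallel
// and return the bodies, keyed the same way as the input array.
function fetch_parallel(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    // One easy handle per URL, all registered with the same multi handle
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are left running
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            // Sleep until there is network activity (or one second passes)
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0);

    // Collect the bodies and detach the handles
    $results = [];
    foreach ($handles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $results;
}
```

Keying the result array like the input array means the caller never has to track which handle belonged to which URL.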

The curl_multi_exec() function takes a second parameter, passed by reference, which it sets to the number of transfers still running. When that parameter (for us, $running) reaches 0, cURL has finished all the requests and we can proceed. Timeouts are the main concern: remember to set CURLOPT_TIMEOUT (and, ideally, CURLOPT_CONNECTTIMEOUT) to reasonable values to avoid being held up by requests that won’t complete.
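
To make the timeout advice concrete, here is a sketch that sets both timeout options and uses curl_multi_info_read() to pick up each transfer as soon as it finishes, including whether it failed. The function name fetch_parallel_checked(), the shape of the result array, and the 5-second connect timeout are assumptions made for this example.

```php
<?php
// Hypothetical helper (not from the article): fetch URLs in parallel with
// timeouts, recording each result as soon as its transfer completes.
function fetch_parallel_checked(array $urls, int $timeout = 10): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);   // assumed value
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);   // whole-transfer limit
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    $results = [];
    $running = null;
    do {
        curl_multi_exec($mh, $running);

        // Each finished transfer leaves one message on the multi handle
        while (($info = curl_multi_info_read($mh)) !== false) {
            $ch  = $info['handle'];
            $key = array_search($ch, $handles, true);
            $ok  = ($info['result'] === CURLE_OK);
            $results[$key] = [
                'body'  => $ok ? curl_multi_getcontent($ch) : null,
                'error' => $ok ? null : curl_strerror($info['result']),
            ];
            curl_multi_remove_handle($mh, $ch);
        }

        if ($running > 0) {
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0);

    curl_multi_close($mh);
    return $results;
}
```

A request that times out appears in the result with a null body and an error string, so one slow server no longer decides the fate of the whole batch.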

As the number of requests grows, however, the next bottleneck you will hit is memory. Given PHP’s memory_limit setting (especially on shared servers), you can hit it quicker than you think. With CURLOPT_RETURNTRANSFER, cURL doesn’t hand you a response in pieces. Ever seen fread($handle, 8192)? The second parameter makes it read in 8KB chunks, avoiding memory limits elegantly. cURL, by contrast, collects every response in memory in full: if you hit your memory limit, you get a fatal error, which with error display turned off shows up as a “white screen of death”. For large responses, consider having cURL write each body straight to a file with CURLOPT_FILE, or parallel cURL-ing in a background process. Also, these simple routines can get quite lengthy; consider building an abstraction layer for your app or framework that maintains an array of cURL handles and gives the rest of your code a simple interface for making requests.
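
One way to keep large responses out of PHP’s memory is to have cURL write each body directly to a file through the CURLOPT_FILE option. The sketch below assumes PHP 7.1 or later and writable destination paths; download_parallel() is a hypothetical name, and error handling (a failed fopen(), a failed transfer) is left out for brevity.

```php
<?php
// Hypothetical helper (not from the article): download URLs in parallel,
// writing each body to a local file instead of holding it in memory.
// $targets maps each URL to the path it should be saved to.
function download_parallel(array $targets): void
{
    $mh = curl_multi_init();
    $jobs = [];
    foreach ($targets as $url => $path) {
        $fp = fopen($path, 'wb');
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        // No CURLOPT_RETURNTRANSFER here: cURL writes to $fp as data arrives
        curl_setopt($ch, CURLOPT_FILE, $fp);
        curl_multi_add_handle($mh, $ch);
        $jobs[] = [$ch, $fp];
    }

    $running = null;
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0);
        }
    } while ($running > 0);

    // Detach every handle, then close its file so the data is flushed
    foreach ($jobs as [$ch, $fp]) {
        curl_multi_remove_handle($mh, $ch);
        fclose($fp);
    }
    curl_multi_close($mh);
}
```

The saved files can then be processed one at a time with fread() in small chunks, so memory use stays flat however many requests run in parallel.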

Parallel HTTP requests in cURL are quick and easy, even though the code is structured differently from one-at-a-time cURL calls. To learn more, check out the detailed documentation on the PHP manual page.






This post has 6 Responses so far.
  1. rogue Says:
    July 29th, 2008 at 8:23 am

    very nice. have been thinking about doing something like this in an app i’m working on. thanks.

  2. Kasper Garnæs Says:
    July 29th, 2008 at 2:24 pm

    Great article!

    It inspired me to modify my code to use multi cURL instead of cURL for a set of Google API calls.

    Results from a sample of 75 calls (average of 10 runs):
    - cURL: 3.7 secs
    - multi cURL: 1.8 secs

    A couple of minor corrections:
    - curl_multi_get_content() should be curl_multi_getcontent()
    - curl_multi_remove_handle($ch1) should be curl_multi_remove_handle($mh, $ch1);

  3. Sun Location Says:
    October 9th, 2008 at 6:54 am

    Does multi cURL have a limit on the number of calls?

  4. Josh Fraser Says:
    January 26th, 2009 at 8:17 pm

    Thanks for sharing this example. I made some modifications so that you can process each request as soon as it completes. It makes things a lot faster when you’re dealing with a large number of requests:

    http://onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/

  5. Eric Nagel Says:
    March 3rd, 2009 at 3:02 pm

    Wow – awesome! Just what I needed! I took your example & made it run off an array of data, using variable-variables to handle 1 to ??? instances!

    Whereas 2 queries took 14 seconds before, now 11 takes 6.7 seconds.

    Like Kasper said, curl_multi_remove_handle needs $mh passed to it, too.

  6. Torrent Search Says:
    March 4th, 2009 at 8:31 pm

    Great tutorial!
