Helping ordinary people create extraordinary websites!
GET OUR NEWSLETTER
Your Email:
 

Scraping Links With PHP

By Justin Laing
2008-01-06


Get The Page Content

cURL is a great tool for making requests to remote servers in PHP. It can imitate a browser in pretty much every way. Here’s the code to grab our target site content:

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
echo "
cURL error number:" .curl_errno($ch);
echo "
cURL error:" . curl_error($ch);
exit;
}

If the request is successful $html will be filled with the content of $target_url. If the call fails then we’ll see an error message about the failure.

curl_setopt($ch, CURLOPT_URL,$target_url);

This line determines what URL will be requested. For example if you wanted to scrape this site you’d have $target_url = “http://www.merchantos.com/makebeta/”. I won’t go into the rest of the options that are set (except for CURLOPT_USERAGENT - see below). You can read an in depth tutorial on PHP and cURL here.



Tutorial Pages:
» Scraping Links With PHP
» Get The Page Content
» Tip: Fake Your User Agent
» Using PHP’s DOM Functions To Parse The HTML
» XPath Makes Getting The Links You Want Easy
» Iterate And Store Your Links
» Your Completed Link Scraper
» What Else Could I Do With This Thing?
» Is Scraping Content Legal?


Originally posted on Makebeta


 | Bookmark
Related Tutorials:
» Port Scanning and Service Status Checking in PHP
» Web Database Access from Desktop Applications
» CubeCart 3.0 Installation and Configuration
» PHP Site Search Made Easy
» Installing and Configuring Drupal 6.1
» Desktop Application Development with PHP-GTK

Advertise with Us!


Tutorials Scripts Web Hosting Developer Manuals
Resources