
Scraping Links With PHP

By Justin Laing
2008-01-06


Tip: Fake Your User Agent

Many websites won’t play nice with you if you come knocking with the wrong User Agent string. What’s a User Agent string? It’s a header that browsers and spiders send with their requests to a web server, telling the server what type of agent (browser, spider, etc.) is asking for the content. Some websites serve different content depending on the user agent, so you might want to experiment. In cURL, you set it by calling curl_setopt() with CURLOPT_USERAGENT as the option:

$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);

This sets cURL’s user agent to mimic Googlebot, Google’s crawler. Some common user agent strings are listed below.
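To show where that call sits in a complete request, here is a minimal sketch. The helper names buildCurlOptions() and fetchAs(), the $userAgents lookup table, the 10-second timeout, and the example.com URL are all assumptions made for illustration; only the cURL functions and CURLOPT_* constants come from PHP itself. It assumes the cURL extension is loaded.

```php
<?php
// Hypothetical lookup table, filled with user agent strings from this tutorial.
$userAgents = array(
    'googlebot' => 'Googlebot/2.1 (http://www.googlebot.com/bot.html)',
    'firefox'   => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6',
    'opera'     => 'Opera/9.00 (Windows NT 5.1; U; en)',
);

// Hypothetical helper: collect the cURL options for one request.
function buildCurlOptions($url, $userAgent)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $userAgent, // the string the server will see
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,         // assumed limit, in seconds
    );
}

// Hypothetical helper: fetch a page while identifying as $userAgent.
// Returns the page body as a string, or false on failure.
function fetchAs($url, $userAgent)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildCurlOptions($url, $userAgent));
    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}

// Example call (needs network access):
// $html = fetchAs('http://www.example.com/', $userAgents['firefox']);
```

Keeping the options in one array makes it easy to swap the user agent between requests without touching the rest of the setup.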

Search Engine User Agents

  • Google - Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • Google Image - Googlebot-Image/1.0 (+http://www.googlebot.com/bot.html)
  • MSN Live - msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)
  • Yahoo - Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
  • ask

Browser User Agents

  • Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • IE 7 - Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
  • IE 6 - Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
  • Safari - Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) AppleWebKit/522.11 (KHTML, like Gecko) Safari/3.0.2
  • Opera - Opera/9.00 (Windows NT 5.1; U; en)
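With strings like these in hand, the experiment of checking whether a site serves different content to different agents could be sketched roughly as follows. The function names compareUserAgents() and curlFetch() are hypothetical, and grouping responses by an MD5 hash of the body is just one simple way to compare them. The fetcher is passed in as a callable so the comparison logic can be tried without a network connection.

```php
<?php
// Hypothetical helper: group user agents by the content they receive.
// $fetch is any callable that takes ($url, $userAgent) and returns the
// page body as a string, or false on failure.
function compareUserAgents($url, array $userAgents, $fetch)
{
    $groups = array();
    foreach ($userAgents as $label => $userAgent) {
        $body = call_user_func($fetch, $url, $userAgent);
        // Identical bodies share a hash; failed requests get their own bucket.
        $key = ($body === false) ? 'failed' : md5($body);
        $groups[$key][] = $label;
    }
    return $groups;
}

// Hypothetical cURL-based fetcher (assumes the cURL extension is loaded).
function curlFetch($url, $userAgent)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($ch);
}

// Example call (needs network access):
// $agents = array(
//     'google' => 'Googlebot/2.1 (http://www.googlebot.com/bot.html)',
//     'opera'  => 'Opera/9.00 (Windows NT 5.1; U; en)',
// );
// print_r(compareUserAgents('http://www.example.com/', $agents, 'curlFetch'));
```

More than one group in the result suggests the site varies its output by user agent, though pages with timestamps or other dynamic content can differ between requests for unrelated reasons.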



Originally posted on Makebeta

