Scraping Links With PHP
By Justin Laing, 2008-01-06
Iterate And Store Your Links
Next we’ll iterate through all the links we’ve gathered using XPath and store them in a database. First the code to iterate through the links:
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
}
$hrefs is an object of type DOMNodeList and item() is a function that returns a DOMNode object for the specified index. Valid indexes run from 0 to $hrefs->length - 1; item() returns null for anything outside that range. So we’ve got a loop that retrieves each link as a DOMNode object.
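To make the index range concrete, here is a minimal sketch. It assumes a tiny hard-coded page and uses getElementsByTagName() instead of the tutorial's XPath query purely for brevity; both return a DOMNodeList.

```php
<?php
// Hypothetical two-link page standing in for real fetched HTML.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="/a">A</a><a href="/b">B</a></body></html>');
$hrefs = $dom->getElementsByTagName('a');

$count = $hrefs->length;           // number of nodes in the list
$first = $hrefs->item(0);          // first valid index
$last  = $hrefs->item($count - 1); // last valid index
$past  = $hrefs->item($count);     // one past the end: no node there

var_dump($past === null);
```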
$url = $href->getAttribute('href');
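item() is declared to return a DOMNode, but each node matched here is an <a> element, which makes it a DOMElement (a subclass of DOMNode), and getAttribute() is defined on DOMElement. getAttribute() returns the value of the named attribute (in this case the href attribute of an <a> tag), or an empty string if the element has no such attribute. Now we’ve got our URL and we can store it in the database.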
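The loop can also be written with foreach, since DOMNodeList is traversable. The sketch below is self-contained: the $html string is made-up sample markup, and the links are collected into an array rather than passed to storeLink() so it runs without a database.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body>'
      . '<a href="http://example.com/one">One</a>'
      . '<a name="anchor-only">No href</a>'
      . '<a href="/two">Two</a>'
      . '</body></html>';

$dom = new DOMDocument();
// Suppress the warnings that malformed real-world HTML tends to trigger.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->query('/html/body//a');

$urls = array();
foreach ($hrefs as $href) {
    // Only element nodes carry attributes; skip anchors without an href.
    if ($href instanceof DOMElement && $href->hasAttribute('href')) {
        $urls[] = $href->getAttribute('href');
    }
}

print_r($urls);
```

Checking hasAttribute() first keeps named anchors, which have no href, from being stored as empty URLs.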
We’ll want a database table that looks something like this:
CREATE TABLE `links` (
    `url` TEXT NOT NULL,
    `gathered_from` TEXT NOT NULL,
    `time_stamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);
We’ll need a storeLink() function to put the links in the database. I’ll assume you know the basics of how to connect to a database (if not, work through a MySQL & PHP tutorial first).
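function storeLink($url, $gathered_from) {
    // Escape both values so a quote in a URL cannot break (or inject into) the query.
    // Note: the mysql_* functions need an open connection and were removed in PHP 7.
    $url = mysql_real_escape_string($url);
    $gathered_from = mysql_real_escape_string($gathered_from);
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}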
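On current PHP versions the same insert can be done with PDO and a prepared statement. This is only a sketch under some assumptions: storeLinkPdo() and its $pdo parameter are hypothetical names that are not part of the original tutorial, and the demo uses an in-memory SQLite database (which requires the pdo_sqlite extension) so it runs without a MySQL server.

```php
<?php
// Hypothetical PDO variant of storeLink(); $pdo is an already-connected handle.
function storeLinkPdo(PDO $pdo, $url, $gathered_from) {
    $stmt = $pdo->prepare(
        'INSERT INTO links (url, gathered_from) VALUES (:url, :gathered_from)'
    );
    // Values travel separately from the SQL text, so quotes in a URL are harmless.
    return $stmt->execute(array(
        ':url' => $url,
        ':gathered_from' => $gathered_from,
    ));
}

// Demo setup: in-memory SQLite stands in for the MySQL table (an assumption).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE links (
        url TEXT NOT NULL,
        gathered_from TEXT NOT NULL,
        time_stamp TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
    )'
);

// A URL containing a single quote, which would break the unescaped query.
storeLinkPdo($pdo, "http://example.com/?q=it's", 'http://example.com/');

$stored = $pdo->query('SELECT url FROM links')->fetchAll(PDO::FETCH_COLUMN);
print_r($stored);
```

Against MySQL, only the connection line would change, for example to a `mysql:host=...;dbname=...` DSN with your own credentials.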
Originally posted on Makebeta
