
Perl Cookbook

Chapter 20: Web Automation

20.7. Finding Stale Links

Problem

You want to check whether a document contains invalid links.

Solution

Use the technique outlined in Recipe 20.3 to extract each link, and then use the LWP::Simple module's head function to make sure that link exists.

Discussion

Example 20.5 applies the link-extraction technique. Instead of just printing each link, we call the LWP::Simple module's head function on it. head issues an HTTP HEAD request, which retrieves the remote document's metainformation (its headers and status) without downloading the whole document. If the request fails, head returns false, so we report the link as bad.
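To try the head test on individual URLs, a minimal sketch follows. It assumes LWP::Simple is installed. The helper name check_link and its optional second argument (a code reference that stands in for head, so the logic can be exercised without a network) are hypothetical and not part of the recipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(head);

# check_link is a hypothetical helper: it returns "OK" when the checker
# reports success for $url and "BAD" otherwise. The checker defaults to
# LWP::Simple's head, which returns false when the request fails.
sub check_link {
    my ($url, $checker) = @_;
    $checker ||= \&head;
    return $checker->($url) ? "OK" : "BAD";
}

# The network is touched only for URLs given on the command line.
print "$_: ", check_link($_), "\n" for @ARGV;
```

Run it with one or more URLs as arguments. Some servers handle HEAD requests poorly, so a "BAD" result may deserve a second look with a full GET before the link is declared dead.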

Because this program fetches its starting document with the get function from LWP::Simple, it expects a URL, not a filename. If you want to accept either, use the URI::Heuristic module described in Recipe 20.1.
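As a rough illustration of what URI::Heuristic does with loosely written arguments, the sketch below expands two sample strings. It assumes the URI distribution is installed and uses the uf_uristr function as named in the module's documentation; the sample arguments are made up for the example.

```perl
use strict;
use warnings;
use URI::Heuristic qw(uf_uristr);

# uf_uristr guesses a full URL string from a partial one. A leading
# "www" is taken to be a web host and a leading "/" a local file.
for my $arg ("www.example.com", "/etc/passwd") {
    print "$arg => ", uf_uristr($arg), "\n";
}
```

In churl, the same call could wrap the command-line argument before it is handed to get, for instance `$base_url = uf_uristr($base_url)`.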

Example 20.5: churl

#!/usr/bin/perl -w
# churl - check urls

use HTML::LinkExtor;
use LWP::Simple qw(get head);

$base_url = shift
    or die "usage: $0 <start_url>\n";
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse(get($base_url));
@links = $parser->links;
print "$base_url: \n";
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name , $attr_value) = splice(@element, 0, 2);
        if ($attr_value->scheme =~ /\b(ftp|https?|file)\b/) {
            print "  $attr_value: ", head($attr_value) ? "OK" : "BAD", "\n";
        }
    }
}
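The inner loop of churl depends on the shape of the records that links returns: each one is an array reference holding a tag name followed by attribute and value pairs. The sketch below walks that structure using hand-written sample records, so it needs no modules or network access. The helper name link_triples and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Flatten LinkExtor-style records into [tag, attribute, value] triples.
# Each record is an array ref: [ tag, attr1 => value1, attr2 => value2, ... ].
sub link_triples {
    my @triples;
    for my $record (@_) {
        my ($tag, @pairs) = @$record;    # copy, so the input is left untouched
        while (my ($attr, $value) = splice(@pairs, 0, 2)) {
            push @triples, [$tag, $attr, $value];
        }
    }
    return @triples;
}

# Hand-written sample records (illustrative data, not real parser output).
my @sample = (
    [ a   => href   => "http://www.example.com/index.html" ],
    [ img => src    => "http://www.example.com/logo.gif",
             lowsrc => "http://www.example.com/logo-small.gif" ],
);

printf "%-4s %-7s %s\n", @$_ for link_triples(@sample);
```

A single element can contribute several links, which is why churl consumes the pairs two at a time with splice rather than assuming one URL per tag.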

Here's an example of a program run:

% churl http://www.wizards.com
http://www.wizards.com:
  FrontPage/FP_Color.gif:  OK
  FrontPage/FP_BW.gif:  BAD
  #FP_Map:  OK
  Games_Library/Welcome.html:  OK

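This program has the same limitation as the HTML::LinkExtor program in Recipe 20.3: if fetching the starting URL involves a redirection, the links are resolved against the original URL rather than the URL at the end of the redirection.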
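If the limitation in question is the redirect one (links resolved against the requested URL rather than the URL a redirect leads to), one possible workaround is to fetch the page with LWP::UserAgent and hand the response's own base URL to the parser. The sketch below assumes LWP::UserAgent and HTML::LinkExtor are installed; the helper name links_from_response is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

# links_from_response is a hypothetical helper: it extracts links from an
# HTTP::Response, resolving relative links against $response->base, which
# is derived from the final request rather than the URL first asked for.
sub links_from_response {
    my ($response) = @_;
    my $parser = HTML::LinkExtor->new(undef, $response->base);
    $parser->parse($response->decoded_content);
    $parser->eof;
    return $parser->links;
}

# The network is touched only when a start URL is given on the command line.
if (my $start = shift @ARGV) {
    my $response = LWP::UserAgent->new->get($start);
    die "Can't fetch $start: ", $response->status_line, "\n"
        unless $response->is_success;
    for my $link (links_from_response($response)) {
        my ($tag, %attr) = @$link;
        print "$tag: $_\n" for values %attr;
    }
}
```

The link records have the same shape as in churl, so the head check could be applied to each value in place of the print.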

See Also

The documentation for the CPAN modules HTML::LinkExtor, LWP::Simple, LWP::UserAgent, and HTTP::Response; Recipe 20.8

