- About this service
- What it does
- Use it online
- Install it locally
- Robots exclusion
- Comments, suggestions and bugs
About this service
To check the validity of the links in the technical reports that W3C publishes, the Systems Team has developed a link checker.
A first version was developed in August 1998 by Renaud Bruyeron. Since it was lacking some functionalities, Hugo Haas rewrote it more or less from scratch in November 1999. It has been improved by Ville Skyttä and many other volunteers since.
The source code is available publicly under the W3C IPR software notice from CPAN (released versions) and CVS (development and archived release versions).
What it does
The link checker reads an HTML or XHTML document or a CSS style sheet and extracts a list of anchors and links.
It checks that no anchor is defined twice.
It then checks that all the links are dereferenceable, including the fragments. It warns about HTTP redirects, including directory redirects.
It can recursively check part of a Web site.
There is a command line version and a CGI version. Both support HTTP basic authentication; in the CGI version, this is achieved by passing the authorization information through from the user's browser to the site being tested.
Use it online
There is an online version of the link checker.
In the online version (and in general, when run as a CGI script), the number of documents that can be checked recursively is limited.
Both the command line version and the online one sleep at least one second between requests to each server, to avoid abuse and congestion of the target servers.
Access keys
The following access keys are implemented throughout the site to help users of screen readers.
- Home: access key "1" leads back to the service's home page.
- Downloads: access key "2" leads to downloads.
- Documentation: access key "3" leads to the documentation index for the service.
- Feedback: access key "4" leads to the feedback instructions.
Install it locally
The link checker is written in Perl. It is packaged as a standard CPAN distribution, and depends on a few other modules which are also available from CPAN.
Install with the CPAN utility
If your system has a working installation of Perl, you should be able to install the link checker and its dependencies with a single line from the command line shell:

sudo perl -MCPAN -e 'install W3C::LinkChecker'

(omit sudo if installing from an administrator account).
If this is the first time you have used the CPAN utility, you may have to answer a few setup questions before the tool downloads, builds and installs the link checker.
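If you prefer a non-interactive installer and have the cpanminus client available (cpanm is a separate CPAN tool, not part of the link checker distribution), the equivalent installation would be roughly:

sudo cpanm W3C::LinkChecker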
Install by hand
If for any reason the technique described above does not work, or if you prefer installing each package by hand, follow the instructions below; a sketch of the manual build steps follows the list.
- Install Perl, version 5.8 or newer.
- You will need the following CPAN distributions, as well as any distributions they depend on. Depending on your Perl version, you might already have some of these installed. The latest versions of these may require a recent version of Perl; as long as the minimum version requirements below are satisfied, the latest version is not needed, and an older version that works with your Perl is fine. For an introduction to installing Perl modules, see The CPAN FAQ.
- W3C-LinkChecker (the link checker itself)
- CGI.pm (required for CGI mode only)
- Config-General (optional, version 2.06 or newer; required only for reading the (optional) configuration file)
- CSS-DOM (version 0.09 or newer)
- HTML-Parser (version 3.20 or newer)
- libwww-perl (version 5.802 or newer)
- Net-IP (optional but recommended; required for restricting access to private IP addresses)
- TermReadKey (optional but recommended; required only in command line mode for password input)
- Time-HiRes
- URI (version 1.31 or newer)
- Optionally, install the link checker configuration file, etc/checklink.conf, contained in the link checker distribution package, into /etc/w3c/checklink.conf, or set the W3C_CHECKLINK_CFG environment variable to the location where you installed it.
- Optionally, install the checklink script into a location in your web server which allows execution of CGI scripts (typically a directory named cgi-bin somewhere below your web server's root directory).
- See also the README and INSTALL file(s) included in the above distributions.
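As a rough sketch of the manual route, the conventional build steps for a single Makefile.PL-based CPAN distribution, together with the optional configuration and CGI steps above, might look like the following. The version number, paths and cgi-bin directory are placeholders to adjust for your system; repeat the build steps for each dependency, and check each distribution's own README/INSTALL first.

# unpack, build, test and install one distribution (repeat per dependency)
tar xzf W3C-LinkChecker-X.XX.tar.gz
cd W3C-LinkChecker-X.XX
perl Makefile.PL
make
make test
sudo make install

# optional: install the configuration file, or point the checker at it
sudo cp etc/checklink.conf /etc/w3c/checklink.conf
# ...or, alternatively:
export W3C_CHECKLINK_CFG=/path/to/checklink.conf

# optional: copy the installed checklink script into your CGI directory
# (the cgi-bin location is an assumption; use your server's actual path)
sudo cp "$(command -v checklink)" /usr/lib/cgi-bin/checklink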
Running checklink --help shows how to use the command line version. The distribution package also includes more extensive POD documentation; use perldoc checklink (or man checklink on Unixish systems) to view it.
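For illustration, a couple of typical command line invocations might look like this; the URI is a placeholder, and the option names should be confirmed against checklink --help for your installed version:

# check a single document
checklink http://www.example.org/

# check part of a site recursively, reporting a summary only
checklink --recursive --summary http://www.example.org/docs/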
SSL/TLSv1 support for https in the link checker needs support for it in libwww-perl; see README.SSL in the libwww-perl distribution for more information.
In online mode, the link checker's output should not be buffered, to avoid browser timeouts. The link checker itself does not buffer its output, but in some cases output buffering needs to be explicitly disabled for it in the web server running it. One such case is Apache's mod_deflate compression module, which as a side effect results in output buffering; one way to disable it for the link checker (while leaving it enabled for other resources if so configured elsewhere) is to add the following section to an appropriate place in the Apache configuration (assuming the link checker script's filename is checklink):

<Files checklink>
SetEnv no-gzip
</Files>
If you want to enable the authentication capabilities with Apache, have a look at Steven Drake's hack.
The link checker honors proxy settings from the scheme_proxy environment variables. See LWP(3) and LWP::UserAgent(3)'s env_proxy method for more information.
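For example, when running the command line version behind a proxy, one could export the relevant variables in the shell before invoking it (the proxy host below is a placeholder):

export http_proxy=http://proxy.example.org:8080/
export https_proxy=http://proxy.example.org:8080/
checklink http://www.example.org/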
Some environment variables affect the way the link checker uses FTP. In particular, passive mode is the default. See Net::FTP(3) for more information.
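As an illustration, Net::FTP recognizes the FTP_PASSIVE environment variable, so passive mode can be forced on or off from the shell before a run:

export FTP_PASSIVE=1   # non-zero selects passive mode (the default); 0 selects active
checklink http://www.example.org/   # ftp: links in the checked documents will then use this setting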
There are multiple alternatives for configuring the default NNTP server for use with news: URIs without explicit hostnames; see Net::NNTP(3) for more information.
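For instance, Net::NNTP falls back to the NNTPSERVER environment variable when no host is specified, so one option is to set it in the shell (the hostname below is a placeholder):

export NNTPSERVER=news.example.org
checklink http://www.example.org/   # news: links without a hostname will then use this server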
Robots exclusion
The link checker honors robots exclusion rules. To place rules specific to the W3C Link Checker in /robots.txt files, sites can use the W3C-checklink user agent string. For example, to allow the link checker to access all documents on a server and to disallow all other robots, one could use the following:

User-Agent: *
Disallow: /

User-Agent: W3C-checklink
Disallow:
Robots exclusion support in the link checker is based on the LWP::RobotUA Perl module. It currently supports the "original 1994 version" of the standard. The robots META tag, i.e. <meta name="robots" content="...">, is not supported.
Other than that, the link checker's implementation goes all the way in trying to honor robots exclusion rules; if a /robots.txt disallows it, not even the first document submitted as the root for a link checker run is fetched. Note that /robots.txt rules affect only user agents that honor them; they are not a generic method for access control.
Comments, suggestions and bugs
The current version has proven to be stable. It could, however, be improved; see the list of open enhancement ideas and bugs for details.
Please send comments, suggestions and bug reports about the link checker to the www-validator mailing list (archives), with 'checklink' in the subject. See the examples below:
- Good: Subject: online checklink times out when accessed with Iceweasel 2.1.12
- Bad: Subject: checklink
- Bad: Subject: checklink does not work
Known issues
If a link checker run in "summary only" mode takes a long time, some user agents may stop loading the results page due to a timeout. We have placed workarounds in the code in the hope of avoiding this, but have not yet found one that works reliably for all browsers. If you experience these timeouts, try avoiding "summary only" mode, or try using the link checker with another browser.