Before we move on to the techniques for creating your own custom error pages, we'll look at some of the ways that you can help to prevent broken links from appearing both on your site, and on other sites and search engines that provide links to your site.
Tools for Automatically Checking and Maintaining Links
If you create and maintain your site using a proprietary tool, such as Microsoft FrontPage, you can take advantage of its features for monitoring links. Using the WWW option, you can see a graphical representation of your site that automatically highlights any broken links.
For example, the following screenshot shows InterDev displaying the links in a special version of the Software home page, into which we intentionally introduced two broken links so that you can see how it works:
The broken links are shown with the icon appearing to be torn into two pieces. And, although you can't see it here, the 'broken link' lines are also displayed in red, rather than gray like the other links. Notice also that one of the broken links (software.htm) is on this site, whereas the other (http://www.ewarhouse.com) is a misspelled external link. This is visible because we turned on the Files option using the toolbar shown above. Most tools, including FrontPage, will also display links to all the graphics on the pages as well; we had this option turned off in the previous screenshot.
There are several tools available that can provide this feature, in a range of different ways. You'll also find that most tools which allow you to create and manage your site's content will attempt to automatically update the links when you move pages from one folder to another. This can be a real time-saver, and avoids most instances of this kind of error.
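The core of such a link checker is easy to sketch. The following Python fragment is an illustration only - the page markup and the `existing` file list are invented for the example - but it shows the basic idea: extract the HREF targets from a page and flag any that don't correspond to a known file.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the HREF target of every <A> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Hypothetical site: the set of files we know exist on the server
existing = {"default.htm", "products.htm"}

page = '<A HREF="products.htm">Products</A><A HREF="software.htm">Software</A>'
extractor = LinkExtractor()
extractor.feed(page)

broken = [link for link in extractor.links if link not in existing]
print(broken)   # ['software.htm']
```

A real tool would, of course, walk every page of the site and resolve relative paths before comparing, but the extract-and-compare loop is the same.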
Check your HREF Syntax
Even if you get the name of the page or graphic right, you can still sometimes cause broken links. A typical case is when you check your site using only Internet Explorer, because it is a lot less fussy about the syntax of links than most other browsers. In particular, if you use a backslash instead of a forward slash in a URL, Navigator will disregard everything after the backslash. This means that it won't find graphics or other pages that are referenced in a relative link.
For example, if we insert an image into a page and then load this page into Navigator or Internet Explorer, it works fine. And if we put a link on another page that points to this page, using the wrong type of slash:
<A HREF="stuff\mypage.htm">Go to my page</A>
IE converts the backslash into a forward slash and still finds the picture in the new page. However, in Navigator (and many other browsers), you don't get the image. They assume the image to be in the directory below - named test in these screenshots. You can see the unconverted slash character in the Address box in the next screenshot:
The same applies if you include a backslash in the path to an image. It's an easy mistake to make if you are used to typing physical paths to files in a DOS Command window. Hence the often-stated advice: always check your pages in as many different browsers as you can.
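You can see why this happens by resolving the URL strictly, as Navigator does. In this Python sketch (the page locations are invented for the example), the standard library's urljoin treats the backslash as an ordinary character rather than a path separator:

```python
from urllib.parse import urljoin

base = "http://example.com/test/page.htm"   # hypothetical page location
bad_href = "stuff\\mypage.htm"              # backslash instead of forward slash

# Strict resolution keeps the backslash as a literal character, so the
# browser asks for one oddly-named file in the /test/ directory:
print(urljoin(base, bad_href))
# http://example.com/test/stuff\mypage.htm

# IE's behavior amounts to normalizing the slashes before resolving:
print(urljoin(base, bad_href.replace("\\", "/")))
# http://example.com/test/stuff/mypage.htm
```

The first result explains the missing image: nothing on the server matches that single odd name, so the request fails in any browser that follows the URL rules to the letter.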
Provide Default Pages
Back in Chapter 2, we discussed how we can use a default page to redirect visitors to an appropriate starting point for the resource they are looking for, or to a main Home page. This is useful when users access the site without providing the name of a page, for example http://yoursite/usefulpages/. You can send back a menu of the useful pages you provide, or a Home page containing a prominent link to these useful pages.
The default page can be an HTML page (usually
default.htm) which contains a <META> redirection tag, some client-side redirection code, or just a normal <A> link-or preferably all three, as shown in Chapter 2. However, an even better solution is to use an ASP page (usually default.asp) that performs the redirection through a Response.Redirect statement. This is less obvious than displaying a link for them to click, or a blank page with a <META> redirection tag.
Any user that accesses the URL http://webdev.wrox.co.uk/reference/ will be automatically and invisibly redirected to the main site frameset page - default.asp in the root directory of the whole site. But, because this page accepts a parameter named page in the query string that will load a particular page into the main frame, they get a frameset with the reference Tools menu page displayed instead of the main site's Home page:
<%
strPageURL = Request.QueryString("page")
%>
<FRAMESET cols="100,*" FRAMEBORDER="NO" BORDER="NO">
<%
If Len(strPageURL) = 0 Then        'main Home page
%>
  <FRAME src="/navigate.htm" SCROLLING="NO" MARGINWIDTH=1 MARGINHEIGHT=0>
  <FRAME src="/webdev/WhatsNew.asp" NAME="mainframe">
<%
Else
  Select Case strPageURL
    Case "reference"               'reference tools menu
%>
  <FRAME src="/nav_wdr.htm" SCROLLING="NO" MARGINWIDTH=1 MARGINHEIGHT=0>
  <FRAME src="/reference/reference.asp" NAME="mainframe">
<%
    '... code for other 'page' values goes here
  End Select
End If
%>
</FRAMESET>
Although search engines and other sites tend to specify the full URL (i.e. including the page's filename) when they link to your site, this technique is still useful. Experienced visitors-when faced with a 'Not Found' message-often trim off the file name or parts of the path in the URL in their browser's address box and try again. If you've provided a default page, you'll catch the request this time.
And of course it's always worthwhile trying to get other sites to omit the filename when they link to your site anyway - especially if the link is to a specific subset of pages. One way to encourage this is to make it easy for other sites to put links to you on their pages. On the Web-Developer site we provide a Link page that contains not only some graphics that other sites can use, but also the code to insert them into their HTML. This means that we get to specify the HREF they use:
Using a Directory Listing
Back in Chapter 2, we mentioned the risks involved in allowing users to browse the folders on your Web server. If you turn on the Directory Browsing Allowed option (which can be done for an individual directory in IIS4, but only for the whole site in IIS3), visitors will be able to view that folder's contents and navigate between folders that don't contain a default file (usually default.htm or default.asp):
This might be useful if you want to give visitors free access to all the contents of that folder on the server. However, if the pages depend on each other (for example they are part of an application or have to be loaded in a particular order), visitors may get inconsistent results if they load a page that is not the proper 'start' page. Also remember that the listing allows users to move from one folder to another, so make sure that at least one physical folder above this one (i.e. nearer to the root folder of the site) has a default.htm and default.asp file to prevent the entire site's contents from being listed.
Problems with Search Engines
One area where more broken links than ever can arise is when your pages are indexed by one of the Web's search engines. This might be done without you realizing: 'crawlers' that you've never even heard of follow links from one site to another, so they can index your site while you know nothing about it. The Web Robots pages listed 173 known robots active on the Web at the time of writing - see http://info.webcrawler.com/mak/projects/robots/robots.html for more details. If you have custom visitor logging enabled (as demonstrated in the next chapter), you can see if any crawlers have passed through your site by examining the list of user agents and comparing them to this list of robots.
The traditional search engines, such as Excite, maintain huge indexes that contain millions of entries, and some don't get round to checking and confirming that entries are valid very often - if ever. They depend on you to do the work of keeping your index entries up to date. However, with the dozens of search engines in everyday use, it's very difficult to keep track of every link to your site.
Remove Old Pages from Search Engines
However, if you discover a broken link to your site, you can (and should) do something about it. One option is to place a default.asp or default.htm redirection file at the referenced position in your site. Then, visitors following the link provided by the search engine will be redirected to your new page, or directly to your Home page.
Alternatively, once you know which search engine holds the broken link, you can go to that search engine's site and remove it. Almost all search engines provide a facility for reporting and removing old, non-existent links; for example, here is the page on Infoseek for doing just that:
AltaVista also provides the same service. Their 'Scooter' crawler checks that indexed pages still exist, and removes them automatically. However, it may not get round to your site for a while. To remove pages from the index yourself, you just submit the old invalid URLs using the normal Add URL option. Scooter will then attempt to index them immediately, find that they have gone, and delete them from the index.
Preventing Directory and File Indexing
If you want to prevent crawlers and search engines from indexing parts of your site, you can use either a special <META> tag or a robots.txt file. Both techniques are simple enough to implement, though you can't guarantee that all search engines will abide by your instructions. However, one or both are worth including, as most of the popular search engines recognize them.
Using a robots.txt File
The simplest and quickest way to control indexing of your pages is to provide a single robots.txt file in the root folder of your site. Note that this file must be in the folder referenced by your root URL, i.e. /robots.txt or http://yoursite/robots.txt, and that the filename is case-sensitive-it must all be in lowercase.
Inside the file you place a single
User-agent identifier, followed by any number of Disallow statements. Each one prevents indexing of this folder (or file), and any folders below this folder. You can add comments after a hash character in any line. The usual kinds of entries are:
User-agent: * # applies to all robots
Disallow: /fileorfolder # all files and folders with this name
Disallow: /thisfolder/ # disallow a folder with this name
Disallow: /thisfolder/thisfile.htm # disallow just this file
Note that the entry for /fileorfolder will prevent indexing not only of any folder named fileorfolder that is in the root folder of your site, but also any pages named fileorfolder.htm, fileorfolder.asp, etc. which are in that folder. To prevent any part of your site being indexed, you can use the file:
User-agent: * # applies to all robots
Disallow: / # applies to the entire site
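If you want to check what effect a set of rules will have before deploying them, Python's standard urllib.robotparser module implements the same exclusion protocol that well-behaved crawlers follow. This sketch (the site URL and paths are invented for the example) parses a small rule set and tests a few paths against it:

```python
from urllib.robotparser import RobotFileParser

# Parse a rule set directly, as a crawler would after fetching /robots.txt
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /thisfolder/",
    "Disallow: /private.htm",
])

# Paths under a disallowed folder are blocked, as is the named file;
# anything else remains fetchable
print(rp.can_fetch("*", "http://yoursite/thisfolder/page.htm"))  # False
print(rp.can_fetch("*", "http://yoursite/private.htm"))          # False
print(rp.can_fetch("*", "http://yoursite/public.htm"))           # True
```

This is a quick way to confirm that a Disallow line really covers the folders you intend, and nothing more, before the file goes live.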
Remember that, even though Windows NT is relaxed about case sensitivity, this is not universally so on the Web. Make sure the URLs are all of the correct case in your robots.txt file.

Search engines change the rules that they apply when indexing sites, in particular to protect themselves from multiple entries pointing to the same page. At the time of writing, some were reviewing their policy on ASP and other dynamically created pages. Search sites generally publish the conditions that they apply when indexing sites, and you should keep up to date with these to make sure your sites are included where appropriate.

Using a <META> ROBOTS tag
The alternative method of controlling indexing is with a <META> tag in each page or each section menu page. However (as in the case of using a robots.txt file), you can't guarantee that an instruction in one page will prevent indexing of pages linked to, or in folders below, this page. The <META> tag only instructs the crawler on whether it can index this page and follow the links in this page. If you stop it following links to, say, secret.htm from this page, it may still find another page that links to secret.htm and has no <META> indexing control tag. And that page could even be on a different Web site. You really should put the tag in all pages that you don't want to be indexed.
The tag itself is simple enough. The
NAME part is just "robots", and the CONTENT part is a comma-delimited string of instructions. Here are some examples:
<META NAME="robots" CONTENT="noindex">
<META NAME="robots" CONTENT="nofollow">
<META NAME="robots" CONTENT="noindex,nofollow">
<META NAME="robots" CONTENT="none">
The first just prevents indexing of this page. The second allows indexing of this page, but prevents the crawler from following any links in the page. The third entry prevents it from both indexing this page and following any links in it. The value none in the fourth entry is the equivalent of noindex,nofollow. You can also use index, follow and all; however, these are the defaults if omitted, so there is no real point in doing so.
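On the crawler's side, honoring this tag starts with extracting the directives from each page. Here is a minimal Python sketch, assuming only the standard library's HTMLParser; the class name is our own invention:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <META NAME="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased, so NAME= and name= both match
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives.update(d.strip().lower() for d in content.split(","))

parser = RobotsMetaParser()
parser.feed('<META NAME="robots" CONTENT="noindex,nofollow">')
print(sorted(parser.directives))   # ['nofollow', 'noindex']
```

A crawler would then skip indexing when noindex (or none) is present, and skip the page's links when nofollow (or none) is present.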
Using Custom HTTP Headers
You'll recall from the discussion we had in Chapter 2 about <META> tags that they are often just another way of creating HTTP headers, but in the browser rather than on the server. In theory, a search engine or crawler should react to a <META> tag like this:
<META HTTP-EQUIV="robots" CONTENT="noindex">
in the same way as it does to the previous examples of the <META> tag. This would allow us to use custom headers created within Internet Information Server for all the files in one folder or virtual directory:
While we haven't been able to verify that this technique is reliable, there is no harm in implementing it as well as the more traditional methods. One interesting point about robots is what you should do if you have pages that are secret, or for which you don't want to advertise their presence on your site. If you name them in robots.txt, are you just providing an excellent way for any crawler to find out about them? It's probably safer to stick to including a <META> tag in the page, making sure there are no links to it anywhere on your site, and of course protecting the file using one of the techniques we described in Chapter 5.
And, finally, think about providing a site map or resources map to help your visitors find what they want more easily. However, even if you put all these techniques into practice, you can still get Not Found errors. If a user types the URL of a page that doesn't exist, you can't do much about it. Even if there is a default page in that folder (and you have the directory listing option disabled), the Web server will still send back a Not Found error. To get round this, we can implement a custom error page like the one we showed earlier in this chapter. We'll see how we created our custom error page next.