Preventing Broken Links

Before we move on to the techniques for creating your own custom error pages, we'll look at some of the ways that you can help to prevent broken links from appearing both on your site, and on other sites and search engines that provide links to your site.

Tools for Automatically Checking and Maintaining Links

If you create and maintain your site using one of the proprietary tools, such as Microsoft Visual InterDev or FrontPage, you can take advantage of their features for monitoring links. Using the View Links on WWW option, you can see a graphical representation of your site that automatically highlights any broken links.

For example, the following screenshot shows Visual InterDev displaying the links in a special version of the Stonebroom Software home page, into which we've intentionally introduced two broken links so that you can see how it works:

The broken links are shown with the icon appearing to be torn into two pieces. And, although you can't see it here, the 'broken link' lines are also displayed in red, rather than gray like the other links. Notice also that one of the broken links (software.htm) is on this site, whereas the other (http://www.ewarhouse.com) is a misspelled external link. This is visible because we turned on the Show External Files option using the toolbar shown above. Most tools, including Visual InterDev and FrontPage, will also display links to all the graphics on the pages-we had this option turned off in the previous screenshot.

There are several tools available that can provide this feature, in a range of different ways. You'll also find that most tools which allow you to create and manage your site's content will attempt to automatically update the links when you move pages from one folder to another. This can be a real time-saver, and avoids most instances of this kind of error.
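
If you don't use one of these tools, you can knock together a rough checker of your own. The following is only a minimal sketch (it isn't taken from any of the tools mentioned above) that you could run with the Windows Scripting Host. It assumes VBScript 5 for the RegExp object, that all the pages live in a single folder (the path shown is hypothetical), and it only checks simple same-folder relative links:

'checklinks.vbs - a minimal sketch, run with: cscript checklinks.vbs
Option Explicit
Dim fso, fldr, file, ts, content, match, target, re

Set fso = CreateObject("Scripting.FileSystemObject")
Set fldr = fso.GetFolder("C:\InetPub\wwwroot\mysite")   'hypothetical path

Set re = New RegExp                       'requires VBScript 5
re.Pattern = "href\s*=\s*""([^""#?]+)"""  'crude extraction of HREF="..." values
re.IgnoreCase = True
re.Global = True

For Each file In fldr.Files
  If LCase(fso.GetExtensionName(file.Name)) = "htm" Then
    Set ts = fso.OpenTextFile(file.Path)
    content = ts.ReadAll
    ts.Close
    For Each match In re.Execute(content)
      target = match.SubMatches(0)
      'only test simple relative links to files in the same folder
      If InStr(target, "://") = 0 And InStr(target, "/") = 0 _
         And InStr(target, "\") = 0 Then
        If Not fso.FileExists(fso.BuildPath(fldr.Path, target)) Then
          WScript.Echo file.Name & " -> broken link: " & target
        End If
      End If
    Next
  End If
Next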

Check your HREF Syntax

Even if you get the name of the page or graphic correct, you can still sometimes cause broken links. A typical case is where you only check your site using Internet Explorer, because it is a lot less fussy about the syntax of links than most other browsers. In particular, if you use a backslash instead of a forward slash in a URL, Navigator will disregard everything after the backslash. This means that it won't find graphics or other pages that are referenced in a relative link.

For example, if we insert an image into a page named http://sunspot/test/stuff/mypage.htm using:

<IMG SRC="mypicture.gif">

and then load this page into Navigator or Internet Explorer, it works fine. And if we put a link with the wrong type of slash on another page that points to this page:

<HTML><A HREF="stuff\mypage.htm">Go to my page</A></HTML>

IE converts the backslash into a forward slash and still finds the picture in the new page. However, in Navigator (and many other browsers), you don't get the image. They assume the image to be in the directory below (test in these screenshots). You can see the unconverted slash character in the Address box in the next screenshot:

The same applies if you include a backslash in the path to an image. It's an easy mistake to make if you are used to typing physical paths to files in a DOS Command window. Hence the often-stated advice-always check your pages in as many different browsers as you can.
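
One simple precaution, if you build links dynamically in ASP, is to normalize the slashes yourself before the link is written to the page. This is just our own small sketch, not a feature of any particular browser or tool:

<%
'FixSlashes - replaces any backslashes in a URL with forward slashes
Function FixSlashes(strURL)
  FixSlashes = Replace(strURL, "\", "/")
End Function
%>

<A HREF="<%= FixSlashes("stuff\mypage.htm") %>">Go to my page</A>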

Provide Default Pages

Back in Chapter 2, we discussed how we can use a default page to redirect visitors to an appropriate starting point for the resource they are looking for, or a main Home page. This is useful when users access the site without providing the name of a page, for example http://yoursite/usefulpages/. You can send back a menu of the useful pages you provide, or a Home page containing a prominent link to these useful pages.

The default page can be an HTML page (usually default.htm) which contains a <META> redirection tag, some client-side redirection code, or just a normal <A> link-or preferably all three, as shown in Chapter 2. However, an even better solution is to use an ASP page (usually default.asp) that performs the redirection through a Response.Redirect statement. This is far less obvious to the visitor than displaying a link for them to click, or a blank page with a <META> redirection tag.

For example, we might have this line in the default.asp page in our Reference directory:

<% Response.Redirect "/default.asp?page=reference" %>

Any user that accesses the URL http://webdev.wrox.co.uk/reference/ will be automatically and invisibly redirected to the main site frameset page-default.asp in the root directory of the whole site. But, because this page accepts a parameter named page in the query string that will load a particular page into the main frame, they get a frameset with the Wrox Reference Tools menu page displayed instead of the main site Home page:

...
<FRAMESET cols="100,*" FRAMEBORDER="NO" BORDER="NO">
<%
strPageURL = Request.QueryString("page")
If Len(strPageURL) = 0 Then          'main Home page
%>
  <FRAME src="/navigate.htm" SCROLLING="NO" MARGINWIDTH=1 MARGINHEIGHT=0>
  <FRAME src="/webdev/WhatsNew.asp" NAME="mainframe">
<%
Else
  Select Case strPageURL
    Case "reference"                 'reference tools menu
%>
  <FRAME src="/nav_wdr.htm" SCROLLING="NO" MARGINWIDTH=1 MARGINHEIGHT=0>
  <FRAME src="/reference/reference.asp" NAME="mainframe">
<%
    ...                              'code for other 'page' values goes here
    ...
<%
  End Select
End If
%>
</FRAMESET>
...
Although search engines and other sites tend to specify the full URL (i.e. including the page's filename) when they link to your site, this technique is still useful. Experienced visitors-when faced with a 'Not Found' message-often trim off the file name or parts of the path in the URL in their browser's address box and try again. If you've provided a default page, you'll catch the request this time.

And of course it's always worthwhile trying to get other sites to omit the filename when they link to your site anyway-especially if it's to a specific subset of pages. One way to do this is to make it easy for other sites to put links to you on their pages. On the Web-Developer site we provide a Trade A Link page that contains not only some graphics that other sites can use, but also the code to insert them into the HTML. This means that we get to specify the HREF they use:
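
A 'paste this onto your page' snippet of that kind might look something like the following; the image name and dimensions here are illustrative rather than the actual ones on the Trade A Link page:

<!-- link to the Wrox Web-Developer site -->
<A HREF="http://webdev.wrox.co.uk/"><IMG SRC="http://webdev.wrox.co.uk/images/webdev_link.gif"
   WIDTH="88" HEIGHT="31" BORDER="0" ALT="Wrox Web-Developer"></A>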

Using a Directory Listing

Back in Chapter 2, we mentioned the risks involved in allowing users to browse the folders on your Web server. If you turn on the Directory Browsing Allowed option (which can be done for an individual directory in IIS4, but only for the whole site in IIS3), visitors will be able to view that folder's contents and navigate between folders that don't contain a default file (usually default.htm or default.asp):

This might be useful if you want to give visitors free access to all the contents of that folder on the server. However, if the pages depend on each other (for example they are part of an application or have to be loaded in a particular order), visitors may get inconsistent results if they load a page that is not the proper 'start' page. Also remember that the listing allows users to move from one folder to another, so make sure that at least one physical folder above this one (i.e. nearer to the root folder of the site) has a default.htm and default.asp file to prevent the entire site's contents from being listed.
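
If you do want to offer a browsable folder, but keep control over how it looks, an alternative is to build the listing yourself in a default.asp page. This is just a sketch of the idea, assuming the Scripting.FileSystemObject component is available on the server:

<%
'default.asp - builds a simple listing of the files in this folder
Dim objFSO, objFolder, objFile
Set objFSO = Server.CreateObject("Scripting.FileSystemObject")
Set objFolder = objFSO.GetFolder(Server.MapPath("."))

Response.Write "<H3>Files in this folder</H3>" & vbCrLf & "<UL>" & vbCrLf
For Each objFile In objFolder.Files
  Response.Write "<LI><A HREF=""" & objFile.Name & """>" _
               & Server.HTMLEncode(objFile.Name) & "</A></LI>" & vbCrLf
Next
Response.Write "</UL>"
%>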

Problems with Search Engines

One area where more broken links than ever can arise is when your pages are indexed by one of the Web's search engines. This might happen without you even realizing it. 'Crawlers' that you've never heard of often follow links from one site to another, so they can index your site while you know nothing about it. The Excite Web Robot pages list 173 known robots currently active on the Web at the time of writing-see http://info.webcrawler.com/mak/projects/robots/robots.html for more details.

If you have custom visitor logging enabled (as demonstrated in the next chapter), you can see if any crawlers have passed through your site by examining the list of user agents and comparing them to the Excite Robots list.
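
The full logging is covered in the next chapter, but as a flavor of the idea, a few lines like these in an include file will record each visitor's user agent string in a simple text file (the log file path here is just an assumption):

<%
'log the user agent string of each visitor to a text file
Const ForAppending = 8
Dim objFSO, objLog
Set objFSO = Server.CreateObject("Scripting.FileSystemObject")
Set objLog = objFSO.OpenTextFile(Server.MapPath("/logs/agents.txt"), ForAppending, True)
objLog.WriteLine Now & vbTab & Request.ServerVariables("HTTP_USER_AGENT") _
               & vbTab & Request.ServerVariables("URL")
objLog.Close
%>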

The traditional search engines, such as Yahoo, Alta Vista, Infoseek, Excite, etc. maintain huge indexes that contain millions of entries, and some don't get round to checking and confirming that entries are valid very often-if ever. They depend on you to do the work of keeping your index entries up to date. However, with the dozens of search engines in everyday use, it's very difficult to keep track of every link to your site.

Remove Old Pages from Search Engines

However, if you discover a broken link to your site, you can (and should) do something about it. One option is to place a default.asp or default.htm redirection file at the referenced position in your site. Then, visitors following the link provided by the search engine will be redirected to your new page, or directly to your Home page.
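
The redirection file itself only needs a couple of lines. In this sketch (the new page path is hypothetical) we also note the HTTP_REFERER value in the IIS log, which tells us which search engine or site still holds the out-of-date link:

<%
'default.asp placed at the old location of a page that has moved
Dim strReferer
strReferer = Request.ServerVariables("HTTP_REFERER")
If Len(strReferer) > 0 Then
  'the string passed to AppendToLog must not contain commas
  Response.AppendToLog "Old link followed from " & strReferer
End If
Response.Redirect "/newsection/newpage.asp"   'the page's new location
%>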

Alternatively, once you know which search engine holds the broken link, you can go to that search engine's site and remove it. Almost all search engines provide a facility to report and remove old and non-existent links, for example here is the page on Infoseek for doing just that:

Alta Vista also provides the same service. Their 'Scooter' crawler checks that indexed pages still exist, and removes them automatically. However, it may not get round to your site for a while. To remove pages from the index yourself, you just submit the old invalid URLs using the normal Add/Remove URL option. Scooter will then attempt to index them immediately, find that they have gone, and delete them from the index.

Preventing Directory and File Indexing

If you want to prevent crawlers and search engines from indexing parts of your site, you can use either a special <META> tag or a robots.txt file. Both techniques are simple enough to implement, though you can't guarantee that all search engines will abide by your instructions. However, one or both are worth including, as most of the popular search engines recognize them.

Using a robots.txt File

The simplest and quickest way to control indexing of your pages is to provide a single robots.txt file in the root folder of your site. Note that this file must be in the folder referenced by your root URL, i.e. /robots.txt or http://yoursite/robots.txt, and that the filename is case-sensitive-it must all be in lowercase.

Inside the file you place a single User-agent identifier, followed by any number of Disallow statements. Each one prevents indexing of this folder (or file), and any folders below this folder. You can add comments after a hash character in any line. The usual kinds of entries are:

User-agent: * # applies to all robots
Disallow: /fileorfolder # all files and folders with this name
Disallow: /thisfolder/ # disallow a folder with this name
Disallow: /thisfolder/thisfile.htm # disallow just this file
Note that the entry for /fileorfolder will prevent indexing not only of any folder named fileorfolder that is in the root folder of your site, but also of any pages named fileorfolder.htm, fileorfolder.asp, etc. that are in the root folder. To prevent any part of your site from being indexed, you can use the file:

User-agent: * # applies to all robots
Disallow: / # applies to the entire site
Remember that, even though Windows NT is relaxed about case sensitivity, this is not universally so on the Web. Make sure the URLs in your robots.txt file are all of the correct case.

Search engines change the rules that they apply when indexing sites, in particular to protect themselves from multiple entries pointing to the same page. At the time of writing, some were reviewing their policy on ASP and other dynamically created pages. Search sites generally publish the conditions that they apply when indexing sites, and you should keep up to date with these to make sure your sites are included where appropriate.

Using a <META> ROBOTS tag

The alternative method of controlling indexing is with a <META> tag in each page or each section menu page. However (as in the case of using a robots.txt file), you can't guarantee that an instruction in one page will prevent indexing of pages linked to, or in folders below, this page.

The <META> tag only instructs the crawler on whether it can index this page and follow the links in this page. If you stop it following links to, say, secret.htm in this page, it may still find another page that links to secret.htm and has no <META> indexing control tag. And this page could even be on a different Web site. You really should put the tag in all pages that you don't want to be indexed.

The tag itself is simple enough. The NAME part is just "robots", and the CONTENT part is a comma-delimited string of instructions. Here are some examples:

<META NAME="robots" CONTENT="noindex">
<META NAME="robots" CONTENT="nofollow">
<META NAME="robots" CONTENT="noindex,nofollow">
<META NAME="robots" CONTENT="none">
The first just prevents indexing of this page. The second allows indexing of this page, but prevents the crawler from following any links in the page. The third entry prevents it from indexing this page and following any links in it. The value none in the fourth entry is equivalent to noindex,nofollow. You can also use index, follow and all; however, these are the defaults when omitted, so there is little point in specifying them.

Using Custom HTTP Headers

You'll recall from the discussion we had in Chapter 2 about <META> tags that they are often just another way of creating HTTP headers, but in the browser rather than on the server. In theory, a search engine or crawler should react to a <META> tag like this:

<META HTTP-EQUIV="robots" CONTENT="noindex">

in the same way as it does to the previous examples of the <META> tag. This would allow us to use custom headers created within Internet Information Server for all the files in one folder or virtual directory:

While we haven't been able to verify that this technique is reliable, there is no harm in implementing it as well as the more traditional methods.
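
If you can't, or don't want to, set the header up within Internet Information Server itself, the same header can also be sent from an individual ASP page with Response.AddHeader. This is just a sketch of that alternative; the header has to be added before any content is sent to the browser, so buffering is turned on first:

<%
'send the equivalent custom header from the page itself
Response.Buffer = True
Response.AddHeader "robots", "noindex"
%>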

One interesting point about robots is what you should do if you have pages that are secret, or whose presence on your site you don't want to advertise. If you name them in robots.txt, are you just providing an excellent way for any crawler to find out about them? It's probably safer to stick to including a <META> tag in the page, making sure there are no links to it anywhere on your site, and of course protecting the file using one of the techniques we described in Chapter 5.

And Finally

And, finally, think about providing a site map or resources map to help your visitors find what they want more easily. However, even if you put all these techniques into practice, you can still get 404 Not Found errors. If a user types the URL of a page that doesn't exist, you can't do much about it: even if there is a default page in that folder (and you have the directory listing option disabled), the Web server will still send back a 404 Not Found error. To get round this, we can implement a custom error page like the one we showed earlier in this chapter. We'll see how we created our custom error page next.


©1998 Wrox Press Limited, US and UK.