Checking Links to Other Sites

In Chapters 1 and 2, you saw the Links page on our Web-Developer site. This provides links to other related sites, and to sources of components and information on techniques that developers may be interested in:

However, as we discovered earlier in this chapter, providing links to other sites can be an easy way of introducing errors. Unless you are vigilant, and check each link on a regular basis, you won't know if one of the sites has moved its content around, changed the name of a file you are linking to, or just closed down altogether.

Using Default URLs for Links

We try to limit the effects of pages being moved around by following our own advice. Almost all the links are to a root URL or a specific folder on the target host, and not to a particular file. For example, we provide links to the World Wide Web Consortium at http://www.w3.org/, and the independent Active Server Pages resource site at http://www.activeserverpages.com/, using just their root URLs. In specific cases, we link to a pre-arranged file, such as Microsoft's Internet Explorer Home page at http://www.microsoft.com/ie/logo.asp. It's unlikely that this page will suddenly change, however, because it is used in the 'Get IE' banners you see at the foot of many Web pages.

But this alone doesn't guarantee that our visitors will always have a current and active page to go to. To make sure they do, we've implemented a simple administration page named checkdeadlinks.asp that can be used to check URLs to make sure that they respond, and to discover a bit about them. We'll show you how this is done next. The code for the finished page (shown below) is included with the samples for this book:

About Automated Link Confirmation

As well as loading a Web page or any other file into our browser, we can use specialized tools to retrieve the file and present it in a way more appropriate to the task in hand. In the case of the component we're using in this example, the appropriate way is as an ordinary character string, which we can store in a variable in our ASP code.

The ASP.HTTP Component

There are several components that can retrieve files across the Web, using the HTTP protocol. The one we use is produced by Stephen Genusa (steve@genusa.com), and is available from his ASP Server Components Web site at http://www.serverobjects.com/. You can download a free time-limited evaluation copy of the latest version to experiment with, and we've included more details about the component with the samples available for this book.

Basically, the component accepts a URL as a string, plus various other values that control the timeout, the protocol version to use, the headers it will present to the host, etc. Once we've specified the properties for the component, it connects to the specified URL (just like a browser running on our server would), and fetches the page.
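To give you a feel for how little code this needs, here's a minimal sketch of the basic pattern, using only the members we rely on later in this chapter (the Url and TimeOut properties and the GetURL method). The URL and variable names here are just examples:

<%
'minimal sketch - fetch one page and report how much we got back
Set oHTTP = Server.CreateObject("ASP.HTTP")  'create the component
oHTTP.Url = "http://www.w3.org/"             'the page to fetch (example URL)
oHTTP.TimeOut = 45                           'seconds to wait for a response
strPage = oHTTP.GetURL                       'returns the entire page as a string
Set oHTTP = Nothing                          'release the component
If Len(strPage) = 0 Then
  Response.Write "No response from the server within the timeout."
Else
  Response.Write "Retrieved " & Len(strPage) & " characters."
End If
%>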

We are only using it in a basic way, but you'll no doubt find other ways of putting it to work in your own applications. For example, you can set it up to save the files it retrieves to disk, present usernames and passwords to remote hosts, etc.

Building the checkdeadlinks.asp Page

The checkdeadlinks.asp page has a simple enough job to do, though the code to implement it does become a little complex. The plan is to examine all the pages listed in our Links table, to make sure that the URLs we use to link to these pages are still valid. At the start of the page, we include a text file containing our database connection details (as we've done in most of the earlier examples), then we set the two timeout values.

Setting the Timeouts

The first timeout value is the ASP script execution timeout, which we'll increase from the default value of 90 seconds to 40 minutes. This is (hopefully) far more time than we'll need, but it allows for those days when the 'Net, or our connection to it, is running slowly. Remember, we'll be collecting each page listed in our Links table from its host site, and this could take some time.

The second timeout is the value we'll use while fetching individual pages; in our example this is 45 seconds. If the host server doesn't respond within that time, we'll flag the page as doubtful. If it regularly takes this kind of time to react, we probably don't want to provide a link to it anyway, because our visitors will give up waiting if they try to follow the link:

<%@ LANGUAGE=VBSCRIPT %>
<!-- #include virtual="/connect/linksdb.inc" -->
<% Server.ScriptTimeOut = 2400 'will probably take a while to run %>
<% seekTitleTimeout = 45 'seconds to wait for page to arrive %>
...
...
Getting a List of URLs from the Links Table

Now we can open the Links table and create a recordset containing the URL of each page. You've seen this kind of thing done many times before in this book:

...
<% '--get list of HREFs from Links table--
QUOT = Chr(34)
CRLF = Chr(13) & Chr(10)
On Error Resume Next
Set oConn = Server.CreateObject("ADODB.Connection")
oConn.Open strConnect 'from include file at top of page
strSQL = "SELECT tLinkHRef FROM Links"
Set oRs = oConn.Execute(strSQL)
If (oRs.EOF) Or (Err.Number > 0) Then
  Response.Write "<FONT FACE=" & QUOT & "Arial" & QUOT & " SIZE=3>" _
    & "<B>Sorry, the database cannot be accessed.</B></FONT></BODY></HTML>"
  Response.End
End If
...
Checking Each URL

It's now time for the fun part, where we fetch each page and examine the contents. We'll use a couple of variables to keep track of the number of 'doubtful' pages we find (intNumPages), and to provide a set of unique window names to place in hyperlinks in this page (intWinNum). It makes sense to list each URL we examine as a hyperlink, so that we can easily open the site or page it refers to in cases where we get an error or warning. By opening each one in a separate browser window, we allow the administrator to view them without having to reload (and hence re-execute) our checkdeadlinks.asp page.

Then we start looping through the URLs in our recordset. We write the URL as a hyperlink into our page, creating our unique TARGET value as we go (the value of intWinNum is incremented at the end of the loop each time):

...
intNumPages = 0 'number of dead or possibly doubtful links found
intWinNum = 1 'target window number for URL to be opened in for checking
Do While Not oRs.EOF
  strURL = oRs("tLinkHRef") 'get the link URL
  '--process each link--
  Response.Write "Processing Link to: <A HREF=" & QUOT & strURL & QUOT _
    & " TARGET=" & QUOT & "CDLWin" & intWinNum & QUOT & ">" & strURL _
    & "</A><BR>" & CRLF
...
 
Fetching the Page with the ASP.HTTP Component

To fetch the page, we first set the values of the two 'result' variables, strResult and strTitle, to empty strings, and then instantiate our component. Then we set the Url and TimeOut properties, and call the GetURL method to retrieve the page. If it returns an empty string, we probably got a timeout against the host server, so we'll print a suitable message into the page and increment the number of doubtful pages in intNumPages.

If we do get a result, we can look to see what it contains. Remember that the component returns the entire content of the page as a string, including the HTML tags, and we can manipulate it using normal string-handling functions. If the page is an error message, it will usually contain the word 'error' or 'invalid', for example 'HTTP Error 404' within the body of the page and 'Error 404' in the <TITLE> section. If it does, we'll flag this page as also being doubtful:

...
strResult = "" 'to hold entire retrieved contents of the page
strTitle = "" 'to hold the page title
Set oHTTP = Server.CreateObject("ASP.HTTP") 'create component
oHTTP.Url = strURL 'set the URL
oHTTP.TimeOut = seekTitleTimeout 'set the timeout
strResult = oHTTP.GetURL 'and get the page
Set oHTTP = Nothing 'destroy the component
If Len(strResult) = 0 Then
Response.Write "<B>>> No reply from server in " _
& seekTitleTimeout & "seconds.</B><P>" & CRLF
intNumPages = intNumPages + 1 'increment number of doubtful links
Else
If Instr(LCase(strResult), "error") > 0 _
Or Instr(LCase(strResult), "invalid") > 0 Then
Response.Write "<B>>> Request returned an error.</B><P>" & CRLF
intNumPages = intNumPages + 1 'increment number of doubtful links
Else 'extract the title from the page if there is one
...
Extracting the Page Title and Checking the Content

The next section of the code strips out the page title by looking for the HTML <TITLE> and </TITLE> tags (in upper or lower case). If it doesn't find a title, it sets strTitle to 'Untitled page at:' instead. And while we've got the page content, we can play with it as well.

For example, the next few lines of our code look to see if we somehow got a page with 'doubtful' content added to our list of links. You might also like to look for other words that identify if the page is connected with the topics you want to include in your list of links. If you provide links to other Windows NT pages, you could flag any that didn't contain the words 'Windows NT' somewhere in the page:

...
    Else 'extract the title from the page if there is one
      intStart = Instr(UCase(strResult), "<TITLE>") + 7
      intFinish = Instr(UCase(strResult), "</TITLE>")
      'intStart is 7 rather than 0 when no <TITLE> tag was found, since we added 7 above
      If (intStart > 7) And (intFinish > intStart) Then
        strTitle = Trim(Mid(strResult, intStart, intFinish - intStart))
      End If
      If Len(strTitle) = 0 Then strTitle = "Untitled page at:"
      strResult = LCase(strResult) 'check for unwelcome content
      If InStr(strResult, " sex ") Or InStr(strResult, " adult ") Or _
         InStr(strResult, " porn ") Or InStr(strResult, " xxx ") Or _
         InStr(strResult, " nude ") Or InStr(strResult, " sexy ") Then
        intNumPages = intNumPages + 1 'increment number of doubtful links
        Response.Write "<B>>> Content Warning!</B>   "
      End If
      Response.Write "Page title is: " & strTitle & "<P>" & CRLF
      strResult = ""
    End If
  End If
  oRs.MoveNext 'go to the next link
  intWinNum = intWinNum + 1 'increment the target window number
Loop
Set oRs = Nothing
...
At the end of the previous section of code, we write the results of our content parsing exercise into the current page, and then go round and do the same for the next page. Once we've checked all the links, we write out a note as to what we found, and provide a link back to our main menu:

...
If intNumPages = 0 Then %>
<P>There were no unresponsive links, errors or content warnings.</P>
<% Else %>
<P><B>There were <% = intNumPages %> unresponsive links,
errors or content warnings.</B></P>
<% End If %>
<HR>
<FORM ACTION="mainmenu.asp">
<INPUT TYPE="SUBMIT" NAME="Submit" VALUE="Main Menu >">
</FORM>
</BODY>
</HTML>
Examining the Results

The next screenshot shows the results of running this page against our own Links table. It looks really useful, but notice that we have an error shown against two of the pages in the table, including the one you see near the bottom of the list:

We know that the site is still there, because we were visiting it yesterday. It might be that the folder has moved. By clicking the link, we can examine the page and see what happened. In fact, the page opens fine, so what went wrong?

The answer is in the way we checked for an error. The technique of just looking for the words 'error' or 'invalid' means that pages containing either of these words anywhere within the text will fail our test. And because our component returns the entire content of the page (HTML and all) as a string, the word that triggered the error might not even be visible in the page. This is the case with the activeserverpages.com site. Examining the source of the page, we find the following line, which creates an entry in a SELECT list on the page:

...

<OPTION value="/learn/dbtablewitherrortrap.asp">Db2table high quality

...

Some Notes About the Code

As you've just seen, the somewhat inelegant technique we used to check for an error does tend to trap valid pages. There are plenty of ways we could be more selective: for example, we could check for " error " and " invalid " (i.e. complete words), or even " http error ". However, it's easy enough to open a site that returns a 'doubtful' status anyway, so you may not think it worth doing any more complex processing.
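If you do decide to be more selective, one possible approach is sketched below. It just wraps the whole-word tests suggested above in a small helper function; the function name and the padding trick are our own choices, not part of the component:

<%
Function LooksLikeErrorPage(strContent)
  Dim strPadded
  'pad with spaces so a match at the very start or end of the content still works
  strPadded = " " & LCase(strContent) & " "
  'look for complete words, so a filename like dbtablewitherrortrap.asp no longer triggers a warning
  If InStr(strPadded, " error ") > 0 Or InStr(strPadded, " invalid ") > 0 Then
    LooksLikeErrorPage = True
  Else
    LooksLikeErrorPage = False
  End If
End Function
%>

The two Instr tests in checkdeadlinks.asp could then be replaced with a single call to LooksLikeErrorPage(strResult), and the word list tightened further (to " http error ", say) if even whole words prove too broad.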

The results we got earlier also include another anomaly, in that our own 'Doing Windows DNA' page (at http://webdev.wrox.co.uk/dna/) returned a title of 'Object Moved'. We know that this is because the page redirects the user via default.asp to collect a frameset (as shown in Chapter 2 and again earlier in this chapter). So, as you can see, picking out pages that are OK and those that aren't is more difficult than it first appears. How much code you implement in this respect ultimately depends on how many links you have to monitor, and how much you need to fully automate the process of picking out dead ones, without having to scan the list by eye.
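If the 'Object Moved' case crops up regularly in your own Links table, one option is to have the page report it explicitly as a redirect rather than printing the misleading title. A sketch is shown below; it only uses ordinary string handling, and assumes the target server words its redirect page the way IIS does:

<%
'after extracting strTitle in the usual way
If LCase(Trim(strTitle)) = "object moved" Then
  'the host is redirecting us, probably to a default page or a frameset
  Response.Write "Page redirects to another URL.<P>" & CRLF
Else
  Response.Write "Page title is: " & strTitle & "<P>" & CRLF
End If
%>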

You might also have noticed that we are destroying and re-instantiating the ASP.HTTP component each time we use it in the page, i.e. for each URL we check. It would seem more efficient to create an instance of it before the loop, and then use the same instance for each URL. In fact, we are running an old version of the component (which we use in other applications as well), and this version sometimes gets confused by servers that time out or return an invalid response. By re-instantiating it each time, we avoid any problems this might cause.
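If the build of the component you're running doesn't suffer from this problem, the alternative is straightforward, and is sketched below: create the instance once, set the timeout once, and just point the same instance at each URL inside the loop:

<%
Set oHTTP = Server.CreateObject("ASP.HTTP") 'create the component once
oHTTP.TimeOut = seekTitleTimeout            'the timeout applies to every request
Do While Not oRs.EOF
  strURL = oRs("tLinkHRef")                 'get the link URL
  oHTTP.Url = strURL                        'point the same instance at the next URL
  strResult = oHTTP.GetURL                  'and fetch the page
  '... examine strResult exactly as before ...
  oRs.MoveNext
Loop
Set oHTTP = Nothing                         'destroy it after the loop instead
%>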

You'll see very similar techniques to the ones we've used here later in the book, in Chapter 8, where we build up a list of sites that link to us (referrers). It uses the same ASP.HTTP component as we've used in our checkdeadlinks.asp page, but adds some extra features, such as allowing users to delete links.


©1998 Wrox Press Limited, US and UK.