Building A Web Spider

by Chris Payne

Introduction

Web spiders are probably one of the most useful tools ever developed for the internet. After all, with millions of separate and different sites out there today, how else can you gather all that information?

A spider does one thing - it goes out on the web and collects information. The way a typical spider (like Yahoo) works is by looking at one page and finding the relevant information. It then follows all the links in that page, collecting relevant information in each following page, and so on. Pretty soon, you'll end up with thousands of pages and bits of information in your database. This web of paths is where the term 'spider' comes from.

So how do you create a web spider? We'll explain that below, but first we'll need to outline some concepts.

Fundamentals

Web spiders can be built to search many things. In fact, there are several specialized commercial spiders out there, and these applications draw big bucks ($300k to license AltaVista's technology, for example). Here are the fundamentals of a web spider:

  • Collects information from a variety of sources
    Ideally, it should be able to collect from any source, not a limited set. The more sources, the better.

  • Accurate
    We all know the complaints about search engines returning 1 million plus results when only the last two are what you are looking for (or even worse, the middle two). A spider should be accurate in the items it returns, and in many cases, specific (e.g., a spider that only returns a certain type of information, such as the gaming spiders on www.enfused.com).

  • Relatively up-to-date
    This depends on the technique you use to implement the spider (see the section below), but a spider should return up-to-date information, or at least reasonably so. There's no point (in most cases) in having a spider if it only returns items that are 5 years old.

  • Relatively quick
    The point of a spider is to make information gathering faster. It doesn't matter how accurate your spider is if it takes forever to return results.

Techniques

There are a few ways to spider. The first, which I'll call general spidering, simply grabs a page, and searches it for whatever you're looking for - for instance, a search phrase. The second, specific spidering, grabs only a certain portion of a page. This scenario is useful in cases where you might want to grab news headlines from another site.

General spidering is the easier of the two. First of all, you don't need to have any knowledge of the page beforehand. Simply look within that page for your search term, and links to other pages. If you want to get fancy, you can build in functionality to ignore links that are within the same site.

A specific spider usually requires you to have some knowledge of the page beforehand, such as table layout. For instance, if you're looking for news headlines on a page, then you should know what HTML tags delimit the headlines, so you only search the right portion of the page. In this case, it is usually not important to spider each link on the page, especially since your spider might not work on different pages.

There are also two different times you can run a spider: beforehand and in real time. Doing it beforehand means that any information you collect while your spider is running is stored in a database for access later. You obviously won't have the most recent data, but if you run the spider often enough, it won't matter.

Doing it in real time means that you don't store any information - you run the spider every time you need it. For instance, if you had a search function on your web site, spidering in real time would mean that whenever a user enters a search term and presses submit, you would run the spider, versus simply querying a database of items created beforehand. While this will ensure that you always have the latest data, this option is usually not preferred because of the time required to spider and return anything of value. Use this option only when the material you are spidering is very time sensitive.

From an ASP?

So how can you implement a spider from an Active Server Page? With the magic of the Internet Transfer Control (ITC). This control, provided by Microsoft, allows you to access internet resources from an ASP (check here for a good reference). You can create this object in an ASP and use it to grab web pages, access FTP servers, and even submit POST headers. (Note: for this article, we will only be focusing on the first capability listed here.)

There are a few drawbacks, however. For one thing, Active Server Pages are not allowed to access the Windows registry, which means that certain constants and values that the ITC normally stores there will not be available. Normally, you can get around this issue by not allowing the ITC to use default values - specify the values every time.

Another, more serious, problem involves licensing issues. ASPs do not have the ability to invoke the license manager (a feature of Windows that makes sure components and controls are being used legally). The license manager checks the key in the actual component, and compares it to the one in the Windows registry. If they're not the same, the component won't work. Therefore, if you decide to deploy your ITC to another computer that doesn't have the necessary key, it breaks. A way around this is to bundle up the ITC in another VB component that basically duplicates the ITC's methods and properties, and then deploy that. It's a horrible pain, but unfortunately must be done. Read this MSDN article for more info.

Show me some examples!

You can create and set up the ITC with the following code:

Set Inet1 = CreateObject("InetCtls.Inet")
Inet1.Protocol = 4          'HTTP
Inet1.AccessType = 1        'direct connection to the internet
Inet1.RequestTimeout = 60   'in seconds
Inet1.URL = strURL
strHTML = Inet1.OpenURL     'grab the HTML page as a string

strHTML now holds the entire HTML content of the page specified by strURL. To create a general spider, you can now do a simple call to the InStr() function to determine whether the string you're looking for is there. You can also look for href tags, parse out the actual URL, set it as the URL property of the internet control, and open up another page. The best way to look through all the links this way would be to use recursion (see this article for a lesson on recursion).
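To make that concrete, here is a rough sketch of a general spider that checks each page for a search term and recursively follows absolute links. This is my own illustration rather than code from the article; the SpiderPage name, the depth limit, and the use of VBScript's RegExp object to pull out links are all assumptions:

Sub SpiderPage(strURL, strSearch, intDepth)
    If intDepth > 2 Then Exit Sub   'keep the recursion from running away

    'Grab the page, exactly as in the ITC snippet above
    Dim Inet1, strHTML
    Set Inet1 = CreateObject("InetCtls.Inet")
    Inet1.Protocol = 4
    Inet1.AccessType = 1
    Inet1.RequestTimeout = 60
    Inet1.URL = strURL
    strHTML = Inet1.OpenURL

    'Is the search term anywhere on this page?
    If InStr(1, strHTML, strSearch, vbTextCompare) > 0 Then
        Response.Write "Found """ & strSearch & """ at " & strURL & "<br>"
    End If

    'Parse out each absolute link and spider it in turn
    Dim objRegExp, colMatches, objMatch
    Set objRegExp = New RegExp
    objRegExp.IgnoreCase = True
    objRegExp.Global = True
    objRegExp.Pattern = "href=""(http://[^""]+)"""
    Set colMatches = objRegExp.Execute(strHTML)
    For Each objMatch In colMatches
        SpiderPage objMatch.SubMatches(0), strSearch, intDepth + 1
    Next
End Sub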

Note, however, that while this method is pretty easy to implement, it is not very accurate or robust. Many search engines out there today perform additional logic checks, such as the number of times a phrase appears in a page, the proximity of related words, and some even claim to judge the context of the search phrase. I'll leave these to you as you explore spiders. For more info on detailed searches, here's a good article on creating a spider program to rival "Ask Jeeves."
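As a small illustration of the first of those checks, counting how many times a phrase appears in a page takes only a few lines of VBScript. This is a sketch of my own, not code from the article:

'Count case-insensitive occurrences of strPhrase within strHTML
Function CountOccurrences(strHTML, strPhrase)
    Dim intCount, intPos
    intCount = 0
    intPos = InStr(1, strHTML, strPhrase, vbTextCompare)
    Do While intPos > 0
        intCount = intCount + 1
        intPos = InStr(intPos + Len(strPhrase), strHTML, strPhrase, vbTextCompare)
    Loop
    CountOccurrences = intCount
End Function

Pages with a higher count could then be ranked above pages where the phrase appears only once.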

A specific spider is a bit more complicated. As we mentioned earlier, a specific spider will grab a certain portion of a page, and that requires knowing ahead of time which portion. For instance, let's look at the following HTML page:

<HTML>
<HEAD>
<TITLE>My News Page</TITLE>
<META Name="keywords" Content="News, headlines">
<META Name="description" Content="The current news headlines.">
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#FF3300"
VLINK="#CC0000" ALINK="#0000FF">
<p><h3>Headlines</h3></p>
<!--put headlines here-->
<a href="/news/8094.asp">Stocks prices fall</a>
<a href="/news/8095.asp">New movies today</a>
<a href="/news/8096.asp">Bush and Gore to debate tonight</a>
<a href="/news/8097.asp">Fall TV lineup</a>
<!--end headlines-->
</BODY>
</HTML>

In this page, you really only care about the stuff between the "put headlines here" and "end headlines" comment tags. You could build a function that would return only this section:

Function GetText(strText, strStartTag, strEndTag)
    Dim intStart, intEnd
    'Find the start tag (case-insensitive)
    intStart = InStr(1, strText, strStartTag, vbTextCompare)
    If intStart > 0 Then
        'Skip past the start tag to the first character of the content
        intStart = intStart + Len(strStartTag)
        'Find the end tag, starting the search at the content
        intEnd = InStr(intStart, strText, strEndTag, vbTextCompare)
        If intEnd > 0 Then
            GetText = Mid(strText, intStart, intEnd - intStart)
        Else
            GetText = " "
        End If
    Else
        GetText = " "
    End If
End Function

Using the example of creating the ITC control above, you would simply pass in strHTML, "<!--put headlines here-->", and "<!--end headlines-->" as parameters to the GetText function.
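For instance, a minimal call, reusing the strHTML variable from the ITC snippet earlier, might look like this:

'Pull the headline block out of the page grabbed by the ITC
Dim strHeadlines
strHeadlines = GetText(strHTML, "<!--put headlines here-->", "<!--end headlines-->")

'Quick sanity check: write the raw fragment back to the browser
Response.Write Server.HTMLEncode(strHeadlines)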

Note that the start and end tags do not have to be actual HTML tags - they can be any text delimiters you wish. Often, you won't find nice HTML tags to delimit the sections you're looking for. You'll have to use whatever is available - for instance, your start and end tags could look like:

strStartTag = "/td><td><font face=""arial"" size=""2""><p><b><u>"
strEndTag = "<p></td></tr><tr><td><o:ums>"

Make sure to find something unique in the HTML page so that you extract exactly what you need. You can also follow the links in the portion of text you return, but beware that if you don't know the format of those pages, your spider could return nothing.
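If you do decide to follow (or just display) the links inside the fragment that GetText returns, one option is VBScript's built-in RegExp object. This is a sketch of my own, assuming the strHeadlines variable from the snippet above and links in the simple <a href="...">text</a> form used in the sample page:

Dim objRegExp, colMatches, objMatch
Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Global = True
'Capture the href value and the anchor text of each link in the fragment
objRegExp.Pattern = "<a href=""([^""]+)"">([^<]+)</a>"

Set colMatches = objRegExp.Execute(strHeadlines)
For Each objMatch In colMatches
    'SubMatches(0) is the URL, SubMatches(1) is the headline text
    Response.Write objMatch.SubMatches(1) & " (" & objMatch.SubMatches(0) & ")<br>"
Next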

Storing the info

In most cases, you're going to want to store the information that you collect in a database for easy access later. Your needs here may vary widely, but here are a few things to keep in mind:

  • Check for the latest information in your database
    If you run this spider often to check a site for new headlines, make sure you take note of the newest headline that is already in your database. Then compare that to what the spider returns, and only add the new ones. That way, you won't end up with a lot of duplicate data in your database (see the sketch after the table below).
  • Update information
    You may not want to add new information to your database at all. For instance, if you are maintaining an online index of US state populations, then you'll only want to update the information in your database - there will never be a need to insert new information in the table (until we get a new state, that is).
  • Store everything you need, and build what you don't have
    For instance, if you spider headlines, make sure you also grab the links that the headlines point to, and store them in your database. If there are no links supplied, you may need to build one. For example, I'm spidering headlines from www.yoursite.com to display on www.mysite.com. If a headline links to a story that resides on your web site, I will also have to store http://www.yoursite.com in front of whatever link is on your server in my database, so that the links work correctly.
A link on www.yoursite.com...    On www.mysite.com, turns into...
/stories/news/980345.html        http://www.yoursite.com/stories/news/980345.html
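Here is one way that duplicate check from the first point above might look with ADO. The connection string, the table name (tblHeadlines), and the column names (Headline, Link) are assumptions for illustration only:

'Assumed schema: tblHeadlines(Headline, Link); the connection string is illustrative
Dim objConn, objRS, strHeadline, strLink
Set objConn = Server.CreateObject("ADODB.Connection")
objConn.Open "Provider=SQLOLEDB;Data Source=(local);Initial Catalog=Spider;Integrated Security=SSPI;"

strHeadline = "Stocks prices fall"                    'value parsed out by the spider
strLink = "http://www.yoursite.com/news/8094.asp"     'link with the site name built in

'Only insert the headline if it isn't already in the database
Set objRS = objConn.Execute("SELECT Headline FROM tblHeadlines WHERE Headline = '" & _
    Replace(strHeadline, "'", "''") & "'")
If objRS.EOF Then
    objConn.Execute "INSERT INTO tblHeadlines (Headline, Link) VALUES ('" & _
        Replace(strHeadline, "'", "''") & "', '" & _
        Replace(strLink, "'", "''") & "')"
End If
objRS.Close
objConn.Close

In a real spider you would loop over every headline returned and run this check for each one.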

Conclusion

This article should give you a very good idea of how to build a complete spider. All of the basic functionality is laid out here; all you have to do is add the bells and whistles.

This type of application begs to be placed in a COM object or in a separate application by itself. Placing this functionality in an ASP would be very convenient, but you would gain speed and security benefits by moving your code elsewhere (not to mention that it would be easier to package and sell).

Happy scripting!

