Since writing my article on using Microsoft's Index Server from ASP
(Part 1, Part 2),
I've gotten quite a few questions about why people aren't getting the
results they expect. There are a number of reasons why this might
happen, but one of the most common is that your query includes one
or more words that Index Server considers "noise words". This article
will explain what noise words are and show you how to edit the list of
words that Index Server treats as noise words.
The Email
You can thank Yu Zhang for finally getting me to write this article.
I've answered quite a few questions about noise words, but his came
at the issue from a slightly different angle. Here's his email:
Hi John,
Your article "Using Index Server to Search Your Web Site",
helped me a lot, but, I'm having a problem with 'reserved words' such
as 'i' and 'about'. To deal with this problem, I need to find the
list of Index Server's reserved words so I can filter them.
Do you know where to find the list?
Thanks, Yu
As it turns out, I didn't know where to find the list... so, I found out.
Having taken the time to do so, I figured I should share the info with everyone.
What are Noise Words?
Noise words are words that are very common and yet have very little meaning.
Words like 'a', 'an', 'the', 'to', 'so', 'with', etc. are found in almost all
documents but provide very little information about the actual meaning of the
document. Therefore, there is very little value to be gained from knowing that
a document contains any of them. Because of this, Index Server is designed to
ignore these types of words when it builds an index from a set of documents.
So, to answer Yu's question from above, you can find the list of all the words that
Index Server considers noise words in the System32 folder of your Windows directory.
There you'll find a bunch of files named noise.xxx, where xxx represents the language
in question. For US English, the file name is noise.enu. On my laptop, the complete
path to this file is C:\Windows\System32\noise.enu. The file is a plain text file and
you can open and edit it using the text editor of your choice (Windows' Notepad works fine).
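If what you want is what Yu asked for, filtering noise words out of a query before it reaches Index Server, the idea can be sketched roughly as follows. This is only an illustration in Python, not code from the original articles. It assumes the noise file is a list of words separated by spaces or line breaks and that it can be read as Latin-1 text; the function names (parse_noise_words, load_noise_words, strip_noise_words) are made up for this example, and the path is the one from my laptop, so yours may differ.

```python
from pathlib import Path

# Assumed location: the article gives this path for one machine only.
NOISE_FILE = Path(r"C:\Windows\System32\noise.enu")


def parse_noise_words(text):
    """Return the set of lowercased words found in noise-file text.

    Assumes words are separated by whitespace (spaces or line breaks).
    """
    return {word.lower() for word in text.split()}


def load_noise_words(path=NOISE_FILE, encoding="latin-1"):
    """Read a noise-word file from disk and parse it.

    The encoding is an assumption; adjust it if your file differs.
    """
    return parse_noise_words(Path(path).read_text(encoding=encoding))


def strip_noise_words(query, noise_words):
    """Drop every query term that appears in the noise-word set."""
    kept = [term for term in query.split() if term.lower() not in noise_words]
    return " ".join(kept)


if __name__ == "__main__":
    # A small inline sample stands in for the real file here.
    noise = parse_noise_words("a an the to so with\nabout i")
    print(strip_noise_words("I want to read about Index Server", noise))
    # prints: want read Index Server
```

On a real server you would call load_noise_words() once and reuse the resulting set for every incoming query.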
Editing the List of Noise Words
So why would you want to add or remove a word? Let's say your site is named "ASP 101"
and every page title includes the phrase "ASP 101". In that case, searching for "ASP"
would be pretty pointless: it would return every single document, which really
defeats the point of searching, now doesn't it? To avoid this problem, we might add
"ASP" and "101" to the list of noise words. Index Server would then ignore them while
indexing, which should produce a smaller index and faster searches. It would also
prevent users from searching for "ASP" and getting back an unmanageable set of results.
Editing the noise word list is basically as simple as editing the text file.
As always, you should make a backup copy before you do so. There are a few other
caveats, but they are all discussed in Microsoft's Knowledge Base Article #247561,
"How to Edit Index Server Noise-Word Lists", so I won't go into them here.
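For anyone who would rather script the edit than open Notepad, here is a minimal sketch of the backup-then-edit step in Python. It rests on the same assumptions as before: a whitespace-separated word list, Latin-1 text, and one new word per line being an acceptable way to extend the file. The function name add_noise_words and the ".bak" backup name are my own choices for illustration, and the sketch does nothing about the other caveats in the Knowledge Base article.

```python
import shutil
from pathlib import Path


def add_noise_words(noise_path, new_words, encoding="latin-1"):
    """Back up a noise-word file, then append any words it lacks.

    Returns the list of words that were actually added (lowercased).
    """
    noise_path = Path(noise_path)
    backup_path = noise_path.with_name(noise_path.name + ".bak")
    if not backup_path.exists():
        # Keep the untouched original; never overwrite an earlier backup.
        shutil.copy2(noise_path, backup_path)

    text = noise_path.read_text(encoding=encoding)
    existing = {word.lower() for word in text.split()}

    added = []
    for word in new_words:
        key = word.lower()
        if key not in existing:
            existing.add(key)
            added.append(key)

    if added:
        # Make sure the new words start on their own line.
        separator = "" if not text or text.endswith("\n") else "\n"
        with noise_path.open("a", encoding=encoding) as handle:
            handle.write(separator + "\n".join(added) + "\n")
    return added
```

Try it on a copy of the file first, for example add_noise_words("noise_copy.enu", ["ASP", "101"]). Writing to the real file in System32 normally requires administrator rights.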
That's All Folks
I hope this article has helped shed some light on the topic of noise words for
all of you using Index Server. And, keep on sending in those questions... someone
has to tell me what you guys want to read about.
As an aside... I just love how much support Microsoft gives Index Server. Check out
all the information at the Index Server Support Center.
I realize it's not their flagship product or anything, but come on guys... give us something!