We encountered XML namespaces briefly in Chapter 3. If you
recall, we established that namespaces are a means of naming some vocabulary
for the purpose of reusing elements it contains in another vocabulary. If
someone has published an excellent vocabulary for describing demographic
information and we are working on a vocabulary for an advertising application,
we might wish to reuse the demographics tags within our own vocabulary. We
should have a means of pointing back to some description of the vocabulary,
both for the purposes of attribution and for maintaining a link to the
authoritative source for the vocabulary. At a minimum, we want a way to identify
a particular tag name usage as being defined in the demographics vocabulary.
This prevents confusion when the same tag name is used by multiple
vocabularies. If an element is marked as belonging to a particular vocabulary,
the meaning should be unambiguous.
Declarations
Let's review the syntax for XML namespaces. We declare a
namespace as follows:
<tagname xmlns[:name]="URI">
The namespace applies to the named tag and its contents. If
we are going to deal with a number of tags from the same namespace, it is
convenient to declare the namespace at the highest possible level. Tags that
are not qualified are assumed to belong to the containing namespace. Note that
the URI need not refer to a DTD or other online definition. While that is
useful, it is not a requirement. The URI must simply provide a unique
designator for the namespace.
With that in mind, here are some valid namespace
declarations:
You may not be familiar with the prefix urn. It stands for Uniform Resource
Name and is a specific kind of Uniform Resource Identifier (URI). Unlike a URL,
which is another specific type of URI, a URN just provides a name. Presumably,
the name is widely understood. In the examples above, the stuff and pr prefixes
will be used to denote namespaces, but we have not provided any way for the
curious reader to find out more about what they mean. Hopefully, they are as
familiar to the recipient of the declarations as HTML and other universally
recognized namespaces.
If we want to use a number of namespaces liberally throughout a document, we
should declare them early and provide a namespace prefix with which to qualify
individual element and attribute names. If we want to have an XML document
whose root element is <TRANSACTION> and which borrows names from the BANKING
and FINANCE namespaces, we should declare both namespaces in the root element:
Tags use namespace declarations in one of two ways. They
explicitly use the namespace if the tag name is qualified by the prefix
specified in the declaration. Our <TRANSACTION>
example declared two namespace prefixes, bank
and fin. Extending our <TRANSACTION>
example:
<fin:instrument>certificate of deposit</fin:instrument>
</TRANSACTION>
The institution element comes from the BANKING namespace, while the
instrument element comes from the FINANCE namespace.
Alternatively, an element name is implicitly qualified by the default
namespace declaration in whose scope it appears. If we declare a namespace
without a prefix in some element, any contained element that is not otherwise
qualified is assumed to belong to the declared namespace. Under the Namespaces
in XML Recommendation, unprefixed attribute names are not affected by a default
declaration. Suppose we changed our <TRANSACTION> example just a bit:
<fin:instrument>certificate of deposit</fin:instrument>
</TRANSACTION>
Note we've omitted a namespace prefix for the BANKING namespace. The
institution element has no prefix, but it is implicitly assumed to come from
the BANKING namespace because it is contained within the scope of that
declaration.
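The two lookup rules can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the function name resolveNamespace, the scope object, and both URNs are hypothetical and do not come from the original example. Real parsers also track nested scopes, which this sketch ignores.

```javascript
// Hypothetical scope: what a parser might know inside <TRANSACTION>
// after reading one default declaration and one prefixed declaration.
// Both URNs are placeholders invented for this sketch.
const scope = {
  "": "urn:example-banking",   // default namespace (declared without a prefix)
  fin: "urn:example-finance"   // namespace bound to the fin prefix
};

// Return the namespace URI for an element name, or null if none applies.
function resolveNamespace(qualifiedName, declarations) {
  const colon = qualifiedName.indexOf(":");
  // Explicit use: the part before the colon selects the declaration.
  // Implicit use: no colon, so fall back to the default declaration.
  const prefix = colon === -1 ? "" : qualifiedName.slice(0, colon);
  return Object.prototype.hasOwnProperty.call(declarations, prefix)
    ? declarations[prefix]
    : null;
}

console.log(resolveNamespace("fin:instrument", scope));
console.log(resolveNamespace("institution", scope));
console.log(resolveNamespace("zz:unknown", scope));
```

The unprefixed institution name falls through to the default entry, while a prefix with no declaration in scope resolves to nothing at all.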
Searching for Namespace Declarations
Namespaces are only intended to uniquely name elements and
attributes. Enumerating the namespaces used within a document can tell us
something about the meaning of the document, however. The fact that a namespace
is used within a document indicates that some meaningful term has been borrowed
from another vocabulary. If we can identify the namespaces referenced in a
document, we will see what domains contributed to its meaning.
The first task is to enumerate all the namespace declarations within a given
XML document. We could certainly traverse the entire DOM parse tree and examine
each attribute found for the substring xmlns. Fortunately, we don't have to go
to that much trouble. MSXML supports the Extensible Stylesheet Language (XSL)
draft, and XSL includes a powerful pattern matching language. We can use this
to enumerate all the elements matching a particular pattern: in our case, every
attribute that declares a namespace.
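For comparison, the brute-force traversal mentioned above might look roughly like the following. This is a minimal sketch, not MSXML code: the tree shape ({ name, attributes, children }), the element name Root, and the function name findDeclarations are all invented for illustration, and the first declaration is assumed to be a default (unprefixed) one.

```javascript
// A tiny stand-in for a parse tree. The shape is hypothetical and is
// not the MSXML object model.
const doc = {
  name: "Root",
  attributes: {
    "xmlns": "urn:myschema-first",      // assumed to be a default declaration
    "xmlns:ii": "urn:myschema-second"
  },
  children: [
    { name: "ii:More", attributes: { id: "1" }, children: [] }
  ]
};

// Walk every element and collect the attributes that declare a namespace.
function findDeclarations(node, found = []) {
  for (const [attrName, value] of Object.entries(node.attributes)) {
    // Either the bare name xmlns, or xmlns followed by a colon and a prefix.
    if (attrName === "xmlns" || attrName.startsWith("xmlns:")) {
      found.push({ name: attrName, uri: value });
    }
  }
  for (const child of node.children) {
    findDeclarations(child, found);
  }
  return found;
}

console.log(findDeclarations(doc));
```

The walk visits every element regardless of depth, which is exactly the bookkeeping a pattern query lets us avoid.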
This topic is well over the cutting edge. Not only is XSL a
work in progress, but the XSL pattern matching support in MSXML includes some
extensions that have been submitted to the W3C as a note regarding a query
language for XML. The syntax that follows will certainly change and is meant to
indicate one way a query language can be used to help us in our search for
metadata.
Consider the following fragment of an XML document. We have
included two namespace declarations within the root element.
We can use the selectNodes
method of the node object to apply an XSL pattern string to the document and
receive an enumeration of all attributes matching the pattern. Assuming we've
created an instance of MSXML in the variable parser and loaded the document
above successfully, we can make the following call:
The selectNodes method takes
a string conforming to the rules of the XSL pattern matching syntax. Generally
speaking, the string above breaks down into a search scope, an indication of
what we are searching for, and a filter.
The scope is denoted by //, meaning the entire document starting from the
root. A single slash / would denote the root itself, while ./ would indicate
the current context. The context is the point from which we start the search.
We are searching from the root element, but we want to look at the entire
document.
The symbols @* mean any attribute. The filter has to limit the search because
we don't know exactly what attribute names we are searching for: a declaration
can introduce any namespace prefix, and we cannot know in advance what that
prefix will be.
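Everything within the square brackets is the filter. nodeName() is a built-in
function of MSXML that can be evaluated at runtime. It will give us the name of
the attribute. The constraint >= 'xmlns' is a string comparison: it selects any
attribute whose name sorts at or after xmlns, which includes every name that
begins with the substring xmlns. This will match declarations that have
prefixes defined, as well as those that do not define a prefix. It can also
admit other names that sort later, a point we return to when we search for
qualified names.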
If we apply the selection call to the sample XML document,
we find we have an enumeration of two items. These will be node objects, so we
can get their text property (this is
equivalent to childNodes.item(0).nodeValue).
Calling it on each of our two namespace declarations yields:
urn:myschema-first
urn:myschema-second
If we called for the xml
property of the node object instead of the text
property, we would get the entire attribute declaration, e.g., xmlns:ii="urn:myschema-second".
Other Namespace Support in MSXML
The DOM support in MSXML allows us to look at the parts of
an element or attribute name. The nodeName
property gives us the entire qualified name. prefix
gives us the namespace prefix, if any. basename
yields the unqualified name. For the element <ii:More>,
the results are:
Property    Result
nodeName    ii:More
prefix      ii
basename    More
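The relationship between the three properties can be mimicked in plain JavaScript by splitting the name at the colon. This is a rough sketch of the idea only: splitQualifiedName is a hypothetical helper written for this illustration, not part of MSXML.

```javascript
// Split a qualified name into the three parts described above.
// Hypothetical helper: it mirrors the idea of the nodeName, prefix and
// basename properties but is not the MSXML implementation.
function splitQualifiedName(nodeName) {
  const colon = nodeName.indexOf(":");
  return {
    nodeName: nodeName,
    // Text before the colon, or an empty string when there is no prefix.
    prefix: colon === -1 ? "" : nodeName.slice(0, colon),
    // Text after the colon, or the whole name when there is no prefix.
    basename: colon === -1 ? nodeName : nodeName.slice(colon + 1)
  };
}

console.log(splitQualifiedName("ii:More"));
console.log(splitQualifiedName("xmlns:ii"));
```

Applied to a declaration attribute such as xmlns:ii, the same split yields xmlns as the prefix and ii as the basename.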
Enumerating Namespace Usage
Now we'll put this all together to analyze an XML document
for foreign namespace usage. We want to list all the namespace declarations in
a document, together with the qualified elements and attributes taken from them
and used in the document.
The following code comes from the sample file NSTest.html. All code samples
are available for download from our Web site at http://www.wrox.com, and this
example can be run from our site at http://webdev.wrox.co.uk/books/2270/.
The selectNodes call we saw
before gives us a list of declarations:
Now we have a collection of namespace declarations. We want
to iterate through the collection and perform searches for qualified elements
and attributes. We'll continue to use the selectNodes
call and search the entire document. We obtain a collection of qualified
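elements with this call:
rQualElements = parser.documentElement.selectNodes("//*[nodeName() >= '"
                + declaration.basename + ":']");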
Note we've dropped the @
character from the search pattern. The unqualified *
character indicates that we are looking for elements. We use the basename of
the declaration together with a colon to give us the qualifying prefix. Recall
that the declaration has the prefix xmlns,
and a basename consisting of the prefix to use to qualify names from this
namespace. There's a problem, however. Since the XSL pattern matching syntax
doesn't allow us to use wildcards in our nodeName
selection, we may get some element names that aren't from this namespace. For
example, if our prefix is aa, then a qualified node
zz:XYZ will match the search. The collection
we obtain from the search is guaranteed to include all the names for which we
are searching, but may include other names as well. Consequently, before we
list any element names, we have to test the element's prefix against the
declaration's basename.
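// The search may return names from other namespaces, so list only
// those elements whose prefix matches this declaration
for (nj = 0; nj < rQualElements.length; nj++)
{
   element = rQualElements(nj);
   if (element.prefix == declaration.basename)
   {
      ListLine(results, element.basename, "blue", tabsize, linesize);
      linesize += 4;
   }
}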
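To see why the second test is needed, the two stages can be imitated on a plain list of names. This is a sketch under stated assumptions: the names and the declaredPrefix variable are made up for the illustration, and JavaScript's own string ordering is assumed to behave like the pattern comparison, which may not hold exactly in MSXML.

```javascript
// Hypothetical element names standing in for a document's contents.
const names = ["aa:Widget", "aa:Gadget", "zz:XYZ", "Plain", "ab:Other"];
const declaredPrefix = "aa"; // plays the role of declaration.basename

// Stage 1: the coarse comparison, imitating nodeName() >= 'aa:'.
// Anything that sorts at or after "aa:" gets through.
const candidates = names.filter((name) => name >= declaredPrefix + ":");

// Stage 2: keep only names whose prefix really is the declared one.
const matches = candidates.filter(
  (name) => name.startsWith(declaredPrefix + ":")
);

console.log(candidates);
console.log(matches);
```

The first stage lets zz:XYZ and ab:Other through alongside the two aa names; only the explicit prefix check removes them.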
We use a similar approach to obtain the qualified attribute
names:
rQualAttributes = parser.documentElement.selectNodes("//@*[nodeName() >= '"
                  + declaration.basename + ":']");
// some formatting code here
for (nk = 0; nk < rQualAttributes.length; nk++)
{
   attribute = rQualAttributes(nk);
   if (attribute.prefix == declaration.basename)
   {
      ListLine(results, attribute.basename, "red", tabsize, linesize);
      linesize += 4;
   }
}
We'll use the following XML for our test document:
We can clearly see the schema-process-engineering
namespace is used heavily. A human reader might be able to make something of
the names, particularly if the namespace creator used names descriptive of a particular
problem domain.
An automated agent might be given a list of namespace URIs
in which a user is interested. Given that and a usage listing such as we
produced above, the agent could assign a relevance priority to each document it
encounters. Namespaces alone don't give much metadata, but they do give us a
clue to what a document might be talking about. Here's the complete listing for
our enumeration function: