XPath, unleashed — coming to Internet Explorer 5+ HTML DOM near you

While coding away in JavaScript, reshaping or augmenting your HTML through the DOM, have you ever wondered why there is no built-in support for XPath? Actually, there is: Mozilla has pretty solid support for DOM Level 3 XPath, right at your fingertips through the document.evaluate method:

document.evaluate(expression, contextNode, resolver, type, result);

You’ll find the implementation details over at w3.org or mozilla.org, but for starters, expression is the XPath expression string and contextNode is the DOM node that the expression is evaluated against. The rest can be (and most often will be) specified as zeros and nulls. For instance, this expression gets you all div nodes in your document whose class attribute is set to DateTime:


var iterator = document.evaluate("//div[@class='DateTime']", document, null, 0, null);

By default, the method returns an iterator, which can be worked through like so:


var item;
while ((item = iterator.iterateNext()))
{
    // do something with item
}

As you might’ve guessed, the iterator returns null once all items are exhausted. By changing the type parameter, you can make the method return other types, such as a string, a boolean, a number, or a snapshot. A snapshot is kind of like an iterator, except the DOM is free to change while the snapshot still exists. If the DOM changes while you are still working through an iterator, the iterator will throw an exception.
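Here is a minimal sketch of the snapshot variant. It assumes a browser that exposes the standard DOM Level 3 XPath members snapshotLength and snapshotItem; snapshotToArray and the "seen" class name are hypothetical, made up for illustration. The value 7 is what the specification assigns to XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in DOM Level 3 XPath.
var ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical helper (not part of any library): copies a snapshot result
// into a plain array. The document is passed in so nothing here depends on
// a global.
function snapshotToArray(doc, expression, contextNode) {
    var result = doc.evaluate(expression, contextNode || doc, null,
                              ORDERED_NODE_SNAPSHOT_TYPE, null);
    var nodes = [];
    for (var i = 0; i < result.snapshotLength; i++) {
        nodes.push(result.snapshotItem(i));
    }
    return nodes;
}

// Browser-only usage: the class change below would invalidate an iterator
// mid-walk, but a snapshot is unaffected by it.
if (typeof document !== "undefined" && document.evaluate) {
    var stamps = snapshotToArray(document, "//div[@class='DateTime']");
    for (var j = 0; j < stamps.length; j++) {
        stamps[j].className += " seen"; // "seen" is an illustrative class name
    }
}
```

The helper only reads the snapshot; any changes to the matched nodes happen afterwards, in the caller.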

Well, I thought it mighty unfair that Internet Explorer does not support such functionality. I mean, you can do XPath in JavaScript there, but only in two cases (that I know of):

1) As a call to an Msxml.DOMDocument object, created using the new ActiveXObject() statement.

2) If the HTML document was generated as a result of a client-side XSL transformation from an XML file.
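The first case can be sketched roughly as follows. It uses the MSXML calls setProperty, loadXML and selectNodes. The helper name selectFromXml and its injectable createDom argument are hypothetical, added only so the function can be tried outside Internet Explorer. Note that it queries a separate XML document, not the live HTML page.

```javascript
// Hypothetical helper: runs an XPath query against an XML string via MSXML.
// createDom is an illustrative injection point; when omitted, the ActiveX
// object mentioned in case 1 is created (this only works in IE).
function selectFromXml(xmlText, xpath, createDom) {
    var dom = createDom ? createDom() : new ActiveXObject("Msxml2.DOMDocument");
    // Older MSXML versions default to the XSLPattern dialect, so ask for XPath.
    dom.setProperty("SelectionLanguage", "XPath");
    if (!dom.loadXML(xmlText)) {
        throw new Error("XML failed to parse");
    }
    var list = dom.selectNodes(xpath);
    var nodes = [];
    for (var i = 0; i < list.length; i++) {
        nodes.push(list.item(i));
    }
    return nodes;
}

// IE-only usage; skipped wherever ActiveXObject is unavailable.
if (typeof ActiveXObject !== "undefined") {
    var items = selectFromXml("<ul><li>one</li><li>two</li></ul>", "//li");
}
```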

Neither case offers us a solution if we want to use XPath in plain-vanilla HTML. So, I decided to right the wrong. Here is the first stab at it: a JavaScript implementation of DOM Level 3 XPath for Microsoft Internet Explorer (all zipped up for your review). Here is the sample, which should run in exactly the same way in IE and Mozilla.

Now counting all links in your document is just one XPath query:


var linkCount = document.evaluate("count(//a[@href])", document, null, XPathResult.NUMBER_TYPE, null).getNumberValue();

So is getting a list of all images without an alt attribute:


var imgIterator = document.evaluate("//img[not(@alt)]", document, null, XPathResult.ANY_TYPE, null);

So is finding the first LI element of every UL element:


var firstLiIterator = document.evaluate("//ul/li[1]", document, null, XPathResult.ANY_TYPE, null);
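Each of these iterators is consumed the same way, so a tiny helper can turn any of them into an array. This is only a sketch: drainIterator is a hypothetical name, and it assumes nothing beyond an object with an iterateNext() method that returns null when exhausted.

```javascript
// Hypothetical helper: drains any object exposing iterateNext() into an array.
function drainIterator(iterator) {
    var nodes = [];
    var node;
    while ((node = iterator.iterateNext())) {
        nodes.push(node);
    }
    return nodes;
}

// Browser-only usage: collect the images that lack an alt attribute.
// 0 is the numeric value of XPathResult.ANY_TYPE.
if (typeof document !== "undefined" && document.evaluate) {
    var missingAlt = drainIterator(
        document.evaluate("//img[not(@alt)]", document, null, 0, null));
    // missingAlt.length is now the number of images without alt
}
```

Once the iterator has been drained, the collected nodes can be modified freely, because no further iterateNext() calls are made.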

In my opinion, having XPath in HTML DOM opens up a whole new level of flexibility and just plain coding convenience for JavaScript developers.

I must say, I haven’t been able to resolve all implementation issues yet. For example, I couldn’t find a pretty way to implement properties of XPathResult. How do you make a property accessor that may throw an exception in JScript? As a result, I had to fall back to the Java model of binding to properties.
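Some code may need to run against both the getter-style binding described here and a native implementation that exposes the standard numberValue property. A small wrapper can hide the difference. This is a sketch under that assumption; readNumber is a hypothetical name, not part of the implementation above.

```javascript
// Hypothetical helper: reads a numeric XPath result from either binding style.
function readNumber(result) {
    if (typeof result.getNumberValue === "function") {
        return result.getNumberValue(); // Java-style getter, as in the sample above
    }
    return result.numberValue;          // property defined by DOM Level 3 XPath
}

// Browser-only usage: count the links in the document.
// 1 is the numeric value of XPathResult.NUMBER_TYPE.
if (typeof document !== "undefined" && document.evaluate) {
    var linkTotal = readNumber(
        document.evaluate("count(//a[@href])", document, null, 1, null));
}
```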

So guys, take a look. I can post more on details of implementation, if you’d like. Just let me know.

5 thoughts on “XPath, unleashed — coming to Internet Explorer 5+ HTML DOM near you”

  1. Hi Dimitri,

    This looks very interesting. Do you have the source code somewhere still? The link to it is broken.

    Regards, Tom

  2. Hi Dimitri,

    This was exactly what I needed! I’ve used it in a C# WinForms application to allow me to parse HTML with the Html Agility Pack (built-in XPath support), while allowing me to select and modify specific elements (without ids) that are previewed in a WebBrowser control.

    The only modification I had to make to the code was adding a line to the _XPathMsxmlDocumentHelper activateDom function.

    dom.setProperty("SelectionLanguage", "XPath");

    Apparently the default selection language uses 0-based indices (i.e. “//p[0]/a[0]” to select the first anchor in the first paragraph) whereas XPath is defined as using 1-based indices (i.e. “//p[1]/a[1]” to select the first anchor in the first paragraph). There may be other differences, but this was the one that tripped me up.

    I found the setting at this URL in the JavaScript example:
    http://msdn.microsoft.com/en-us/library/windows/desktop/ms754523(v=vs.85).aspx#snippetGroup
    (Search: Msxml2.DOMDocument selectNodes)

    I’m not sure if it was intentional, so I figured I’d just raise the point in case it wasn’t.

    Regards,
    Trevor

  3. Hi, this is exactly what I needed! But the “here” links are broken. Where can I get the solution? Thanks.
