
This is a Free Software DOM Level 2 implementation, supporting these features: "XML", "Events", "MutationEvents", "HTMLEvents" (won't generate them though), "UIEvents" (also won't generate them), "USER-Events" (a conformant extension), and "Traversal" (optional; no TreeWalker yet). It is intended to be a reasonable base both for experimentation and supporting additional DOM modules as clean layers. You may want to read more about this package or about its features, or about the use of the Sub-DOM extension approach. You should read about DOM functionality to avoid.

About this Package

Design Goals

A number of DOM implementations are available in Java, including commercial ones from Sun, IBM, Oracle, and DataChannel as well as noncommercial ones from Docuverse, OpenXML, and Silfide. Why have another? Some of the goals of this version:

DOM Level 2 support. This was the first generally available implementation of DOM Level 2 in Java.
As extensible as practical. Most things are public and non-final, so new versions of DOM (supporting more of the optional Level 2 features) should get easily layered.
Second implementation syndrome. I can do it simpler this time around ... and heck, writing it only takes a bit over a day once you know your way around. (Debugging is of course a different story, since W3C hasn't provided any conformance test suite and publicly available suites have poor coverage.)
Sanity check the current (last call) L2 draft. Best to find bugs now, when they're relatively fixable. Yes, bugs were found and are still getting reported. Fixes are TBD.
Modularity. Most of the implementations mentioned above are part of huge packages; take all (including bugs, of which some have far too many), or take nothing. I prefer a menu approach, when possible. This code is standalone, not beholden to any particular parser or XSL or XPath code.
OK, I'm a hacker, I like to write code.

It's also on the agenda to make sure this works well with the new GNU Compiler for Java (GCJ). When it's done, GCJ promises to be quite the environment for programming Java, both directly and from C++ using the new CNI interfaces (which really use C++, unlike JNI).

Open Issues

At this writing:

An approximation of XML rules for legal names is used (Unicode rules with minor tweaks) rather than the huge character tables in the XML appendix.
See below for some restrictions on the mutation event support ... some events aren't reported (and likely won't be).
There's no implementation of the TreeWalker traversal API.
The whole thing is new and not fully tested. (However, I will gladly accept patches!)
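To give a feel for what an approximate name check can look like, here is a minimal sketch. It is not this package's actual rule set: the class name NameCheck, the method isLikelyXmlName, and the choice of java.lang.Character tests are all assumptions made for illustration. It leans on Unicode letter and digit classes plus a few punctuation tweaks instead of the tables in the XML appendix, so it will accept or reject some edge-case characters differently than a strict checker would.

```java
final class NameCheck {
    /**
     * Rough test for an XML name. This is an approximation built on
     * Unicode character classes, not the XML 1.0 appendix tables.
     * It works on UTF-16 chars, so supplementary characters are rejected.
     */
    static boolean isLikelyXmlName(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        char first = name.charAt(0);
        // A name must start with a letter, underscore, or colon.
        if (!(Character.isLetter(first) || first == '_' || first == ':')) {
            return false;
        }
        for (int i = 1; i < name.length(); i++) {
            char c = name.charAt(i);
            // Later characters may also be digits, hyphens, or periods.
            boolean ok = Character.isLetterOrDigit(c)
                    || c == '_' || c == ':' || c == '-' || c == '.';
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isLikelyXmlName("svg:rect"));
        System.out.println(isLikelyXmlName("1st"));
    }
}
```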

I ran a profiler a few times and removed some of the performance hotspots, but it's not tuned. Reporting mutation events, in particular, is rather costly -- it started at about a 40% penalty for appendChild calls, and I've got it down around 12%, but it'll be hard to shrink it much further. The overall code size is relatively small, though you may want to be rid of many of the unused DOM interface classes (HTML, CSS, and so on).

Features of this Package

Starting with DOM Level 2, you can really see that DOM is constructed as a bunch of optional modules around a core of either XML or HTML functionality. Different implementations will support different optional modules. This implementation provides a set of features that should be useful if you're not depending on the HTML functionality (lots of convenience functions that don't often buy much except API surface area) and user interface support. That is, browsers will want more -- but what they need should be cleanly layered over what's already here.
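Since implementations differ in which optional modules they support, code can ask before relying on one. The sketch below uses the standard DOMImplementation.hasFeature call. The class name FeatureProbe is hypothetical, and the implementation it probes is whatever JAXP hands back by default (an assumption made to keep the example self-contained), not necessarily this package, so the answers will differ from the feature list given above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;

final class FeatureProbe {
    /** Asks an implementation which Level 2 feature names it claims. */
    static Map<String, Boolean> probe(DOMImplementation impl) {
        String[] features = {
            "XML", "Events", "MutationEvents", "HTMLEvents",
            "UIEvents", "USER-Events", "Traversal"
        };
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String feature : features) {
            result.put(feature, impl.hasFeature(feature, "2.0"));
        }
        return result;
    }

    /** Returns the JAXP default implementation; a stand-in for this package. */
    static DOMImplementation defaultImplementation() {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .getDOMImplementation();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        probe(defaultImplementation())
                .forEach((name, ok) -> System.out.println(name + ": " + ok));
    }
}
```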

Core Feature Set: "XML"

This DOM implementation supports the "XML" feature set, which basically gets you four things over the bare core (which you're officially not supposed to implement except in conjunction with the "XML" or "HTML" feature). In order of decreasing utility, those four things are:

ProcessingInstruction nodes. These are probably the most valuable thing. Handy little buggers, in part because all the APIs you need to use them are provided, and they're designed to let you escape XML document structure rules in controlled ways.
CDATASection nodes. These are of limited utility since CDATA is just text that prints funny. They are of use to some sorts of applications, though I encourage folk to not use them.
DocumentType nodes, and associated Notation and Entity nodes. These appear to be useless. Briefly, these "Type" nodes expose no typing information. They're only really usable to expose some lexical structure that almost every application needs to ignore. (XML editors might like to see them, but they need true typing information much more.) I strongly encourage people not to use these.
EntityReference nodes can show up. These are actively annoying, since they add an extra level of hierarchy, are the cause of most of the complexity in attribute values, and their contents are immutable. Avoid these.
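As a small illustration of the first item in that list, the following sketch creates a ProcessingInstruction and reads it back using only core DOM calls. The class name PiExample, the "render-hint" target, and its data are made up for the example, and the document comes from the JAXP default builder rather than from this package.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

final class PiExample {
    /** Builds a tiny document with one PI placed ahead of the root element. */
    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        doc.appendChild(doc.createElement("report"));
        ProcessingInstruction pi =
                doc.createProcessingInstruction("render-hint", "columns=\"2\"");
        doc.insertBefore(pi, doc.getDocumentElement());
        return doc;
    }

    /** Lists the target and data of each PI that is a direct child of parent. */
    static List<String> describePis(Node parent) {
        List<String> found = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                found.add(pi.getTarget() + " -> " + pi.getData());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        describePis(build()).forEach(System.out::println);
    }
}
```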

Optional Feature Sets: "Events", and friends

Events may be one of the more interesting new features in Level 2. This package provides the core feature set and exposes mutation events. No gooey events though; if you want that, write a layered implementation!

Three mutation events aren't currently generated:

DOMSubtreeModified is poorly specified. Think of this as generating one such event around the time of finalization, which is a fully conformant implementation. This implementation is exactly as useful as that one.
DOMNodeRemovedFromDocument and DOMNodeInsertedIntoDocument are supposed to get sent to every node in a subtree that gets removed or inserted (respectively). This can be extremely costly, and the removal and insertion processing is already significantly slower due to event reporting. It's much easier, and more efficient, to have a listener higher in the tree watch removal and insertion events through the bubbling or capture mechanisms, than it is to watch for these two events.
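The "listener higher in the tree" approach can be sketched with the standard org.w3c.dom.events interfaces. One listener on the root element sees DOMNodeInserted and DOMNodeRemoved events bubble up from anywhere below it. The class name SubtreeWatcher is hypothetical, and the example assumes the JAXP default DOM as a stand-in: that its nodes can be cast to EventTarget and that it fires these two mutation events once a listener is registered.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.events.EventListener;
import org.w3c.dom.events.EventTarget;

final class SubtreeWatcher {
    /** Inserts and removes a node two levels down; returns what the root heard. */
    static List<String> watch() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("root");
        Element list = doc.createElement("list");
        doc.appendChild(root);
        root.appendChild(list);

        List<String> log = new ArrayList<>();
        // The event target is the node that was inserted or removed.
        EventListener listener = evt -> {
            Node target = (Node) evt.getTarget();
            log.add(evt.getType() + ":" + target.getNodeName());
        };

        // Assumption: this implementation's nodes implement EventTarget.
        EventTarget top = (EventTarget) root;
        top.addEventListener("DOMNodeInserted", listener, false); // bubbling phase
        top.addEventListener("DOMNodeRemoved", listener, false);

        Element item = doc.createElement("item");
        list.appendChild(item);  // change happens below the listener
        list.removeChild(item);
        return log;
    }

    public static void main(String[] args) {
        watch().forEach(System.out::println);
    }
}
```

Only one listener registration per event type is needed no matter how large the subtree is, which is the efficiency argument made above.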

In addition, certain kinds of attribute modification aren't reported. A fix is known, but it couldn't report the previous value of the attribute. More work could fix all of this (as well as reduce the generally high cost of childful attributes), but that's not been done yet.

Also, note that it is a Bad Thing™ to have the listener for a mutation event change the ancestry for the target of that event. Or to prevent mutation events from bubbling to where they're needed. Just don't do those, OK?

As an experimental feature (named "USER-Events"), you can provide your own "user" events. Just name them anything starting with "USER-" and you're set. Dispatch them through bubbling, capturing, or whatever takes your fancy. One important thing you can't currently do is pass any data (like an object) with those events. Maybe later there will be a "UserEvent" interface letting you get some substantial use out of this mechanism even if you're not "inside" of a DOM package.
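Here is a rough sketch of dispatching such an event through the capture, target, and bubbling phases using the standard DocumentEvent and EventTarget interfaces. The event name "USER-ping" and the class name UserEventExample are invented for illustration. The example runs against the JAXP default DOM as a stand-in, which treats the name as an ordinary event type, so the "USER-" prefix carries no special meaning there.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.events.DocumentEvent;
import org.w3c.dom.events.Event;
import org.w3c.dom.events.EventListener;
import org.w3c.dom.events.EventTarget;

final class UserEventExample {
    static String phaseName(short phase) {
        switch (phase) {
            case Event.CAPTURING_PHASE: return "capture";
            case Event.AT_TARGET:       return "target";
            case Event.BUBBLING_PHASE:  return "bubble";
            default:                    return "unknown";
        }
    }

    /** Fires a "USER-ping" event at a leaf and records who saw it, and when. */
    static List<String> fire() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("root");
        Element leaf = doc.createElement("leaf");
        doc.appendChild(root);
        root.appendChild(leaf);

        List<String> log = new ArrayList<>();
        EventListener listener = evt -> {
            Node current = (Node) evt.getCurrentTarget();
            log.add(current.getNodeName() + ":" + phaseName(evt.getEventPhase()));
        };
        // Assumption: nodes of this implementation can be cast to EventTarget.
        ((EventTarget) root).addEventListener("USER-ping", listener, true);  // capturing
        ((EventTarget) root).addEventListener("USER-ping", listener, false); // bubbling
        ((EventTarget) leaf).addEventListener("USER-ping", listener, false); // at target

        Event ping = ((DocumentEvent) doc).createEvent("Events");
        ping.initEvent("USER-ping", true, false); // bubbles, not cancelable
        ((EventTarget) leaf).dispatchEvent(ping);
        return log;
    }

    public static void main(String[] args) {
        fire().forEach(System.out::println);
    }
}
```

Note that the Event object carries only its type and flags, which matches the limitation described above: there is no standard slot for application data.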

You can create and send HTML events. Ditto UIEvents. Since DOM doesn't require a UI, it's the UI's job to send them; perhaps that's part of your application.

This package may be built without the ability to report mutation events, gaining a significant speedup in DOM construction time. However, if that is done then certain other features -- notably node iterators and getElementsByTagName -- will not be available.

Optional Feature: "Traversal"

Each DOM node has all you need to walk to everything connected to that node. Lightweight, efficient utilities are easily layered on top of just the core APIs.
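A minimal sketch of such a utility, using nothing beyond getFirstChild and getNextSibling from the core Node interface. The class name CoreWalk and the sample document are made up for the example, and the JAXP default parser is used only to get a tree to walk.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class CoreWalk {
    /** Depth-first, document-order walk that records element names. */
    static void collectElementNames(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectElementNames(child, out);
        }
    }

    /** Parses a string into a DOM tree with the JAXP default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        collectElementNames(parse("<a><b><c/></b><d/></a>"), names);
        System.out.println(names);
    }
}
```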

Traversal APIs are an optional part of DOM Level 2, providing a not-so-lightweight way to walk over DOM trees, if your application didn't already have such utilities for use with data represented via DOM. Implementing this helped debug the (optional) event and mutation event subsystems, so it's provided here.
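For comparison, the same kind of walk through the Level 2 Traversal API looks roughly like this. The class name IteratorWalk is hypothetical, and the sketch assumes the Document object can be cast to DocumentTraversal, which holds for the JAXP default DOM used here as a stand-in and should be checked with hasFeature("Traversal", "2.0") elsewhere.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

final class IteratorWalk {
    /** Lists element names in document order using a NodeIterator. */
    static List<String> elementNames(Document doc) {
        // Assumption: this implementation supports the "Traversal" feature.
        DocumentTraversal traversal = (DocumentTraversal) doc;
        NodeIterator it = traversal.createNodeIterator(
                doc.getDocumentElement(),  // root of the iteration
                NodeFilter.SHOW_ELEMENT,   // only element nodes
                null,                      // no custom filter
                false);                    // don't expand entity references
        List<String> names = new ArrayList<>();
        for (Node n = it.nextNode(); n != null; n = it.nextNode()) {
            names.add(n.getNodeName());
        }
        it.detach(); // release the iterator once finished
        return names;
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(elementNames(parse("<a><b><c/></b><d/></a>")));
    }
}
```

Unlike the hand-rolled loop, a NodeIterator is expected to stay valid while the tree changes underneath it, which is why an implementation may need mutation tracking to support it.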

At this writing, the "TreeWalker" interface isn't implemented.

Creating "Sub-DOMs"

In much the way that a base class is able to create subclasses, it is possible to create a "Sub-DOM" from this implementation. This is a kind of "framework", consisting of this implementation and several subclasses of its base classes. Using the example of the HTML DOM, this might look like a package with:

A DomImplementation subclass claiming to implement the HTML feature, disallowing creation of DocumentType nodes, implementing the HTMLDOMImplementation interface, and returning instances of a custom HTMLDocument subclass;
The DomDocument class would be subclassed to implement more methods (see below on why this is generally a bad idea) and so that it wouldn't create CDATASection, ProcessingInstruction, or EntityReference nodes. Most importantly, when creating new elements, it would use the tag name to construct elements that implement HTMLElement or its custom interfaces, normalizing case, and reject any non-HTML nodes.
There would be almost sixty DomElement subclasses that implement the various specialized HTML DOM interfaces.

An XHTML DOM might look similar but would use the XHTML namespace to choose whether to use those custom subclasses, and would neither normalize the XHTML element and attribute names nor reject non-XHTML nodes. See the separate XHTML DOM package for one such layered implementation.

An SMIL, XSLT or MathML DOM could use that approach. When considering construction of such a customized DOM, some of the perceived benefits are that the custom subclasses can provide additional methods that are specialized to a particular application domain. Another is that they can embed knowledge about how to validate their contents, and use that to help prevent related application errors from reaching the outside world. For example, they might be able to report violations of validity rules. Some DOM implementations support a separate "DTD Compiler" application, which can generate the customized subclasses and arrange to have a Document class use those.

Note that because an XML document is a container for typed data (and in fact for that data's type description), it is best to avoid relying on specialized methods on that container. Basically, if the container itself exposes type-specific operations, then you need to choose the container type before you know the document type; that clearly can't work in the typical case. The precedent in the HTML DOM is not a good one to follow.

DOM Functionality to Avoid

For what appears to be a combination of historical and "committee logic" reasons, DOM has a number of features which I strongly advise you to avoid using in your library and application code. These include the following types of DOM nodes; see the documentation for the implementation class for more information:

CDATASection (DomCDATA class) ... use normal Text nodes instead, so you don't have to make every algorithm recognize multiple types of character data
DocumentType (DomDocType class) ... if this held actual typing information, it might be useful
Entity (DomEntity class) ... neither parsed nor unparsed entities work well in DOM; it won't even tell you which attributes identify unparsed entities
EntityReference (DomEntityReference class) ... permitted implementation variances are extreme, all children are readonly, and these can interact poorly with namespaces
Notation (DomNotation class) ... only really usable with unparsed entities (which aren't well supported; see above) or perhaps with PIs after the DTD, not with NOTATION attributes

If you really need to use unparsed entities or notations, use SAX; it offers better support for all DTD-related functionality. It also exposes actual document typing information (such as element content models).
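As a rough illustration of the SAX route, the sketch below collects notation and unparsed entity declarations through the standard DTDHandler callbacks (inherited here from DefaultHandler). The class name DtdViaSax and the sample DTD are invented for the example. Content models are reported through a separate SAX2 extension handler that is not shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class DtdViaSax {
    /** Made-up document declaring one notation and one unparsed entity. */
    static final String SAMPLE =
            "<!DOCTYPE doc [\n"
          + "  <!ELEMENT doc EMPTY>\n"
          + "  <!NOTATION gif SYSTEM \"image/gif\">\n"
          + "  <!ENTITY logo SYSTEM \"logo.gif\" NDATA gif>\n"
          + "]>\n"
          + "<doc/>";

    /** Returns a description of each notation and unparsed entity declared. */
    static List<String> declarations(String xml) {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void notationDecl(String name, String publicId, String systemId) {
                seen.add("notation " + name);
            }

            @Override
            public void unparsedEntityDecl(String name, String publicId,
                                           String systemId, String notationName) {
                seen.add("unparsed entity " + name + " uses notation " + notationName);
            }
        };
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        declarations(SAMPLE).forEach(System.out::println);
    }
}
```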

Also, when accessing attribute values, use methods that provide their values as single strings, rather than those which expose value substructure (Text and EntityReference nodes). (See the DomAttr documentation for more information.)
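When the tree is built through JAXP, one way to follow this advice is to ask the parser not to produce the troublesome node types in the first place, and to read attributes with the string-valued getAttribute call. The sketch below is illustrative only: the class name PlainTextDom and the sample document are made up, and setCoalescing and setExpandEntityReferences are JAXP factory settings, not part of this package.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class PlainTextDom {
    /** Parses so that CDATA becomes ordinary text and entity references are expanded. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true);              // CDATA sections become Text
            factory.setExpandEntityReferences(true);  // no EntityReference nodes
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts CDATASection nodes among the direct children of parent. */
    static int countCdata(Node parent) {
        int count = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.CDATA_SECTION_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<note lang=\"en\">x<![CDATA[<y>]]>z</note>");
        Element note = doc.getDocumentElement();
        System.out.println(note.getTextContent());
        System.out.println(note.getAttribute("lang")); // value as a single string
        System.out.println(countCdata(note));
    }
}
```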

Note that many of these features were provided as partial support for editor functionality (including the incomplete DTD access). Full editor functionality requires access to potentially malformed lexical structure, at the level of unparsed tokens and below. Access at such levels is so complex that using it in non-editor applications sacrifices all the benefits of XML; editor applications need extremely specialized APIs.

(This isn't a slam against DTDs, note; only against the broken support for them in DOM. Even despite inclusion of some dubious SGML legacy features such as notations and unparsed entities, and the ongoing proliferation of alternative schema and validation tools, DTDs are still the most widely adopted tool to constrain XML document structure. Alternative schemes generally focus on data transfer style applications; open document architectures comparable to DocBook 4.0 don't yet exist in the schema world. Feel free to use DTDs; just don't expect DOM to help you.)