This package contains Ælfred2, which includes an
enhanced SAX2-compatible version of the Ælfred
non-validating XML parser and a modular (and hence optional)
DTD validating parser. Use them like any other SAX2 parsers.
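As a minimal sketch of what "like any other SAX2 parser" means in practice, the following drives a SAX2 XMLReader with a content handler. It uses the JDK's default parser as a stand-in so that it runs without this package installed; the class name Sax2Usage and the method elementNames are hypothetical, and how an Ælfred2 reader is instantiated is not shown here.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of this package.
final class Sax2Usage {
    /** Parses the document and returns element names in document order. */
    static List<String> elementNames(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            // Assumption: the JDK's bundled SAX2 parser stands in here.
            // Any other SAX2 XMLReader would be driven the same way.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add(qName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [config, item, item]
        System.out.println(elementNames("<config><item/><item/></config>"));
    }
}
```

Only the line that obtains the XMLReader depends on the parser; the handler code is plain SAX2.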
Some of the documentation below was modified from the original
Ælfred README.txt file. All of it has been updated.
Ælfred is a Java-based XML parser originally from
Microstar Software Limited (no longer in existence) and
more or less placed into the public domain.
In most Java applets and applications, XML should not be the central
feature; instead, XML is the means to another end, such as loading
configuration information, reading meta-data, or parsing transactions.
When an XML parser is only a single component of a much larger
program, it cannot be large, slow, or resource-intensive. With Java
applets, in particular, code size is a significant issue. The standard
modem is still not operating at 56 Kbaud, or sometimes even with data
compression. Assuming an uncompressed 28.8 Kbaud modem, only about
3 KBytes can be downloaded in one second; compression often doubles
that speed, but a V.90 modem may not provide another doubling. When
used with embedded processors, similar size concerns apply.
Ælfred is designed for easy and efficient use over the Internet,
based on the following principles:
- Ælfred must be as small as possible, so that it doesn't add too
much to an applet's download time.
- Ælfred must use as few class files as possible, to minimize the
number of HTTP connections necessary. (The use of JAR files has made this
less of a concern.)
- Ælfred must be compatible with most or all Java implementations
and platforms. (Write once, run anywhere.)
- Ælfred must use as little memory as possible, so that it does
not take away resources from the rest of your program. (It doesn't force
you to use DOM or a similar costly data structure API.)
- Ælfred must run as fast as possible, so that it does not slow down
the rest of your program.
- Ælfred must produce correct output for well-formed and valid
documents, but need not reject every document that is not valid or
not well-formed. (In Ælfred2, correctness was a bigger concern
than in the original version; and a validation option is available.)
- Ælfred must provide full internationalization from the first
release. (Ælfred2 now automatically handles all encodings
supported by the underlying JVM; previous versions handled only
UTF-8, UTF-16, ASCII, and ISO-8859-1.)
As you can see from this list, Ælfred is designed for production
use, but neither validation nor perfect conformance was a requirement.
Good validating parsers exist, including one in this package,
and you should use them as appropriate. (See conformance reviews
available at http://www.xml.com)
One of the main goals of Ælfred2 was to significantly improve
conformance, while not significantly affecting the other goals stated above.
Since the primary use of this parser is with SAX, some classes could be
removed, and so the overall size of Ælfred was actually reduced.
Subsequent performance work produced a notable speedup (over twenty
percent on larger files). That is, the tradeoffs between speed, size, and
conformance were re-targeted towards conformance and support of newer APIs
(SAX2), with a positive performance impact.
The role anticipated for this version of Ælfred is as a
lightweight Open Source SAX parser that can be used in essentially every
Java program where the handful of conformance violations (noted below)
are acceptable.
That certainly includes applets, and
nowadays one must also mention embedded systems as being even more
size-critical.
At this writing, all parsers that are more conformant are
significantly larger, even when counting the optional
validation support in
this version of Ælfred.
Ælfred the Great (AElfred in ASCII) was King of Wessex, and
some say King of England, at the time of his death in 899 AD.
Ælfred introduced a wide-spread literacy program in the hope that
his people would learn to read English, at least, if Latin was too
difficult for them. This Ælfred hopes to bring another sort of
literacy to Java, using XML, at least, if full SGML is too difficult.
The initial Æ ligature ("AE") is also a reminder that XML is
not limited to ASCII.
The Ælfred parser currently builds in support for a handful
of input encodings. Of course these include UTF-8 and UTF-16, which
all XML parsers are required to support:
- UTF-8 ... the standard eight bit encoding, used unless
you provide an encoding declaration.
- US-ASCII ... an extremely common seven bit encoding,
which happens to be a subset of UTF-8 and ISO-8859-1 as well
as many other encodings. XHTML web pages using US-ASCII
(without an encoding declaration) are probably more
widely interoperable than those in any other encoding.
- ISO-8859-1 ... includes accented characters used in
much of western Europe.
- UTF-16 ... with several variants, this encodes each
sixteen bit Unicode character in sixteen bits of output.
Variants include UTF-16BE (big endian, no byte order mark),
UTF-16LE (little endian, no byte order mark), and
ISO-10646-UCS-2 (an older and less used encoding, using a
version of Unicode without surrogate pairs).
- ISO-10646-UCS-4 ... a seldom-used four byte encoding,
with four different byte orderings. Some operating systems
standardized on UCS-4 despite its significant size penalty,
in anticipation that Unicode (even with surrogate pairs)
would eventually become limiting.
If you use any encoding other than UTF-8 or UTF-16 you should
make sure to label your data appropriately:
<?xml version="1.0" encoding="ISO-8859-1"?>
Encodings accessed through java.io.InputStreamReader
are now fully supported for both external labels (such as MIME types)
and internal types (as shown above).
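The two kinds of label can be sketched as follows. This is an illustration under stated assumptions, not this parser's own code: it uses the JDK's default SAX2 parser as a stand-in, and the class EncodingLabels and its methods are hypothetical names. The internal label is the encoding declaration read from a byte stream; the external label is modelled by the caller decoding the bytes with java.io.InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of this package.
final class EncodingLabels {
    /** Collects all character content reported for the given input. */
    static String textOf(InputSource source) {
        final StringBuilder text = new StringBuilder();
        try {
            // Assumption: the JDK's bundled SAX2 parser stands in here.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Content may arrive in several chunks, so accumulate.
                    text.append(ch, start, length);
                }
            });
            reader.parse(source);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return text.toString();
    }

    /** Internal label: the encoding declaration describes the raw bytes. */
    static String internalLabel() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>\u00C6lfred</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        return textOf(new InputSource(new ByteArrayInputStream(bytes)));
    }

    /** External label: the caller knows the encoding and decodes the bytes. */
    static String externalLabel() {
        byte[] bytes =
            "<name>\u00C6lfred</name>".getBytes(StandardCharsets.ISO_8859_1);
        return textOf(new InputSource(new InputStreamReader(
            new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)));
    }

    public static void main(String[] args) {
        System.out.println(internalLabel().equals(externalLabel()));
    }
}
```

In both cases the handler receives the same decoded characters; only the place where the encoding is stated differs.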
Known conformance issues should be of negligible importance for
most applications, and include:
- Rather than following the voluminous "Appendix A" rules about
what characters may appear in names (and name tokens), the Unicode
rules are used. This means that some names are inappropriately
rejected, and others are inappropriately accepted. (XML has some
very complicated rules in this area. It's much simpler
to avoid that much special case code, although there is at last
an open routine available that I'll trust to get it right.)
- Text containing "]]>" is not rejected unless it fully resides
in an internal buffer ... which is, thankfully, the typical case. This
text is illegal, but sometimes appears in illegal attempts to
nest CDATA sections. (Not catching that boundary condition
substantially simplifies parsing text.)
- Surrogate characters that aren't correctly paired are ignored
rather than rejected. (This simplifies parsing text; in any case,
no documents today should use Unicode surrogates, since at this writing
there are no formal assignments for those character codes.
Even JDK 1.3 doesn't handle such issues well.)
- Declarations following a reference to an undefined parameter
entity are not ignored. (Not maintaining and using state
about this validity error simplifies declaration handling.)
- Tab characters in PUBLIC identifiers are treated like ordinary
whitespace, rather than causing fatal errors. (Removing this
special case keeps the parser simpler.)
- Well-formedness constraints for general entity references
are not enforced. (The code to handle the "content" production
is merged with the element parsing code, making it hard to reuse
for this additional situation.)
When tested against the July 12, 1999 version of the OASIS
XML Conformance test suite, an earlier version passed 1057 of 1067 tests.
That contrasts with the original version, which passed 867. The
current parser is top-ranked in terms of conformance, as is its
validating sibling (which has some additional conformance violations
imposed on it by SAX2 API deficiencies as well as some of the more
curious SGML layering artifacts found in the XML specification).
As noted above, the original distribution was more or less
public domain. The license had the constraint that modifications
be clearly documented, as has been done here.
This version is Copyright (c) 1999-2000 by David Brownell,
and all the modifications are distributed under the GNU General
Public License (GPL).
As noted above, Microstar has not updated this parser since
the summer of 1998, when it released version 1.2a on its web site.
This release is intended to benefit the developer community by
refocusing the API on SAX2, and improving conformance to the extent
that most developers should not need to use another XML parser.
The code has been cleaned up (for example, the production numbers in
comments now refer to the XML 1.0 spec rather than some preliminary
draft) and has been sped up a bit as well.
The original version of Ælfred did not support the
SAX2 APIs.
This version supports the SAX2 APIs, exposing the standard
boolean feature descriptors. It supports the "DeclHandler" property
to provide access to all DTD declarations not already exposed
through the SAX1 API. The "LexicalHandler" property is supported,
except that entity references are hidden; this means you can see
things like comments and CDATA boundaries. SAX1 compatibility is
currently provided.
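The standard SAX2 identifiers for these handlers and features can be exercised as in the sketch below. It assumes the JDK's default SAX2 parser as a stand-in, and the class name Sax2Handlers is hypothetical; the property and feature URIs are the standard SAX2 ones. It records comments, CDATA boundaries, and element declarations, which are events not visible through the SAX1 API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical example class; not part of this package.
final class Sax2Handlers {
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String LEXICAL_HANDLER =
        "http://xml.org/sax/properties/lexical-handler";
    static final String DECL_HANDLER =
        "http://xml.org/sax/properties/declaration-handler";

    /** Returns a log of lexical and declaration events seen while parsing. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        // DefaultHandler2 implements both LexicalHandler and DeclHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                log.add("comment:" + new String(ch, start, length));
            }
            @Override
            public void startCDATA() { log.add("startCDATA"); }
            @Override
            public void endCDATA() { log.add("endCDATA"); }
            @Override
            public void elementDecl(String name, String model) {
                log.add("elementDecl:" + name);
            }
        };
        try {
            // Assumption: the JDK's bundled SAX2 parser stands in here.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true);
            reader.setContentHandler(handler);
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.setProperty(DECL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events(
            "<!DOCTYPE r [<!ELEMENT r (#PCDATA)>]>"
            + "<r><!--hi--><![CDATA[x]]></r>"));
    }
}
```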
The 'pipeline' package in this same software distribution contains an
XML validation component that validates
any full SAX2 event stream (including all document type declarations).
There is now a Validator class
which combines that component and this enhanced Ælfred parser, creating
a validating parser.
As noted in the documentation for that validating component, certain
validity constraints can't be tested. These include all those relying on
layering violations (exposing XML at the level of tokens or below,
required since XML isn't a context-free grammar), some that
SAX2 doesn't support, and a few others. The resulting validating
parser is conformant enough for most applications that aren't doing
strange SGML tricks with DTDs.
Moreover, that validating filter can be used without
a parser ... any application component that emits SAX event streams
can DTD-validate its output on demand.
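From the application's side, DTD validation over SAX2 looks roughly like the sketch below: validity errors arrive through ErrorHandler.error() as recoverable errors, and parsing continues. This assumes the JDK's validating parser as a stand-in rather than the pipeline component described above, and the class name DtdValidation is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of this package.
final class DtdValidation {
    /** Returns the messages of all validity errors reported for the document. */
    static List<String> validityErrors(String xml) {
        final List<String> errors = new ArrayList<>();
        try {
            // Assumption: the JDK's bundled validating parser stands in here.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity errors are recoverable: record and continue.
                    errors.add(e.getMessage());
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return errors;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE r [<!ELEMENT r (a)><!ELEMENT a EMPTY>]>";
        System.out.println(validityErrors(dtd + "<r><a/></r>").isEmpty());
        System.out.println(validityErrors(dtd + "<r><b/></r>").isEmpty());
    }
}
```

The second document is well formed but invalid, so it parses to completion while still producing error reports.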
You'll have noticed that the original version of Ælfred
had small size as a top goal. Ælfred2 normally includes a
DTD validation layer, but you can package without that.
Then the main added costs due to this revision are for
supporting the SAX2 API itself; DTD validation is as
cleanly layered as allowed by SAX2.
Bugs fixed in Ælfred2 include:
- Originally Ælfred didn't close file descriptors, which
led to file descriptor leakage on programs which ran for any
length of time.
- NOTATION declarations without system identifiers are
now handled correctly.
- DTD events are now reported for all invocations of a
given parser, not just the first one.
- More correct character handling:
- Rejects out-of-range characters, both in text and in
character references.
- Correctly handles character references that expand to
surrogate pairs.
- Correctly handles UTF-8 encodings of surrogate pairs.
- PUBLIC identifiers are now rejected if they have illegal
characters.
- The parser is more correct about what characters are allowed
in names and name tokens. Uses Unicode rules (built in to Java)
rather than the voluminous XML rules, although some extensions
have been made to match XML rules more closely.
- Line ends are now normalized to newlines in all known
cases.
- Certain validity errors were previously treated as
well-formedness violations.
- Repeated declarations of an element type are no
longer fatal errors.
- Undeclared parameter entity references are no longer
fatal errors.
- Attribute handling is improved:
- Whitespace must exist between attributes.
- Only one value for a given attribute is permitted.
- ATTLIST declarations don't need to declare attributes.
- Attribute values are normalized when required.
- Tabs in attribute values are normalized to spaces.
- Attribute values containing a literal "<" are rejected.
- More correct entity handling:
- Whitespace must precede NDATA when declaring unparsed
entities.
- Parameter entity declarations may not have NDATA annotations.
- The XML specification has a bug in that it doesn't specify
that certain contexts exist within which parameter entity
expansion must not be performed. Lacking an official erratum,
this parser now disables such expansion inside comments,
processing instructions, ignored sections, public identifiers,
and parts of entity declarations.
- Entity expansions that include quote characters no longer
confuse parsing of strings using such expansions.
- Whitespace in the values of internal entities is not mapped
to space characters.
- General Entity references in attribute defaults within the
DTD now cause fatal errors when the entity is not defined at the
time it is referenced.
- Malformed general entity references in entity declarations are
now detected.
- Neither conditional sections
nor parameter entity references within markup declarations
are permitted in the internal subset.
- Processing instructions whose target names are "XML"
(ignoring case) are now rejected.
- Comments may not include "--".
- Most "]]>" sequences in text are rejected.
- Correct syntax for standalone declarations is enforced.
- Setting a locale for diagnostics only produces an exception
if the language of that locale isn't English.
- Some more encoding names are recognized. These include the
Unicode 3.0 variants of UTF-16 (UTF-16BE, UTF-16LE) as well as
US-ASCII and a few commonly seen synonyms.
- Text (from character content, PIs, or comments) large enough
not to fit into internal buffers is now handled correctly even in
some cases which were originally handled incorrectly.
- Content is now reported for element types for which attributes
have been declared, but no content model is known. (Such documents
are invalid, but may still be well formed.)
Other bugs may also have been fixed.
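Several of the well-formedness rules in the list above can be checked with a small probe like the one sketched here. It assumes the JDK's default SAX2 parser as a stand-in, and the class name WellFormednessProbe is hypothetical; the same inputs could be fed to any SAX2 parser to compare behaviour.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of this package.
final class WellFormednessProbe {
    /** Returns false if the parser reports a fatal error for the document. */
    static boolean isWellFormed(String xml) {
        try {
            // Assumption: the JDK's bundled SAX2 parser stands in here.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // DefaultHandler.fatalError rethrows the exception it is given.
            reader.setErrorHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXParseException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("unexpected failure", e);
        }
    }

    public static void main(String[] args) {
        String[] rejected = {
            "<a b=\"1\"c=\"2\"/>",        // no whitespace between attributes
            "<a b=\"1\" b=\"2\"/>",       // attribute given two values
            "<a b=\"<\"/>",               // literal "<" in an attribute value
            "<a><?XML foo?></a>",         // reserved PI target
            "<a><!-- x -- y --></a>",     // "--" inside a comment
            "<a>]]></a>"                  // "]]>" in text
        };
        for (String doc : rejected) {
            System.out.println(isWellFormed(doc) + "  " + doc);
        }
        System.out.println(isWellFormed("<a b=\"1\" c=\"2\">ok</a>"));
    }
}
```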
For better overall validation support, some of the validity
constraints that can't be verified using the SAX2 event stream
are now reported directly by Ælfred2.