Published on ONJava.com (http://www.onjava.com/)

O'Reilly Book Excerpts: Learning Java, 2nd Edition

XML Basics for Java Developers, Part 4

by Patrick Niemeyer and Jonathan Knudsen

In part four of a series of book excerpts on XML basics for Java developers from
Learning Java, 2nd Edition, learn about validating documents.

Related Reading

Learning Java
By Patrick Niemeyer, Jonathan Knudsen

Validating Documents

"Words, words, mere words, no matter from the heart."
William Shakespeare, Troilus and Cressida

In this section, we talk about DTDs and XML Schema, two ways to enforce rules an XML document must follow. A DTD is a grammar for an XML document, defining which tags may appear where and in what order, with what attributes, etc. XML Schema is the next generation of DTD. With XML Schema, you can describe the data content of the document in terms of primitives such as numbers, dates, and simple regular expressions. The word schema means a blueprint or plan for structure, so we'll refer to DTDs and XML Schema collectively as schema where either applies.

Now for a reality check. Unfortunately, Java support for XML Schema isn't entirely mature at the time of this writing. XML support in Java 1.4.0 is based on the Apache Project's Crimson parser (which in turn is based on Sun's "Project X" parser). The Crimson engine doesn't support XML Schema. However, a future release of Java will migrate the XML implementation to the Apache Xerces2 engine, and at that time, XML Schema should begin to be supported.

Using Document Validation

XML's validation of documents is a key piece of what makes it useful as a data format. Using a schema is somewhat analogous to the way Java classes enforce type checking in the language. Schema define document types. Documents conforming to a given schema are often referred to as instance documents.

This type safety provides a layer of protection that eliminates having to write complex error-checking code. However, validation may not be necessary in every environment. For example, when the same tool generates XML and reads it back, validation should not be necessary in normal operation. It is invaluable, though, during development. Often, document validation is used during development and turned off in production environments.


The Document Type Definition language is fairly simple. A DTD is primarily a set of special tags that define each element in the document and, for complex types, provide a list of the elements it may contain. The DTD <!ELEMENT> tag consists of the name of the tag and either a special keyword for the data type or a parenthesized list of elements.

<!ELEMENT Document ( Head, Body )>

The special identifier #PCDATA indicates character data (a string). When a list is provided, the elements are expected to appear in that order. The list may contain sublists, and alternatives may be expressed using a vertical bar (|) as an OR operator. Special notation can also be used to indicate how many of each item may appear; a few examples of this notation are shown in Table 23-2.

Table 23-2. DTD notation defining occurrences

Character    Meaning
*            Zero or more occurrences
?            Zero or one occurrences
+            One or more occurrences
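
To see the occurrence notation in action, here is a minimal sketch that validates small documents against a hypothetical content model, ( Head?, Body+ ), built on the Document example above. The class name OccurrenceDemo, the method conforms, and the inline DTD are assumptions made for illustration and do not come from the book's examples.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class OccurrenceDemo {
    // Hypothetical DTD, kept in an internal subset so no external file is needed:
    // Head may appear zero or one time, Body must appear one or more times.
    static final String DOCTYPE =
        "<!DOCTYPE Document [\n"
      + "  <!ELEMENT Document ( Head?, Body+ )>\n"
      + "  <!ELEMENT Head ( #PCDATA )>\n"
      + "  <!ELEMENT Body ( #PCDATA )>\n"
      + "]>\n";

    // Returns true if the given document body satisfies the DTD above.
    static boolean conforms(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        final boolean[] valid = { true };
        // Validity violations are reported through error(); record them
        // instead of letting the default handler print them.
        builder.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                valid[0] = false;
            }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return valid[0];
    }

    public static void main(String[] args) throws Exception {
        // Head omitted: allowed by '?'
        System.out.println(conforms("<Document><Body>text</Body></Document>"));
        // Body omitted: rejected by '+'
        System.out.println(conforms("<Document><Head>title</Head></Document>"));
    }
}
```

The first call should report a conforming document and the second a nonconforming one, since ? permits a missing Head while + demands at least one Body.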

Attributes of an element are defined with the <!ATTLIST> tag. This tag enables the DTD to enforce rules about attributes. It accepts a list of identifiers and a default value:

<!ATTLIST Animal class (unknown | mammal | reptile) "unknown">

This ATTLIST says that the Animal element has a class attribute that can have one of three values: unknown, mammal, or reptile. The default is unknown.
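
As a minimal sketch of the default value at work, the following program parses a one-element document whose internal DTD subset carries the ATTLIST above. The class name AttributeDefaultDemo, the method animalClass, and the inline document are assumptions made for illustration. With the parser bundled in the JDK, the class attribute should read unknown when the document omits it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaultDemo {
    // Hypothetical document: the Animal element does not specify a class attribute.
    static final String XML =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE Animal [\n"
      + "  <!ELEMENT Animal EMPTY>\n"
      + "  <!ATTLIST Animal class (unknown | mammal | reptile) \"unknown\">\n"
      + "]>\n"
      + "<Animal/>";

    // Parses the document and returns the value of the root element's class attribute.
    static String animalClass(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("class");
    }

    public static void main(String[] args) throws Exception {
        // The parser supplies the default declared in the ATTLIST.
        System.out.println(animalClass(XML));
        // An explicit value in the document takes precedence over the default.
        System.out.println(
            animalClass(XML.replace("<Animal/>", "<Animal class=\"mammal\"/>")));
    }
}
```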

We won't cover everything you can do with DTDs here. But the following example will guarantee zooinventory.xml follows the format we've described. Place the following in a file called zooinventory.dtd (or grab this file from the CD-ROM or web site for the book):

<!ELEMENT Inventory ( Animal* )>
<!ELEMENT Animal (Name, Species, Habitat, (Food | FoodRecipe), Temperament)>
<!ATTLIST Animal class (unknown | mammal | reptile) "unknown">
<!ELEMENT Name ( #PCDATA )>
<!ELEMENT Species ( #PCDATA )>
<!ELEMENT Habitat ( #PCDATA )>
<!ELEMENT Food ( #PCDATA )>
<!ELEMENT FoodRecipe ( Name, Ingredient+ )>
<!ELEMENT Ingredient ( #PCDATA )>
<!ELEMENT Temperament ( #PCDATA )>

The DTD says that an Inventory consists of any number of Animal elements. An Animal has Name, Species, and Habitat elements, followed by either a Food or a FoodRecipe element, and then a Temperament element. FoodRecipe's structure is defined further down: a Name followed by one or more Ingredient elements.

To use our DTD, we must associate it with the XML document. We do this by placing a DOCTYPE declaration in the XML itself. When a validating parser encounters the DOCTYPE, it attempts to load the DTD and validate the document. There are several forms the DOCTYPE can have, but the one we'll use is:

<!DOCTYPE Inventory SYSTEM "zooinventory.dtd">

Both SAX and DOM parsers can automatically validate documents that contain a DOCTYPE declaration. However, you have to explicitly ask the parser factory to provide a parser that is capable of validation. To do this, set the validating property of the parser factory to true before you ask it for an instance of the parser. For example:

SAXParserFactory factory = SAXParserFactory.newInstance( );
// Must be set before asking the factory for a parser
factory.setValidating( true );

Try inserting the setValidating( ) line in our model builder example at the location indicated above. Now abuse the zooinventory.xml file by adding or removing an element or attribute and see what happens when you run the example.

To really use the validation, we would have to register an org.xml.sax.ErrorHandler object with the parser, but by default Java installs one that simply prints the errors for us.
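
Registering your own handler might look like the following sketch, which collects validation problems in a list instead of printing them. The class name CollectingValidator, the method validate, and the cut-down inline DTD are assumptions made for illustration. Only the SAX calls themselves (setValidating, getXMLReader, setErrorHandler, parse) are standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class CollectingValidator {
    // Hypothetical, cut-down version of the zoo inventory DTD as an internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE Inventory [\n"
      + "  <!ELEMENT Inventory ( Animal* )>\n"
      + "  <!ELEMENT Animal ( Name )>\n"
      + "  <!ELEMENT Name ( #PCDATA )>\n"
      + "]>\n";

    // Parses the document with validation on and returns every problem reported.
    static List<String> validate(String body) throws Exception {
        final List<String> problems = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        reader.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                problems.add("warning, line " + e.getLineNumber() + ": " + e.getMessage());
            }
            @Override
            public void error(SAXParseException e) {
                // Validity violations arrive here; parsing continues afterward.
                problems.add("error, line " + e.getLineNumber() + ": " + e.getMessage());
            }
            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                // The document is not well-formed, so give up.
                throw e;
            }
        });

        reader.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        // Species is not declared in the DTD, and Animal is missing its Name.
        String bad = "<Inventory><Animal><Species>Giant Panda</Species></Animal></Inventory>";
        for (String problem : validate(bad)) {
            System.out.println(problem);
        }
    }
}
```

Collecting the messages rather than throwing on the first one lets the caller decide what to do with an invalid document, such as rejecting it outright or logging the problems and carrying on.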

XML Schema

Although DTDs can define the basic structure of an XML document, they can't adequately describe data and validate it programmatically. The evolving XML Schema standard is the next logical step and should replace DTDs in the near future. For more information about XML Schema, see http://www.w3.org/XML/Schema. As mentioned earlier, we expect an upcoming Java release to support XML Schema.
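
For readers on later Java releases, the javax.xml.validation package (added in Java 5) does provide XML Schema validation. The following is a minimal sketch of the kind of data typing a DTD cannot express. The class name SchemaCheck, the method isValid, and the one-element Temperature schema are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class SchemaCheck {
    // Hypothetical schema: a Temperature element whose content must be a decimal number.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">\n"
      + "  <xs:element name=\"Temperature\" type=\"xs:decimal\"/>\n"
      + "</xs:schema>";

    // Returns true if the document validates against the schema above.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<Temperature>98.6</Temperature>"));
        System.out.println(isValid("<Temperature>warm</Temperature>"));
    }
}
```

A DTD could only declare Temperature as #PCDATA and would accept both documents. The schema should reject the second because warm is not a decimal number.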

JAXB and Code Generation

The ultimate goal of XML will be reached by automated binding of XML to Java classes. There are several tools today that provide this, but they are hampered by the slow adoption of XML Schema.

The standard Java solution is the forthcoming Java Architecture for XML Binding (JAXB) project. Unfortunately, at the time of this writing, JAXB is not mature. It is difficult to use and doesn't support XML Schema (necessary to fully describe document content). JAXB also requires its own "binding" language to be used, even for simple cases. We hope that the final release of JAXB will provide a good solution for XML binding. You can find information about JAXB at http://java.sun.com/xml/jaxb.

Unlike JAXB, Castor, an open source XML binding framework for Java, works with XML Schema and is relatively easy to use. Unfortunately, at the time of this writing, Castor doesn't support DTDs, and most industry- or task-specific XML standards are still written in terms of DTDs. You can find out more about Castor at http://www.castor.org/.

In the next installment, we conclude this book excerpt series with an introduction to XSL/XSLT and Web services.

Copyright © 2009 O'Reilly Media, Inc.