Write Better Modules with Quality DTDs
Consider four simple, powerful techniques for modeling
flexible and extensible XML Document Type Definitions
by Vijay Gummadi and Shantanu Dhar
August 2002 Issue
As more applications produce and consume XML documents, developers write more software modules or components that translate data to and from XML. But writing good modules is easier said than done. Static XML Document Type Definitions (DTDs) and inflexible modeling approaches increase the costs of reengineering and maintenance across each application that shares the component or module. As user requirements evolve and new functionality is added, the module's capabilities must be extended, which is not necessarily difficult or expensive if extensibility is designed in from the start. Moreover, a large variety of enterprise information systems are waiting to be integrated. Different data sources call for matching approaches for modeling DTDs. As projects adopt transcoding techniques, like the use of XSLT to transform XML documents for use by other systems, judicious DTD design makes for simpler and more manageable transformation files. (For a brief discussion of DTDs and schemas see the sidebar "DTDs and Schemas.")
Building high-quality DTDs makes applications more fault tolerant, allowing applications to survive changes to document schemas without requiring code changes. In other words, an application designed to consume a restrictive DTD is more likely to fail when it encounters documents conforming to a new version of that DTD. An application designed to consume a DTD written with software evolution in mind is less likely to fail under the same circumstance.
How can we make sure that we're getting the flexibility and extensibility we need? Four simple but powerful techniques for modeling flexible and extensible XML DTDs can improve processing and storage efficiency: early and late binding, elements and attributes, entities for extensibility, and containment and serialization. Each works better under certain circumstances. Let's see how we can use them to our best advantage.
Early and Late Binding
Not all situations require highly flexible DTDs. Consider this early bound DTD, in which the properties of a book, such as Style and Title, are already named and bound to data types and locations in the DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Book+)>
<!ELEMENT Book (Style, Title, Author+)>
<!ELEMENT Style (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (FirstName, LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
This DTD is therefore suitable in situations that do not demand adding, removing, or renaming book properties. This XML document is based on this early bound DTD:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
  <Book>
    <Style>Textbook</Style>
    <Title>Algebra I</Title>
    <Author>
      <FirstName>David</FirstName>
      <LastName>Best</LastName>
    </Author>
  </Book>
  <Book>
    <Style>Novel</Style>
    <Title>Oliver Twist</Title>
    <Author>
      <FirstName>Charles</FirstName>
      <LastName>Dickens</LastName>
    </Author>
  </Book>
</Books>
Early binding keeps a DTD easy to understand and makes for simple, compact XML documents. This simplicity allows easy downstream processing through parsing and procedural programming using Document Object Model (DOM) or Simple API for XML (SAX) interfaces (both are APIs for reading and/or manipulating XML documents) or XSLT. Because early binding is based on a predefined document structure and semantics, developers know exactly what to code toward. Further, early binding allows the DTD author to provide matching style sheets that can be used as templates for manipulation, thereby reducing development costs.
Now consider augmenting the XML document with new data: the price of each book. While this can be accomplished easily by modifying the DTD, it requires reprogramming downstream applications, even those that have no use for price data. In other words, evolving requirements translate into high application maintenance costs.
A solution to this problem is late binding, a DTD construction style in which property names are assigned dynamically:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Book+)>
<!ELEMENT Book (Property+, Author+)>
<!ELEMENT Author (Property+)>
<!ELEMENT Property (Name, Value)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Value (#PCDATA)>
Because applications that manipulate late bound XML are keyed to property names and not their relative location in the document, they require little or no reprogramming when new properties are added (see Listing 1). This flexibility could afford a dramatic reduction in maintenance cost because changes to an existing application are necessary only if it needs to use the new price data. Furthermore, late binding lends itself to parameterized programming, reducing the need for hard wiring property names in application code.
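To make the point about keying on property names concrete, here is a minimal Python sketch. Listing 1 is not reproduced here, so the sample document, the added Price property and its value, and the helper name properties_of are all assumptions made for illustration; the document simply follows the late bound DTD above.

```python
import xml.etree.ElementTree as ET

# Hypothetical late bound document; the Price property is an assumed addition.
LATE_BOUND_XML = """\
<Books>
  <Book>
    <Property><Name>Style</Name><Value>Textbook</Value></Property>
    <Property><Name>Title</Name><Value>Algebra I</Value></Property>
    <Property><Name>Price</Name><Value>49.95</Value></Property>
    <Author>
      <Property><Name>FirstName</Name><Value>David</Value></Property>
      <Property><Name>LastName</Name><Value>Best</Value></Property>
    </Author>
  </Book>
</Books>
"""


def properties_of(element):
    """Collect the direct Property children of an element into a dict."""
    return {
        prop.findtext("Name"): prop.findtext("Value")
        for prop in element.findall("Property")
    }


root = ET.fromstring(LATE_BOUND_XML)

# Code that only cares about titles looks them up by name, so the extra
# Price property does not disturb it.
titles = [properties_of(book).get("Title") for book in root.findall("Book")]
print(titles)
```

Because the lookup goes through the property name rather than a fixed position, the same code would keep working if further properties were added to Book.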
Both binding styles have their uses. Early binding works fine for static data models where change is not anticipated, while late binding is best suited to dynamic, evolving situations.
Elements and Attributes
The benefit of late binding is flexibility and extensibility. The cost is complexity and document bloat, both of which impose a processing penalty. Documents resulting from late binding require additional processing. Because you don't know the exact position of data, the code must search for the existence of specific names and values. DOM-based processing is based on reading the entire document into memory; hence, it does not work well for large documents. We need a way to manage document size, because late binding results in larger documents.
The use of XML attributes is an effective fix for the bloat problem. You can see how attributes may be used to reduce document size while retaining the flexibility of late binding. Here is the DTD:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Book+)>
<!ELEMENT Book (Property+, Author)>
<!ELEMENT Author (Property+)>
<!ELEMENT Property EMPTY>
<!ATTLIST Property
Name CDATA #REQUIRED
Value CDATA #REQUIRED
>
And here is an example of an XML document based on the XML attributes DTD:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
  <Book>
    <Property Name="Style" Value="Book"/>
    <Property Name="Title" Value="Algebra I"/>
    <Author>
      <Property Name="FirstName" Value="David"/>
      <Property Name="LastName" Value="Best"/>
    </Author>
  </Book>
</Books>
When there is a need for robust validation of data types, attributes work better. Attributes can also be used effectively to simplify application programming and optimize processing load, while providing extensibility. Consider a situation in which an application parses an XML document containing categories of books and lists them. Here is an inflexible DTD for the task:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Periodical, Fiction)>
<!ELEMENT Book (#PCDATA)>
<!ATTLIST Book category CDATA #REQUIRED>
<!ELEMENT Periodical (Book+)>
<!ELEMENT Fiction (Book+)>
and an example of an XML document based on it:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
  <Periodical>
    <Book category="Technical">Advanced Manufacturing</Book>
    <Book category="Fashion">Vogue</Book>
  </Periodical>
  <Fiction>
    <Book category="Novel">Oliver Twist</Book>
    <Book category="Play">Death of a Salesman</Book>
  </Fiction>
</Books>
This DTD has two problems. The application must traverse two separate lists, Periodical and Fiction, which requires more programming effort and fails if a third list is added. Also, the DTD is not extensible; adding a new type beyond Periodical and Fiction requires major changes to the DTD that will break existing applications.
The XML attribute provides a simple and extensible solution for the problem. By wrapping both type and category data in an attribute list (ATTLIST), we can do away with multiple lists, provide much-needed extensibility, and reduce application development effort:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Book+)>
<!ELEMENT Book (#PCDATA)>
<!ATTLIST Book
category CDATA #REQUIRED
type CDATA #REQUIRED>
Attributes are always associated with elements; in XML documents they do not exist on their own. Here is an XML document based on this DTD:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
  <Book category="Technical" type="Periodical">Advanced Manufacturing</Book>
  <Book category="Novel" type="Fiction">Oliver Twist</Book>
  <Book category="Fashion" type="Periodical">Vogue</Book>
  <Book category="Play" type="Fiction">Death of a Salesman</Book>
  <Book category="Literature" type="Novel">Catch 22</Book>
</Books>
The data contained in an attribute can be referenced only through the element that uses it. An effective rule for modeling attributes is to use them to describe an object modeled as an element rather than hold data that is core to the object. This approach offers flexibility—new data can be introduced through attributes without upsetting element structure. Also, attributes are an especially good fit in situations where validation of data type is a requirement.
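As a rough illustration of how the single flat list reduces application effort, the following Python sketch groups titles by the type attribute in one pass. The function name titles_by_type is hypothetical, and the sample data is a subset of the document above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

BOOKS_XML = """\
<Books>
  <Book category="Technical" type="Periodical">Advanced Manufacturing</Book>
  <Book category="Novel" type="Fiction">Oliver Twist</Book>
  <Book category="Fashion" type="Periodical">Vogue</Book>
  <Book category="Play" type="Fiction">Death of a Salesman</Book>
</Books>
"""


def titles_by_type(xml_text):
    """Group book titles by their type attribute in a single traversal."""
    groups = defaultdict(list)
    for book in ET.fromstring(xml_text).findall("Book"):
        groups[book.get("type")].append((book.text or "").strip())
    return dict(groups)


grouped = titles_by_type(BOOKS_XML)
print(grouped)
```

A new type value simply produces a new key in the result; neither the loop nor the DTD has to change.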
Entities for Extensibility
XML-based standards are becoming common. Many industry groups are creating standard DTDs to represent data in their areas of interest. Organizations often choose to use these DTDs, but must augment them with data items specific to their business. A common mechanism for extending a DTD with few or no side effects is the parameter entity. The ENTITY construct allows new ELEMENTs to be added without disrupting the original DTD, so applications based on the original keep running without changes. All that is required is that the entity names that serve as "hooks" be part of the original DTD. In this DTD you can see how BookProperties is a mere placeholder:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Book+)>
<!ELEMENT Book (Style, Title, Author+, Price %BookProperties;)>
<!ELEMENT Style (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (FirstName, LastName)>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
<!ELEMENT Price (#PCDATA)>
%BookProperties;
You can see how it's described in its entirety in the corresponding XML document (see Listing 2).
Another application of the ENTITY construct is attribute reuse. If more than one element uses a common set of attributes, they can be wrapped in an entity, which in turn is referenced by the elements. Here's a simple example:
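<!-- Declare the parameter entity before it is referenced -->
<!ENTITY % ReusedAttrs
  'id ID #IMPLIED
  description CDATA #IMPLIED'>
<!-- Mixed content must take the form (#PCDATA | name)* -->
<!ELEMENT Book (#PCDATA | Author)*>
<!ATTLIST Book
  %ReusedAttrs;>
<!ELEMENT Author (#PCDATA)>
<!ATTLIST Author
  %ReusedAttrs;>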
This style forces the common naming of attributes where necessary, results in a compact DTD, and makes for efficiency and simplicity in the applications processing the resulting XML documents, such as:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
  <Book id="b1" description="A novel">
    <Author id="a2" description="19th Century Novelist">Charles Dickens</Author>
    Oliver Twist
  </Book>
</Books>
Containment and Serialization
The development of a DTD comprises two distinct activities: defining the data objects (ELEMENTs) and defining the relationships between them. One way to establish relationships in XML is through the mechanism of containment, in which one of two related objects (for example, Book and Author) contains the other, as in our example of the early bound DTD. Containment offers the advantage of simplicity, readability, and relative ease of application development. However, it tends to be inflexible and difficult to maintain. Further, instead of reusing data, containment-style XML duplicates it. The resulting redundancy could make highly repetitive data sets unattractive for storage and archiving. Serialization addresses this weakness.
The serialization mechanism is the exact opposite of containment in that elements do not encapsulate other elements:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Books holds a flat list of Book, Author, and String elements -->
<!ELEMENT Books (Book | Author | String)+>
<!ATTLIST Books
  id ID #REQUIRED>
<!ELEMENT Book EMPTY>
<!ATTLIST Book
  id ID #REQUIRED
  Title-ref IDREF #REQUIRED
  Author IDREF #REQUIRED>
<!ELEMENT Author EMPTY>
<!ATTLIST Author
  id ID #REQUIRED
  Name-ref IDREF #REQUIRED>
<!ELEMENT String (#PCDATA)>
<!ATTLIST String
  id ID #REQUIRED>
Instead, they refer to each other through unique identifiers. This approach is called serialization because it is a simple way to serialize objects in an application at one end of a transaction and read them into another application at the other. Serialization offers storage efficiency and compactness for XML data that has multiple references between its elements, or if certain elements are referenced—that is, used by many other elements:
<?xml version="1.0" encoding="UTF-8"?>
<Books id="b1">
  <Book id="r1" Title-ref="r2" Author="r3"/>
  <Author id="r3" Name-ref="r5"/>
  <String id="r2">Algebra I</String>
  <String id="r5">David Best</String>
</Books>
Another benefit of this mechanism is the ease with which serialized documents can be processed. It is straightforward to implement a pure serialization policy in parsers by treating elements as objects and using the identifiers to link them, thereby recreating in memory the tree or graph structure represented in the document. This mechanism fits very well with object-oriented applications developed in Java or C++. A drawback of serialization is that the generated XML is hard for humans to read.
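The linking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the parser used here does not read the DTD, so the code treats any attribute named id as the identifier, and the names by_id and resolve_book are hypothetical.

```python
import xml.etree.ElementTree as ET

SERIALIZED_XML = """\
<Books id="b1">
  <Book id="r1" Title-ref="r2" Author="r3"/>
  <Author id="r3" Name-ref="r5"/>
  <String id="r2">Algebra I</String>
  <String id="r5">David Best</String>
</Books>
"""

root = ET.fromstring(SERIALIZED_XML)

# Index every element that carries an id so each reference is one lookup.
# Assumption: the attribute named "id" plays the role of the DTD's ID type.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve_book(book):
    """Follow the reference attributes of a Book to its title and author name."""
    title = by_id[book.get("Title-ref")].text.strip()
    author = by_id[book.get("Author")]
    name = by_id[author.get("Name-ref")].text.strip()
    return {"title": title, "author": name}


books = [resolve_book(book) for book in root.findall("Book")]
print(books)
```

The dictionary of identifiers stands in for the in-memory object graph: once it is built, the order of elements in the document no longer matters.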
Some modeling situations can be addressed efficiently by a hybrid of the two mechanisms. Consider a situation in which many objects have a relationship with the same object:
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT Books (Book+)>
<!ELEMENT Book (Title, Author+)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Author (FirstName?, LastName?)>
<!ATTLIST Author
  A-Id ID #REQUIRED
  A-Ref IDREF #IMPLIED>
<!ELEMENT FirstName (#PCDATA)>
<!ELEMENT LastName (#PCDATA)>
In this case, using containment results in repetition, which causes document bloat.
The problem can be solved readily by serializing those objects that could be referenced multiple times, while employing containment for other objects, as you see here:
<?xml version="1.0" encoding="UTF-8"?>
<Books>
  <Book>
    <Title>Oliver Twist</Title>
    <Author A-Id="a1">
      <FirstName>Charles</FirstName>
      <LastName>Dickens</LastName>
    </Author>
  </Book>
  <Book>
    <Title>Tale of Two Cities</Title>
    <Author A-Id="a2" A-Ref="a1"/>
  </Book>
</Books>
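One way an application might consume the hybrid document is sketched below in Python. It is a sketch under assumptions, not a prescribed approach: the helper author_name is hypothetical, and it follows only a single level of A-Ref indirection.

```python
import xml.etree.ElementTree as ET

HYBRID_XML = """\
<Books>
  <Book>
    <Title>Oliver Twist</Title>
    <Author A-Id="a1">
      <FirstName>Charles</FirstName>
      <LastName>Dickens</LastName>
    </Author>
  </Book>
  <Book>
    <Title>Tale of Two Cities</Title>
    <Author A-Id="a2" A-Ref="a1"/>
  </Book>
</Books>
"""

root = ET.fromstring(HYBRID_XML)

# Index all Author elements by their A-Id so references can be resolved.
authors = {a.get("A-Id"): a for a in root.iter("Author")}


def author_name(author):
    """Return the full name, following A-Ref when the author is a reference."""
    ref = author.get("A-Ref")
    target = authors[ref] if ref is not None else author
    return f'{target.findtext("FirstName")} {target.findtext("LastName")}'


names = {
    book.findtext("Title"): [author_name(a) for a in book.findall("Author")]
    for book in root.findall("Book")
}
print(names)
```

Titles stay contained in their Book and are read directly, while the shared author details are stored once and reached through the reference.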
More and more application integration efforts use XML. Native XML databases soon will compare favorably with relational databases, if they don't already. These trends tell us that crafting DTDs and XML schemas will be important. The flexibility, extensibility, and storage efficiency of these models will have a big impact on the long-term development, maintenance, and operational costs of integrated systems.
Choose a Technique
These four simple but powerful techniques for modeling flexible and extensible XML DTDs can improve the quality of XML document processing. Each of these techniques works best in certain situations. Force fitting a technique may cause more harm than good by increasing complexity, degrading performance, or raising storage requirements.
Choosing between early and late binding styles can be based on the degree of flexibility desired. In some cases, the choice between elements and attributes is clear. Attributes make for compact documents and help reduce storage requirements. Despite the complexity they add, attributes are useful in situations that demand validation of data types.
Entities are best used to extend DTDs and promote reuse of data elements. While containment offers a simple, easy-to-read structure, serialization offers the benefit of efficient processing and, in some cases, compact storage. In combination, these techniques are versatile enough to provide solutions to a variety of modeling problems and should form the core of any XML modeler's toolkit.
About the Authors
Vijay Gummadi is vice president, product development for Campfire Interactive in Ann Arbor, MI; has extensive experience in architecting enterprise-scale software solutions for the manufacturing sector; and actively pursues the application of pioneering concepts and technology to solving real-world problems at individual, organizational, and community levels. Shantanu Dhar is a principal consultant with Altarum in Ann Arbor, MI, specializing in IT strategy and interoperability, and spends a lot of his time working with the automotive supply chain.