Merge XML Documents
Take advantage of the XMLDocument object to combine data files from multiple sources.
by Kathleen Dollard
August 2003 Issue
Technology Toolbox: VB.NET, XML
XML is becoming an increasingly common data store. One of XML's advantages is that you can work with your data as discrete files that might come from different sources. You might also want to merge XML files to create reports, store the output in a database, or display it in a user interface. I'll show you how to do an intimate merge between discrete XML data files. I call it an intimate merge because it not only adds new elements, but it updates individual elements' attributes as well. Along the way, you'll get a better understanding of the role of XML Schema Definition (XSD) schemas, how to validate an XMLDocument object as you load it, and the role of recursion in processing XML data.
You might use this merging process in the real world by taking data from several external sources, such as a database, output from an accounting package such as QuickBooks, and the results of a Web-based survey (see Figure 1). I'll simplify the example to use data from the Customers and Orders tables from the SQL Server sample Northwind database and merge it with a file that mimics the result of a Web-based satisfaction survey. You can extend the approach to whatever data you work with.
Merging the three sample files directly won't work, because their structures are entirely different. The database-extracted files hold information as individual elements, and the survey holds information in attributes. You must perform two steps to merge them: Use Extensible Stylesheet Language Transformations (XSLT) to put them in similar formats, then perform the merge. This article's sample code shows how to extract data from the database and perform the XSLT transformation. You can read more about XSLT in my article in the May 2003 issue.
XSD schemas articulate rules for the layout of XML, including which elements and attributes are legal. Merge processing works best in the real world if you validate the XML against a schema first. Otherwise, small errors in the document structure can lead to bizarre output documents. I used a single XSD to validate all three input files. This is convenient, because it lets you know that all of your input XML files conform to a single set of rules. You don't need to create a schema from scratch; instead, you can right-click on an XML document to use VS.NET's "Create Schema" feature. VS.NET attempts to build a combined schema when multiple XML files in a project have the same namespace. However, it's not particularly good at this and sometimes drops attributes, so you might need to finish up your schema with a bit of cut-and-paste.
Use a Consistent Namespace
All three input files use the same namespace because they contain the same logical elements. If the elements in the different documents have different namespaces, the merge can't find the necessary matches, and you can't perform an intimate merge. The purpose of a namespace is to make element names unique. I own the domain http://KADGen, so I use it to ensure my element names don't conflict with someone else's. When you work with your own data, your XSLT preprocessing step often fixes namespaces, as you can see in the sample XSLT code.
You merge two files at a time, with the second file merged into the first. Load each file as a validated XML document:
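' ns presumably holds the namespace URI that the input files share
Dim xmlBase As Xml.XmlDocument = _
   Tools.LoadValidXML( _
   "..\Customer.xml", _
   "..\Merge.xsd", ns)
Dim xmlMerge As Xml.XmlDocument = _
   Tools.LoadValidXML( _
   "..\Order.xml", _
   "..\Merge.xsd", ns)
' Merge the second document into the first
Tools.MergeRoot(xmlBase, xmlMerge)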
The Tools class contains the important code for the merge (see Listing 1). The LoadValidXML method uses an XMLTextReader to access data in the file. The XMLValidatingReader uses the XMLTextReader to read the data and validate that the structure of the data is consistent with the specified schema. XML data is validated as it's read, not when you simply open the XMLValidatingReader. In this case, it's read as the XMLDocument object loads.
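The loading step described here can be sketched roughly as follows. This is not the code from Listing 1; the module name and parameter names are assumptions made for illustration. It uses the .NET Framework 1.x XmlValidatingReader API, which later Framework versions mark obsolete in favor of XmlReader.Create with XmlReaderSettings.

```vbnet
Imports System.Xml
Imports System.Xml.Schema

Public Module LoadSketch
    ' Illustrative stand-in for Tools.LoadValidXML, not the Listing 1 code.
    Public Function LoadValidXML(ByVal xmlPath As String, _
                                 ByVal xsdPath As String, _
                                 ByVal ns As String) As XmlDocument
        ' The text reader parses the file; the validating reader wraps it.
        Dim textReader As New XmlTextReader(xmlPath)
        Dim validator As New XmlValidatingReader(textReader)
        Try
            validator.ValidationType = ValidationType.Schema
            ' Associate the schema file with the namespace it describes.
            validator.Schemas.Add(ns, xsdPath)

            Dim doc As New XmlDocument()
            ' Validation happens here, as the document pulls nodes
            ' through the reader. No ValidationEventHandler is attached,
            ' so an invalid document surfaces as an exception.
            doc.Load(validator)
            Return doc
        Finally
            ' Closing the validating reader also closes the text reader.
            validator.Close()
        End Try
    End Function
End Module
```

Letting the first validation error raise an exception keeps the sketch short; attaching a ValidationEventHandler instead would let you collect every error before deciding whether to continue.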
An XMLDocument holds your entire XML file in memory, lets you access any portion of the XML, and helps you maintain context. Context is your position within the document. The best analogy for context is navigating a DOS hierarchy. At any point in time, there's one current directory, and you can position yourself either relative to that directory or from the root. When you work with XML objects, the context is the current object you're working with, such as the outParent variable.
The MergeRoot method initiates the merge. It calls the recursive MergeNode method. Recursion is a fundamental programming tool in which a routine calls itself. Recursion is a natural fit when you work with hierarchies of an unknown depth, such as XML. The first time you run across a recursive algorithm, it challenges you to adopt a new way of thinking. The important part of a recursive algorithm is a clear end point, so that the recursion doesn't continue forever. Supplying a clear end point is easy when you work with XML, because you progress down a tree, processing each node until you arrive at the leaf nodes.
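To make the end point concrete, here is a small sketch of recursion over an XML tree that is unrelated to the merge itself. The function name CountLevels is hypothetical; it reports how many levels of elements sit at and below a node.

```vbnet
Imports System
Imports System.Xml

Public Module RecursionSketch
    ' Returns the number of element levels at and below the given node.
    Public Function CountLevels(ByVal node As XmlNode) As Integer
        Dim deepest As Integer = 0
        For Each child As XmlNode In node.ChildNodes
            ' Only descend into elements; skip text, comments, and so on.
            If child.NodeType = XmlNodeType.Element Then
                deepest = Math.Max(deepest, CountLevels(child))
            End If
        Next
        ' A leaf element has no child elements, so the loop never
        ' recurses for it and the function returns 1.
        Return deepest + 1
    End Function
End Module
```

Calling CountLevels on the root of a document such as `<a><b><c/></b><d/></a>` returns 3. The routine never needs to know the depth in advance, which is the property the merge relies on.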
MergeNode Looks for Matching Elements
MergeNode uses SelectSingleNode to look in the output document's current parent node for an element matching the current node in the merge document. It bases this match on the element name and on an ID attribute if one's available. The example uses a rule for the ID attribute's name: the name of the element followed by ID, such as CustomerID for an element named Customer. If MergeNode doesn't find a matching element, it adds the current node in the merge document as a child of the current output parent node.
If MergeNode finds a matching element, the method checks each attribute and adds any attributes that don't exist. If the attribute exists, the attribute value in the merge document overwrites the one in the output document. This behavior means that the order of the merge matters, and the last document wins any conflicts. The method calls itself for each child element in the merge document, passing the current matching output element as the new parent. This process continues down the hierarchy to any depth. Leaf nodes have no child elements, so the recursion has a clear end point.
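The matching and merging rules described above can be sketched as follows. This is an approximation written from the description, not the Listing 1 code. It assumes both documents share a non-empty namespace, the two root elements correspond to each other, attributes are unprefixed, and ID values contain no apostrophes. The prefix "m" and the module name are hypothetical.

```vbnet
Imports System.Xml

Public Module MergeSketch
    ' Arbitrary prefix; it only has to agree between AddNamespace and the XPath.
    Private Const Prefix As String = "m"

    Public Sub MergeRoot(ByVal xmlBase As XmlDocument, ByVal xmlMerge As XmlDocument)
        Dim nsMgr As New XmlNamespaceManager(xmlBase.NameTable)
        nsMgr.AddNamespace(Prefix, xmlBase.DocumentElement.NamespaceURI)
        ' Treat the two root elements as matching and merge their children.
        For Each child As XmlNode In xmlMerge.DocumentElement.ChildNodes
            If child.NodeType = XmlNodeType.Element Then
                MergeNode(xmlBase.DocumentElement, DirectCast(child, XmlElement), nsMgr)
            End If
        Next
    End Sub

    Public Sub MergeNode(ByVal outParent As XmlNode, _
                         ByVal mergeElem As XmlElement, _
                         ByVal nsMgr As XmlNamespaceManager)
        ' Match on element name, plus the <Name>ID attribute when present.
        Dim xpath As String = Prefix & ":" & mergeElem.LocalName
        Dim idName As String = mergeElem.LocalName & "ID"
        If mergeElem.HasAttribute(idName) Then
            xpath &= "[@" & idName & "='" & mergeElem.GetAttribute(idName) & "']"
        End If
        Dim match As XmlNode = outParent.SelectSingleNode(xpath, nsMgr)

        If match Is Nothing Then
            ' No match: copy the whole subtree into the output document.
            outParent.AppendChild(outParent.OwnerDocument.ImportNode(mergeElem, True))
            Return
        End If

        ' Match: add missing attributes and overwrite existing ones,
        ' so the merge document wins any conflict.
        Dim target As XmlElement = DirectCast(match, XmlElement)
        For Each attr As XmlAttribute In mergeElem.Attributes
            target.SetAttribute(attr.Name, attr.Value)
        Next

        ' Recurse into child elements; leaf elements end the recursion.
        For Each child As XmlNode In mergeElem.ChildNodes
            If child.NodeType = XmlNodeType.Element Then
                MergeNode(target, DirectCast(child, XmlElement), nsMgr)
            End If
        Next
    End Sub
End Module
```

ImportNode is needed because a node that belongs to one XmlDocument can't be appended directly to another. A production version would also need to escape the ID value before building the XPath expression.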
MergeNode uses an XMLNamespaceManager for the SelectSingleNode method. The input documents define a namespace, so you must supply it to SelectSingleNode and the similar SelectNodes method. A default namespace is one declared without a prefix, such as the namespace the elements in your input documents use. Select expressions don't honor default namespaces: an element name without a prefix in the expression matches only elements that have no namespace at all. Unfortunately, the XMLNamespaceManager doesn't raise an exception if you pass an empty string as the prefix. Instead, the select simply neglects to find anything. Avoid this problem by using a prefix both when you add the namespace to the XMLNamespaceManager and in the select expression. The namespace must match your input document, but the prefix can be anything as long as it's the same in the call to AddNamespace and in the select expression. This prefix can be unrelated to prefixes in your input document.
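A short sketch of the prefix technique follows. The element names Customers and Customer, the prefix "k", and the function name are assumptions chosen for illustration.

```vbnet
Imports System.Xml

Public Module NamespaceSketch
    ' Finds the first Customer element in a document whose elements
    ' live in the namespace ns (declared as a default namespace or not).
    Public Function FindCustomer(ByVal doc As XmlDocument, _
                                 ByVal ns As String) As XmlNode
        Dim nsMgr As New XmlNamespaceManager(doc.NameTable)
        ' "k" is arbitrary; the document itself may use no prefix at all.
        nsMgr.AddNamespace("k", ns)
        ' Every step in the expression carries the same prefix.
        Return doc.SelectSingleNode("/k:Customers/k:Customer", nsMgr)
    End Function
End Module
```

Against a document whose root is declared as `<Customers xmlns="urn:example">`, FindCustomer returns the element, while the unprefixed expression "/Customers/Customer" returns Nothing without any error.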
Your own utility functions can be extremely helpful when you work with XML. The two samples in Listing 1 save only a few lines of code, but you'll use these and similar functions many times in your programming. Some of these methods, such as the GetAttributeOrEmpty method, can help you avoid runtime exceptions. This simple code raises an exception if the attribute doesn't exist:
s = node.Attributes.GetNamedItem(attrName).Value
The GetAttributeOrEmpty method returns an empty string without raising an exception.
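A helper along these lines might look like the following sketch. It is written from the description rather than copied from Listing 1, so the signature and module name are assumptions.

```vbnet
Imports System.Xml

Public Module AttributeSketch
    ' Returns the attribute's value, or an empty string when the
    ' attribute (or the node itself) is missing.
    Public Function GetAttributeOrEmpty(ByVal node As XmlNode, _
                                        ByVal attrName As String) As String
        ' Attributes is Nothing for non-element nodes such as text or comments.
        If node Is Nothing OrElse node.Attributes Is Nothing Then
            Return String.Empty
        End If
        ' GetNamedItem returns Nothing when no attribute has that name.
        Dim attr As XmlNode = node.Attributes.GetNamedItem(attrName)
        If attr Is Nothing Then
            Return String.Empty
        End If
        Return attr.Value
    End Function
End Module
```

The trade-off is that a caller can no longer tell a missing attribute from one that is present but empty, which is usually acceptable for merge-style processing.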
You'll run across a variety of processing needs as you encounter XML more frequently in the coming months or years. XSLT handles many of these well, such as rearranging the structure of XML or changing XML to HTML for output. You can handle other challenges, such as an intimate merge, more effectively by using .NET code and working with XMLDocument objects.
About the Author
Kathleen Dollard is an independent consultant doing real-world development in .NET technologies. She's currently using XSLT techniques to generate 450 classes in a 300+ KLOC project. She's active in the Denver Visual Studio User Group and is a regular contributor to , a Microsoft MVP, and a VSLive! speaker.