This blog on XML database technology and DB2 pureXML has been active for almost 5 years now. My former colleage Conor O’Mahony ran this blog from early 2008 to December 2009, and I took over on Dec 15, 2009. We have received a lot of feedback and many of the blog posts were inspired by questions from XML database users.

I’m not sure yet how much time I will have for blogging in 2013 and beyond. In any case, I liked the idea of producing a summary page that provides a quick overview of all the topics and posts in this blog. But, how to produce such a summary most efficiently?

Well, true to the topic of this blog I decided to load this blog as a piece of XML into DB2 pureXML and run queries on it!

First I downloaded the complete blog as an HTML document. But, HTML is not necessarily wellformed XML because HTML allows elements with missing end tags and other goofy things. Luckily there are free tools available that convert HTML into XHTML. XHTML is a form of HTML that complies with the rules for well-formed XML documents.

After converting the entire blog into an XML (XHTML) document, it took only three steps to create a list of blog post titles, dates, and URLs:

CREATE TABLE nativexmldatabase(doc XML);


LOAD FROM blog.del OF DEL REPLACE INTO nativexmldatabase(doc);


SELECT date, title, url
FROM nativexmldatabase,
   XMLTABLE('$DOC//div[@class="entrytitle"]'
   COLUMNS
     seq     FOR ORDINALITY,
     date    VARCHAR(20)   PATH 'h3',
     title   VARCHAR(105)  PATH 'h2/a',
     url     VARCHAR(145)  PATH 'h2/a/@href') AS T;

Of course, a quick look into the XML document is necessary to discover the required tag names and XPath expressions, such as //div[@class=”entrytitle”] to iterate over all the individual posts, or h2/a and h2/a/@href to retrieve the title and URL of each post.

This query produces a nice list of all blog post titles, dates, and URLs. Instead of posting this listing as-is, I decided to roughly group the blog posts along the following topic areas:

* Design
* Performance
* Migration
* Relational to XML or XML to relational
* XML Queries
* XML Storage and Indexing
* XML Schemas
* Application Development
* Use Cases and Case Studies
* ETL and Warehousing
* DB2 pureXML Product News
* Other Products and Tools
* Publications and Resources
* XML Coverage at IDUG and IOD Conferences
* News, Announcements, Miscellaneous

And here is the result, a table of contents for nativexmldatabase.com from 2008 through 2012. I hope this can serve as a quick reference to topics of interest.

Date Design
Dec 4, 2012 Good and bad XML design
Jan 21, 2012 Business Records in the 21st Century
Jan 8, 2012 Data Normalization Reconsidered
Jan 21, 2011 Name/Value Pairs – A pretty bad idea in XML as in Relational !
Jan 8, 2011 XML Manipulation in Application code vs. a Native XML Database
Sep 28, 2010 5 Reasons for Storing XML in a Database
Sep 26, 2008 Flexible Schemas: When to Persist Data in XML Instead of Relational
Aug 4, 2008 When to Store in XML Format
Mar 21, 2008 When to use a Native XML Database
Date Performance
Oct 23, 2012 Overcoming Performance Obstacles in XML Encryption
Nov 15, 2011 TPoX 2.1 has been released!
May 9, 2011 News Flash: Intel publishes TPoX Benchmark results on new 10-core Westmere-EX CPUs
Mar 4, 2011 New TPoX Benchmark Results Available
Jan 8, 2011 XML Manipulation in Application code vs. a Native XML Database
Aug 22, 2010 XML versus Relational Database Performance
July 14, 2010 A 10TB XML Data Warehouse Benchmark
Nov 3, 2008 Benchmark for 1TB Transactional XML System
Mar 20, 2008 Native XML versus CLOB and Shredding
Date Migration
June 29, 2012 How to migrate from the XML Extender to DB2 pureXML
Oct 3, 2010 How to migrate XML from LOB to XML columns
Date Relational to XML or XML to relational
Aug 30, 2010 XML Construction in Views and User-Defined Functions
Aug 24, 2011 How to generate XML from relational data – with line breaks!
June 8, 2011 How to quickly produce XML from relational tables
June 5, 2011 How to insert relational data into existing XML documents (Part 2)
May 21, 2011 How to add relational data into existing XML documents (Part 1)
Date XML Queries
Sep 13, 2012 How to Validate XML Documents in Database Queries
Aug 14, 2012 Result set cardinalities when querying repeating XML elements
Aug 10, 2012 Using the XMLTABLE function with MERGE statements in DB2 for z/OS
July 30, 2012 Using the XMLTABLE function in UPDATE and MERGE statements
May 24, 2012 How to query CDATA sections in XML
Feb 4, 2012 How to list the paths of all elements in an XML document?
Nov 30, 2011 XQuery support in DB2 10 for z/OS
Oct 9, 2011 Advanced SQL/XML: Joins and FLWOR Expressions in the XMLTABLE Function
Aug 29, 2011 Quantified expressions in XQuery: When ‘some’ and ‘every’ satisfy!
Mar 18, 2011 XQuery and SQL/XML: How to convert a date that is not a date into a date?
Aug 11, 2010 How to order query results based on XML values
May 26, 2010 XML Profiling – How to get a list of all elements and attributes
May 9, 2010 How to deal with dirty data
July 24, 2011 XQuery and Other XML Manipulation in SQL Stored Procedures
April 6, 2011 Returning data from one path based on predicates in another (part 2/2)
April 3, 2011 Intra-document references: Returning data from one path depending on predicates in another path
June 17, 2010 An XQuery Cheat Sheet
April 12, 2010 The XMLTABLE function and a case where a left outer join can help
Feb 21, 2010 Get Started with XML Full-Text Search in DB2
Jan 25, 2010 tokenize it !
Jan 11, 2010 Watch your step! Positional predicates in XPath.
Oct 14, 2008 XQuery: Introduction, Tips, and Future Directions
July 25, 2008 XQuery versus SQL/XML – Part Deux
June 26, 2008 XQuery versus SQL/XML
Date XML Storage and Indexing
July 11, 2012 XML Storage in DB2: To inline or not to inline?
June 8, 2012 XML Indexing in DB2 9.x and DB2 10 for Linux, UNIX, and Windows
July 11, 2011 How to index XML documents in the presence of schema diversity?
July 19, 2010 Structure-agnostic XML indexes in DB2
Dec 26, 2010 How to insert XML documents from the file system?
Feb 21, 2010 Get Started with XML Full-Text Search in DB2
May 6, 2009 DB2 Compresses XML Data by 60% to 80%
May 3, 2009 Multi-Dimensional Clustering for Tables with Native XML Data
May 1, 2009 Use Hash Partitioning for Fast Analysis of XML Data
Oct 16, 2008 DBA Concerns about Native XML Storage
April 8, 2008 Native XML Reduces Storage Costs
Date XML Schemas
Sep 13, 2012 How to Validate XML Documents in Database Queries
Feb 29, 2012 What is an XML Schema and why should I care?
Aug 22, 2008 Schema Evolution
Aug 21, 2008 Schema Flexibility
Date Application Development
Sep 29, 2011 At your service: REST with DB2 pureXML!
Aug 8, 2011 Using XQuery across Application Server and Database Server
July 30, 2011 Beyond the Database: Native XML and XQuery in the Application Server
Feb 25, 2011 CICS meets XML: Using CICS Web Servives with DB2 pureXML
Jan 8, 2011 XML Manipulation in Application code vs. a Native XML Database
Oct 8, 2010 Buidling XML Solutions with Rational Software Architect (RSA) and Rational Application Dev
Mar 1, 2010 XML Support in JDBC 4.0: The SQLXML Interface
Nov 9, 2009 Building a JSON and DB2 pureXML Application
Aug 5, 2008 Native XML Storage Reduces Development Costs
Nov 18, 2008 Referential Integrity and XML Data
Date XML Use Cases and Case Studies
Sep 12, 2011 DB2 pureXML – Rich in Proteins!
June 24, 2011 A true story about using XML forms and DB2 to process millions of tax returns
May 12, 2010 Webcast on XML Use Cases (May 18)
Feb 7, 2010 BJC HealthCare Improves Clinical Research with DB2 pureXML
Nov 2, 2009 Electronic Forms Using Adobe PDF and IBM DB2 pureXML
Sep 8, 2009 The Patient-Centered Medical Home
Aug 24, 2009 Electronic Health Records for Smarter Healthcare
July 17, 2009 Article about XML for Healthcare
June 16, 2009 Building an XML-Based Electronic Forms Solution
June 5, 2009 Short Video Highlighting Early Experiences with DB2 9.7
Mar 30, 2009 GUS Group uses DB2 pureXML for ERP System
Mar 13, 2009 UCLA Health System uses XML and SOA to Optimize Data Integration and Migration
Mar 2, 2009 Storing and Retrieving ACORD Data for Insurance
Feb 26, 2009 Speedy Communications and Collaboration in XML-based Environments
Dec 30, 2008 XForms and Native XML Storage
Mar 27, 2008 How to Choose Solution Areas
Mar 25, 2008 More information about Storebrand
Mar 23, 2008 Productivity Improvements at Storebrand
Date ETL and Warehousing
April 13, 2011 XML ETL with DataStage 8.5
July 14, 2010 A 10TB XML Data Warehouse Benchmark
June 23, 2010 More on Data Warehousing and XML
April 5, 2010 More on XML ETL
Mar 26, 2010 XML ETL
Jan 31, 2010 New Article on XML in the Data Warehouse
May 28, 2009 XML in the Data Warehouse
April 22, 2009 Fast Performance for Native XML Data in the Warehouse
Aug 24, 2009 Article on Integrating XML Operational Data into a Data Warehouse
Date DB2 pureXML product news
April 19, 2012 XML Enhancements in DB2 10 for Linux, UNIX, and Windows
Nov 30, 2010 Node-level XML Updates in DB2 10 for z/OS
Nov 18, 2010 New XML Features in DB2 10 for z/OS !
June 8, 2010 Updated version of a free XML database: DB2 Express-C 9.7.2
Aug 7, 2009 SOA Projects: IBM DB2 versus Oracle Database
Feb 11, 2009 XML Database in the Cloud
Feb 10, 2009 pureXML Moves into Core DB2 Product
Date Other Products and Tools
Dec 5, 2010 XML Tools for DB2 pureXML
Mar 8, 2010 Integration of Altova XMLSpy with DB2 pureXML
Jan 18, 2010 XPath and XQuery Functions in SQL Server vs. DB2
Jan 4, 2010 Shredding and constructing XML in mySQL
Dec 23, 2009 XML Support in mySQL
Dec 17, 2009 Oracle vs. mySQL: The Full Monty
Aug 7, 2009 MyCareTeam and IBM Collaborate to Improve Continuing Care for Diabetes
July 8, 2009 Why Won’t Oracle Publish XML Benchmark Results for TPoX?
Sep 30, 2008 XML in Oracle 11g
Sep 24, 2008 XML in SQL Server 2008
Sep 19, 2008 XML in SQL Server 2005
July 11, 2008 Questions for XML Database Vendors
May 13, 2008 XML-only Databases
May 12, 2008 All Native XML Databases are not Equal
Date Publications and Resources
Feb 13, 2011 Extremely pureXML in DB2 10 for z/OS
Sep 14, 2010 DB2 pureXML Cookbook – Errata
Aug 13, 2009 DB2 pureXML Cookbook – 45% Discount
June 3, 2009 DB2 pureXML for Dummies . Get Your Copy!
April 29, 2010 Get off to a fast start with DB2 pureXML
April 10, 2009 DB2 pureXML Cookbook
Feb 25, 2009 Intel Insights Magazine covers Native XML Storage
Feb 5, 2009 Henrik’s Blog
Feb 5, 2009 DB2 pureXML leads SOA World Magazine Reader’s Choice Award
Jan 21, 2009 Podcast Series about DB2 pureXML on the Mainframe
Jan 20, 2009 Reports for Information in Native XML Format
Dec 30, 2008 Anant Jhingran on Improving XML Applications
Sep 1, 2008 DB2 pureXML Online Communities
July 27, 2008 Getting Started with Native XML Databases
Date XML Coverage at IDUG and IOD Conferences
Oct 21, 2012 XML from Vegas to Berlin!
Oct 18, 2011 XML in Las Vegas !
April 25, 2011 XML sessions at IDUG North America 2011
Oct 21, 2010 Next week in Vegas
April 18, 2010 DB2 for z/OS pureXML Hands-on Lab @ IDUG North America
Oct 23, 2009 DB2 pureXML at the IOD Conference
May 28, 2009 IBM IOD Conference Attendees get a Poken for Networking
May 13, 2009 Short Video from the IDUG North America Confernece
Nov 11, 2008 XML at IBM Information on Demand Conference
Sep 11, 2008 DB2 pureXML at the IOD Conference
Sep 4, 2008 Meet Native XML Databases Users
Date News, Announcements, Miscellaneous
July 30, 2010 DB2 Express-C Light – A reduced footprint XML database
Mar 15, 2010 More DB2 pureXML Bootcamps
Feb 14, 2010 DB2 pureXML Webinars and Bootcamps
Dec 15, 2009 Hi there here is the new guy!
Dec 13, 2009 Welcome Matthias Nicola!
June 9, 2009 Flirting with Poken
April 22, 2009 The XML Rap Superstar
April 22, 2009 IDUG announces winners of XML Challenge
April 20, 2009 Teaser for Upcoming DB2 pureXML Features
Mar 12, 2009 DB2 and Cloud Computing – Chat with the Labs
Feb 5, 2009 Free Online Conference – Data in Action
Jan 20, 2009 DB2 Technology Sandbox
Dec 26, 2008 Native XML Storage for the Mac
Dec 3, 2008 Update on the XML Challenge
Nov 5, 2008 Why IBM DB2 is ideal for Transactional XML
Oct 9, 2008 10 Reasons why DBAs Should Understand Native XML
Oct 2, 2008 XML Challenge Web Site is Live!
Sep 28, 2008 1st Oct is Online Community Action Day
Aug 7, 2008 Webcast: A primer for storing and retrieving XML data
July 1, 2008 Viral Video – Wednesday
June 20, 2008 Viral Video – Tuesday
June 16, 2008 Viral Video – Monday
June 13, 2008 Learn to Use XML with Databases and win Prizes!
Mar 22, 2008 How to Market New Technology
Mar 19, 2008 Welcome

Happy browsing!

Good and bad XML design

December 4, 2012

Whenever people design things –such as houses, software systems, database schemas, or even XML structures– there are good and not so good design options.  Roughly speaking, a design is usually considered “good” if it allows for easy and (cost-)efficient use and implementation of whatever is being built.

For XML documents we have seen various cases of “good” and “bad” designs over the years. Previously I wrote about name/value pair design for XML documents, which often is not a good choice.

Recently I received a question regarding another design choice, which I would like to share here in a simplified form.

The question was about an XML message format to transmit a list of order numbers and the corresponding order system identifiers. Two alternative XML structures were proposed:

Option A:

<OrderList>
  <Order>
    <OrderNumber>A0000001</OrderNumber>
    <OrderSystem>ABC</OrderSystem>
  </Order>
  <Order>
    <OrderNumber>B0000001</OrderNumber>
    <OrderSystem>XP1</OrderSystem>
  </Order>
  <Order>
    <OrderNumber>C0000001</OrderNumber>
    <OrderSystem>Q-9</OrderSystem>
  </Order>
</OrderList>

Option B:

<OrderList>
    <OrderNumber>A0000001</OrderNumber>
    <OrderSystem>ABC</OrderSystem>
    <OrderNumber>B0000001</OrderNumber>
    <OrderSystem>XP1</OrderSystem>
    <OrderNumber>C0000001</OrderNumber>
    <OrderSystem>Q-9</OrderSystem>
</OrderList>

The motivation for option B might be a smaller message size. One might argue that XML elements are by definition ordered, and if we know that the message consists of pairs of order number and order system, then XML format B is as easy to interpret as format A. Or, is it not?

The problem that I see with option B is that the absence of the <Order> elements gives up the explicit structure of the data and instead requires implicit knowledge, i.e. one must know that the OrderNumber at position i belongs to the OrderSystem at position i+1.

In my view, this design is against the principles of XML. In XML, the tags and their nesting should describe the data explicitly. XML is meant to be a hierarchical data format so that data items that belong together are grouped under the same parent (as in option A). After all, that is the benefit of XML over a flat file format.

I also think that using the explicit structure of option A is less error-prone than using option B which, requires implicit knowledge based on element positions.

How about queryability?

It turns out that XML format A is considerably easier to query than option B. Let’s assume you need to code a SQL/XML query to list the order numbers and systems. For that exercise we assume that the messages are stored in a table “mytable” with XML column “doc”.

The query for XML option A is very simple, using the XMLTABLE function in its most basic form. The row-generating XPath expression $DOC/OrderList/Order iterates over the orders, and the column expressions extract the order number and system.
http://www.ibm.com/developerworks/data/library/techarticle/dm-0708nicola/

-- Query for XML option A:

SELECT OrderNumber, OrderSystem
FROM mytable,
     XMLTABLE ('$DOC/OrderList/Order'
       COLUMNS
         OrderSystem VARCHAR(20) PATH 'OrderSystem',
         OrderNumber VARCHAR(20) PATH 'OrderNumber' );

ORDERNUMBER          ORDERSYSTEM
-------------------- --------------------
A0000001             ABC
B0000001             XP1
C0000001             Q-9

3 record(s) selected.

How can you produce the same result set from XML format B? This is remarkably tricky! You need to iterate over the child elements and keep track of which element is at position i versus i+1 to produce the proper pairs in the output.

There are likely several ways to code the query for XML option B, and you see one solution below. It seems hard to query the elements by name alone because the important structural information is by position. Hence, this query iterates over the positions 1 to N, where N is the number of elements. For each even position 2, 4, 6, etc. the query produces the pair of the current ($pos) and the previous ($pos – 1) element to construct the proper output.

-- Query for XML option B:

SELECT OrderNumber, OrderSystem
FROM mytable,
     XMLTABLE('for $pos in (1 to count($DOC/OrderList/*))
               where $pos mod 2 = 0
               return  <pair>
                         {$DOC/OrderList/*[$pos - 1]}
                         {$DOC/OrderList/*[$pos]}
                       </pair>'
       COLUMNS
         OrderNumber VARCHAR(20) PATH 'OrderNumber',
         OrderSystem VARCHAR(20) PATH 'OrderSystem' );

OrderSYSTEM          OrderNUMBER
-------------------- --------------------
ABC                  A0000001
XP1                  B0000001
Q-9                  C0000001

3 record(s) selected.

This query is not only more complex but also potentially less efficient. Note that the return clause of the FLOWR expresion constructs the explicit pairing that it missing in XML format B.

Let me know if you can think of a simpler query to produce the same result from XML format B.

Next, imagine a message that doesn’t have just 2 fields per order but maybe 10 or 100 fields per order. Then option B becomes increasingly more ugly.

Or worse, what if there is a schema change that allows one order to have a different number of fields than the next? Then the query that we coded for XML format B will no longer work because it was based on a specific structural assumption that is no longer true. In contrast, our query for XML format A will work even if you add additional fields to some of the orders in the message.

In summary, the benefits for XML option A include the following:

  • Explicit structure rather than implicit positional format
  • Easier and more efficient to query (query by name rather query by postion)
  • More resilient to schema changes

I think that these benefits far outweigh the concern that XML format A is larger. Also, you could chose to send or store the XML in a compressed format to greatly alleviate space concerns.