nativexmldatabase.com – 2008 to 2012
December 28, 2012
This blog on XML database technology and DB2 pureXML has been active for almost 5 years now. My former colleage Conor O’Mahony ran this blog from early 2008 to December 2009, and I took over on Dec 15, 2009. We have received a lot of feedback and many of the blog posts were inspired by questions from XML database users.
I’m not sure yet how much time I will have for blogging in 2013 and beyond. In any case, I liked the idea of producing a summary page that provides a quick overview of all the topics and posts in this blog. But, how to produce such a summary most efficiently?
Well, true to the topic of this blog I decided to load this blog as a piece of XML into DB2 pureXML and run queries on it!
First I downloaded the complete blog as an HTML document. But, HTML is not necessarily wellformed XML because HTML allows elements with missing end tags and other goofy things. Luckily there are free tools available that convert HTML into XHTML. XHTML is a form of HTML that complies with the rules for well-formed XML documents.
After converting the entire blog into an XML (XHTML) document, it took only three steps to create a list of blog post titles, dates, and URLs:
CREATE TABLE nativexmldatabase(doc XML); LOAD FROM blog.del OF DEL REPLACE INTO nativexmldatabase(doc); SELECT date, title, url FROM nativexmldatabase, XMLTABLE('$DOC//div[@class="entrytitle"]' COLUMNS seq FOR ORDINALITY, date VARCHAR(20) PATH 'h3', title VARCHAR(105) PATH 'h2/a', url VARCHAR(145) PATH 'h2/a/@href') AS T;
Of course, a quick look into the XML document is necessary to discover the required tag names and XPath expressions, such as //div[@class=”entrytitle”] to iterate over all the individual posts, or h2/a and h2/a/@href to retrieve the title and URL of each post.
This query produces a nice list of all blog post titles, dates, and URLs. Instead of posting this listing as-is, I decided to roughly group the blog posts along the following topic areas:
* Design
* Performance
* Migration
* Relational to XML or XML to relational
* XML Queries
* XML Storage and Indexing
* XML Schemas
* Application Development
* Use Cases and Case Studies
* ETL and Warehousing
* DB2 pureXML Product News
* Other Products and Tools
* Publications and Resources
* XML Coverage at IDUG and IOD Conferences
* News, Announcements, Miscellaneous
And here is the result, a table of contents for nativexmldatabase.com from 2008 through 2012. I hope this can serve as a quick reference to topics of interest.
Happy browsing!
Good and bad XML design
December 4, 2012
Whenever people design things –such as houses, software systems, database schemas, or even XML structures– there are good and not so good design options. Roughly speaking, a design is usually considered “good” if it allows for easy and (cost-)efficient use and implementation of whatever is being built.
For XML documents we have seen various cases of “good” and “bad” designs over the years. Previously I wrote about name/value pair design for XML documents, which often is not a good choice.
Recently I received a question regarding another design choice, which I would like to share here in a simplified form.
The question was about an XML message format to transmit a list of order numbers and the corresponding order system identifiers. Two alternative XML structures were proposed:
Option A:
<OrderList> <Order> <OrderNumber>A0000001</OrderNumber> <OrderSystem>ABC</OrderSystem> </Order> <Order> <OrderNumber>B0000001</OrderNumber> <OrderSystem>XP1</OrderSystem> </Order> <Order> <OrderNumber>C0000001</OrderNumber> <OrderSystem>Q-9</OrderSystem> </Order> </OrderList>
Option B:
<OrderList> <OrderNumber>A0000001</OrderNumber> <OrderSystem>ABC</OrderSystem> <OrderNumber>B0000001</OrderNumber> <OrderSystem>XP1</OrderSystem> <OrderNumber>C0000001</OrderNumber> <OrderSystem>Q-9</OrderSystem> </OrderList>
The motivation for option B might be a smaller message size. One might argue that XML elements are by definition ordered, and if we know that the message consists of pairs of order number and order system, then XML format B is as easy to interpret as format A. Or, is it not?
The problem that I see with option B is that the absence of the <Order> elements gives up the explicit structure of the data and instead requires implicit knowledge, i.e. one must know that the OrderNumber at position i belongs to the OrderSystem at position i+1.
In my view, this design is against the principles of XML. In XML, the tags and their nesting should describe the data explicitly. XML is meant to be a hierarchical data format so that data items that belong together are grouped under the same parent (as in option A). After all, that is the benefit of XML over a flat file format.
I also think that using the explicit structure of option A is less error-prone than using option B which, requires implicit knowledge based on element positions.
How about queryability?
It turns out that XML format A is considerably easier to query than option B. Let’s assume you need to code a SQL/XML query to list the order numbers and systems. For that exercise we assume that the messages are stored in a table “mytable” with XML column “doc”.
The query for XML option A is very simple, using the XMLTABLE function in its most basic form. The row-generating XPath expression $DOC/OrderList/Order iterates over the orders, and the column expressions extract the order number and system.
http://www.ibm.com/developerworks/data/library/techarticle/dm-0708nicola/
-- Query for XML option A: SELECT OrderNumber, OrderSystem FROM mytable, XMLTABLE ('$DOC/OrderList/Order' COLUMNS OrderSystem VARCHAR(20) PATH 'OrderSystem', OrderNumber VARCHAR(20) PATH 'OrderNumber' ); ORDERNUMBER ORDERSYSTEM -------------------- -------------------- A0000001 ABC B0000001 XP1 C0000001 Q-9 3 record(s) selected.
How can you produce the same result set from XML format B? This is remarkably tricky! You need to iterate over the child elements and keep track of which element is at position i versus i+1 to produce the proper pairs in the output.
There are likely several ways to code the query for XML option B, and you see one solution below. It seems hard to query the elements by name alone because the important structural information is by position. Hence, this query iterates over the positions 1 to N, where N is the number of elements. For each even position 2, 4, 6, etc. the query produces the pair of the current ($pos) and the previous ($pos – 1) element to construct the proper output.
-- Query for XML option B: SELECT OrderNumber, OrderSystem FROM mytable, XMLTABLE('for $pos in (1 to count($DOC/OrderList/*)) where $pos mod 2 = 0 return <pair> {$DOC/OrderList/*[$pos - 1]} {$DOC/OrderList/*[$pos]} </pair>' COLUMNS OrderNumber VARCHAR(20) PATH 'OrderNumber', OrderSystem VARCHAR(20) PATH 'OrderSystem' ); OrderSYSTEM OrderNUMBER -------------------- -------------------- ABC A0000001 XP1 B0000001 Q-9 C0000001 3 record(s) selected.
This query is not only more complex but also potentially less efficient. Note that the return clause of the FLOWR expresion constructs the explicit pairing that it missing in XML format B.
Let me know if you can think of a simpler query to produce the same result from XML format B.
Next, imagine a message that doesn’t have just 2 fields per order but maybe 10 or 100 fields per order. Then option B becomes increasingly more ugly.
Or worse, what if there is a schema change that allows one order to have a different number of fields than the next? Then the query that we coded for XML format B will no longer work because it was based on a specific structural assumption that is no longer true. In contrast, our query for XML format A will work even if you add additional fields to some of the orders in the message.
In summary, the benefits for XML option A include the following:
- Explicit structure rather than implicit positional format
- Easier and more efficient to query (query by name rather query by postion)
- More resilient to schema changes
I think that these benefits far outweigh the concern that XML format A is larger. Also, you could chose to send or store the XML in a compressed format to greatly alleviate space concerns.