Part 2 of our article “Data normalization reconsidered” is now available at

http://www.ibm.com/developerworks/data/library/techarticle/dm-1201normalizationpart2/index.html

The second part discusses alternatives to a traditional normalized relational representation of data. Such alternatives include XML, JSON, and RDF, which can often help you overcome normalization issues, improve schema flexibility, or both. In the 21st century, digitized business records are often created in XML to begin with, which makes XML an attractive choice as the database-level storage format.

This article also contains a performance comparison between XML and relational data that was conducted for a real-world application scenario at a major international logistics company.

At the end of the article you will find comparison tables that summarize the pros and cons of the different data representations.

Normalization is a design methodology for relational database schemas and aims to minimize data redundancy and avoid data anomalies, such as update anomalies. The consequence of normalization is that business records (such as a purchase order, an insurance claim, a financial transaction, etc.) are split into pieces that are scattered over potentially many relational tables.

In addition to its benefits, normalization also introduces several drawbacks:

  • The insert of a single logical business record requires the insertion of multiple (often many) physical rows
  • The retrieval of a single business record requires complex multi-way joins or a series of separate queries
  • Business records undergo a potentially expensive conversion from their original representation outside the database to a normalized format and back
  • The normalized representation of business records is often difficult to understand because it is very different from the original format of the business record, such as a paper form or an XML message.

These issues raise the question of whether normalization should be applied as categorically as some people believe. Indeed, there are several reasons for reconsidering normalization, such as:

  • Throughout history, humans have always stored their business records “intact”; it was only the introduction of databases that “required” normalization
  • Normalization was introduced when storage space was extremely scarce and expensive, which is not (or much less) the case anymore today
  • Today, business records are often much more complex than they used to be in the 1970s when normalization was introduced, and this complexity amplifies the disadvantages of normalization
  • De-normalization is becoming more and more popular, e.g. in star schemas for data warehousing, but also in emerging storage systems such as HBase, Google’s BigTable, etc.

Today, business records are often created and exchanged in a digital format, and this format is often XML. XML is a non-normalized data format that can provide several benefits:

  • A single business record often maps naturally to a single XML document
  • A single business record/XML document can be inserted (and retrieved) in an XML database as a single operation
  • If you store XML as XML (i.e. without conversion to relational), the representation of a business record is the same inside and outside the database, which is tremendously valuable

When business records already exist in XML format outside the database anyway, it is usually best to also store them as XML rather than convert them into a normalized relational schema.
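As a minimal sketch of this idea (the table name and document content are made up for illustration), storing a business record intact in an XML column takes a single insert:

```sql
-- Hypothetical table with a single XML column
CREATE TABLE claims(doc XML);

-- One logical business record = one physical insert
INSERT INTO claims VALUES('
<claim>
  <claimNo>4711</claimNo>
  <claimant>Jane Doe</claimant>
  <amount currency="USD">1500.00</amount>
</claim>');
```

Retrieving the record is equally a single operation (SELECT doc FROM claims), with no joins and no reassembly of scattered rows.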

My colleague Susan Malaika and I have collected our thoughts and observations on normalization in a 2-part article titled “Data Normalization Reconsidered”. The first part has recently been published on developerWorks and can be found here:

http://www.ibm.com/developerworks/data/library/techarticle/dm-1112normalization/index.html

The 2nd part will appear soon. Happy reading!

If you think that mainframe computers are old dinosaurs that only run ancient COBOL code – think again! Now mainframes also run XQuery!

While DB2 for Linux, UNIX, and Windows has supported XQuery and SQL/XML since Version 9.1 (released in 2006), DB2 9 for z/OS “only” supported XPath and SQL/XML.

I have put the word “only” in quotes because, for many applications, XPath and SQL/XML are fully sufficient. When you combine XPath expressions with SQL/XML functions such as XMLTABLE, plus other SQL language constructs, you can write very powerful XML queries and accomplish many of the same things that you can do with XQuery.

DB2 10 for z/OS has added a variety of new XML features such as node-level XML updates, XML-type parameters and variables in stored procedures and user-defined functions, and enhancements for XML Schemas and XML indexes.

With APARs PM47617 and PM47618, DB2 for z/OS now also supports XQuery within the SQL functions XMLTABLE, XMLQUERY, XMLEXISTS, and XMLMODIFY.

So what are the additional capabilities and benefits that XQuery provides? Examples include:

  • You can compose new XML structures using direct element constructors
  • You can use FLWOR expressions (for-let-where-order by-return) to iterate over and manipulate intermediate query results
  • You can join and combine information from multiple XML documents
  • You can use XQuery comma-expressions to construct and use new sequences of XML nodes
  • You can code if-then-else logic to implement conditional expressions
  • etc.

Let’s look at some examples.

Construct new XML structures with direct element and attribute constructors

The following query constructs a new order summary document from each order (XML document) that is selected from the “orders” table. New elements, such as <orderSummary> and <orderedItems>, are constructed by providing the start and end tags explicitly. Similarly, the attribute orderNumber is also constructed explicitly. The content of the constructed elements and attributes is computed by XPath (or XQuery) expressions that extract selected information from each source document.

SELECT XMLQUERY('
          <orderSummary orderNumber="{$po/order/orderNo/text()}">
             <orderedItems>{$po/order/items/item/itemName}</orderedItems>
          </orderSummary>'
       PASSING orders.orderdoc AS "po")
FROM orders
WHERE ...

FLWOR expressions

The next query joins the tables “orders” and “items” on their XML columns “orderdoc” and “details”, respectively. The join predicate in the XMLEXISTS ensures that we find the items that match a given order and that we don’t produce a Cartesian product. For each pair of order document and item document, the FLWOR expression in the SELECT clause combines information from both documents into a new document that contains the item name, the ordered quantity, and the item details.

SELECT XMLQUERY('for $o in $po/order/items/item
                   for $i in $it/item
                 where $o/itemName = $i/name
                 return <orderedItem>
                           {$i/name}
                           {$o/quantity}
                           {$i/details}
                        </orderedItem>'
         PASSING orders.orderdoc AS "po", items.details as "it")
FROM orders, items
WHERE
XMLEXISTS('$po/order/items/item[itemName = $it/item/name]'
           PASSING orders.orderdoc AS "po", items.details as "it")
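The other capabilities in the list above, such as conditional expressions, follow the same pattern. As a sketch against the same “orders” table (the quantity threshold and the constructed element names are made up for illustration), an XQuery if-then-else can classify each order:

```sql
-- Hedged sketch: XQuery conditional expression inside XMLQUERY.
-- <bulkOrder>/<regularOrder> and the threshold 10 are hypothetical.
SELECT XMLQUERY('if ($po/order/items/item[quantity > 10])
                 then <bulkOrder>{$po/order/orderNo/text()}</bulkOrder>
                 else <regularOrder>{$po/order/orderNo/text()}</regularOrder>'
       PASSING orders.orderdoc AS "po")
FROM orders;
```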


For more information on XQuery in DB2 10 for z/OS, here is a good place to continue reading:

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.xml/src/tpc/db2z_expover.htm

TPoX 2.1 has been released!

November 15, 2011

First, what is TPoX?  I have two answers to that question.

Answer 1:

TPoX, short for “Transaction Processing over XML”, is an XML database benchmark that executes and measures a multi-user XML workload. The workload contains XML queries (70%) as well as XML insert, update, and delete operations (30%). TPoX simulates a simple financial application that issues XQuery or SQL/XML transactions to stress XML storage, XML indexing, XML Schema support, XML updates, logging, concurrency and other components of an XML database system.

The TPoX package contains an XML data generator, a multi-user workload driver, and a set of predefined XQuery and SQL/XML transactions.

TPoX was developed by Intel and IBM and has been freely available as open source since 2007. A variety of TPoX performance results and other uses of TPoX have been reported.

Answer 2:

TPoX is a very flexible and extensible tool for performance testing of relational databases, XML databases, and other systems. For example, if you have a populated relational database, you can use the TPoX workload driver to parameterize, execute, and measure plain old SQL transactions with hundreds of simulated database users. I have used TPoX for a lot of relational performance testing because it’s so easy to set up and measure concurrent workloads. The workload driver reports throughput, min/max/avg response times, percentiles and confidence intervals for response times, and other useful metrics. Oh, and by the way, TPoX happens to include an XML data generator and a set of sample XML transactions, in case you’re interested in XML database performance.


In the latest release, TPoX 2.1, we have further enhanced the extensibility of the TPoX Workload Driver. The XML data and XML transactions are still the same.

Some of the enhancements in TPoX 2.1 include:

  • higher precision in reported response times
  • proper handling and counting of deadlocks, if any
  • easier post-processing of results in Excel or other spreadsheet software
  • new types of workload parameters, such as random dates, random timestamps, sequences, etc.
  • in addition to SQL, SQL/XML, and XQuery, transactions can now also be supplied as Java plugins, allowing you to run and measure anything (concurrently!) that you can code in Java, such as:
      • complex transactions that include application logic
      • calls to web services or message queues
      • obtaining data from RSS or Atom feeds
      • transactions against databases or content repositories that do not have a JDBC interface

We have already found these extensions extremely valuable for some of our own performance testing, and we’re happy to share them. You can download TPoX 2.1 (free, open source) and find more detailed information in the release notes as well as the TPoX documentation that is included in the download.

XML in Las Vegas !

October 18, 2011

The annual Information on Demand conference in Las Vegas (Oct 23-27, 2011) is always a great venue to get updated on the latest topics in data management. The IOD conference offers a broad range of technical sessions, business sessions, hands-on labs, and lots of networking opportunities. More than 200 customer speakers will present their first-hand experiences with IBM Software.

The conference is in the Mandalay Bay convention center. Various topics around DB2 pureXML are covered in presentations and hands-on labs. Some of them are listed here:

2088A – DB2 pureXML for Beginners
Mon, Oct 24, 2011, 3:45 PM – 5:00 PM

1120B – DB2 pureXML: Develop Your Application Prototype in Minutes, not Days
Tue, Oct 25, 2011, 9:45 AM – 12:45 PM (hands-on lab)

1930A – DB2 for z/OS SOA Solutions Powered by pureXML and DataPower
Wed, Oct 26, 2011, 3:15 PM – 4:15 PM

2691A – Architecting with IBM pureXML and Temporal Data in DB2 10 for z/OS
Thu, Oct 27, 2011, 8:15 AM – 9:30 AM

2087B – Get Connected: Publishing Relational Data as XML Messages
Thu, Oct 27, 2011, 3:30 PM – 4:30 PM

If you’re going to IOD, don’t miss these sessions!

If you look at existing documentation and examples of the XMLTABLE function, you will find that most examples use simple XPath expressions inside the XMLTABLE function and rarely more complex XQuery expressions, such as FLWOR expressions. This is also true for my two articles on XMLTABLE:

XMLTABLE By Example, Part 1 and Part 2:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0708nicola/
http://www.ibm.com/developerworks/data/library/techarticle/dm-0709nicola/

Why is that? Is it because XQuery expressions other than plain path expressions are never required in the XMLTABLE function? Well, let’s take a look at that question…

First, I do believe that many typical SQL/XML queries can indeed be coded with just regular XPath in the XMLTABLE function. I have seen that this is true in many real application deployments.

But, sometimes it can certainly be useful to embed more complex XQuery expressions than just XPath in an XMLTABLE function. For example, in my blog post “How to get a list of all elements and attributes” you saw that I used the XQuery comma operator to construct a sequence from two expressions. In that case the comma expression was //(*,@*) to get all elements and all attributes. In the same post you also saw the use of the XQuery if-then-else expression in the column definition of the XMLTABLE function.

Using FLWOR expressions in the XMLTABLE function can be useful when you join and combine XML from two different XML columns. For example, let’s consider the following two tables with the subsequent sample documents:


CREATE TABLE order(orderdetails XML);
CREATE TABLE product(description XML);

INSERT INTO order VALUES('
<order>
  <orderNo>123</orderNo>
  <item>
    <pid>24TX98</pid>
    <quantity>1</quantity>
    <price>29.95</price>
  </item> 
  <item>
    <pid>901V8</pid>
    <quantity>2</quantity>
    <price>5.00</price>
  </item> 
</order>');

INSERT INTO product VALUES('
<product pid="24TX98">
  <name>Outdoor paint</name>
  <description>
    <weight>3kg</weight>
    <color>white</color>
  </description>   
</product>');

INSERT INTO product VALUES('
<product pid="901V8">
  <name>Paint brush</name>
  <description>
    <weight>0.1kg</weight>
  </description>   
</product>');

The order table contains one XML document per order, and the product table contains one XML document per product. Each product has a @pid attribute as a unique identifier, and orders have /order/item/pid elements to specify which products have been ordered. Naturally, this allows us to perform a join between orders and products.

For example, the following query is a join to retrieve the product information for all the products that have been ordered in order 123:


SELECT product.description
FROM order, product
WHERE XMLEXISTS('$ORDERDETAILS/order[orderNo = 123]')
  AND XMLEXISTS('$DESCRIPTION/product[ @pid/fn:string(.) =
                   $ORDERDETAILS/order/item/pid/fn:string(.) ]');

This query returns one row (one XML document) for each product that is referenced in order 123.

Now, what if you want to combine data from an order document and the associated product documents, and return this information in one relational row per product?  For example, assume you want rows that show the product ID and product weight (from the product documents) as well as the price and quantity of that product in the order document.

In that case you could use an XQuery FLWOR expression to perform the join and combine the product and order information in the XMLTABLE function, as shown in the following query.

Since the product and order tables appear in the FROM clause of the query, the FLWOR expression references their XML columns through the variables $DESCRIPTION and $ORDERDETAILS. The return clause of the FLWOR expression constructs XML fragments that combine the desired elements and attributes from each matching pair of product and order documents. The constructed <result> elements are then input to the COLUMNS clause where they are broken up into relational columns.


SELECT T.*
FROM order, product,
     XMLTABLE('for $o in $ORDERDETAILS/order[orderNo = 123]/item
                for $p in $DESCRIPTION/product
                where $p/@pid/fn:string(.) = $o/pid/fn:string(.)
                return
                    <result>
                        {$p/@pid}
                        {$p/description/weight}
                        {$o/quantity}
                        {$o/price}
                    </result>'
         COLUMNS
             prodid VARCHAR(15)  PATH '@pid',
             weight VARCHAR(5)   PATH 'weight',
             qty    INTEGER      PATH 'quantity',
             price  DECIMAL(6,2) PATH 'price')
         AS T;

PRODID          WEIGHT QTY         PRICE
--------------- ------ ----------- --------
24TX98          3kg              1    29.95
901V8           0.1kg            2     5.00

  2 record(s) selected.

It is worth noting that you can produce the same result set without FLWOR expressions, as shown in the next query. This query uses two XMLTABLE functions, one for the desired order information and one for the desired product information. The key trick is that the two XMLTABLE functions are joined on the product ID using the predicate [@pid = $p]. Without this predicate, the two XMLTABLE functions would produce an undesired Cartesian product.


SELECT T2.prodid, T2.weight, T1.qty, T1.price
FROM order, product,
     XMLTABLE('$ORDERDETAILS/order/item'
        COLUMNS
           prodid VARCHAR(15)  PATH 'pid',
           qty    INTEGER      PATH 'quantity',
           price  DECIMAL(6,2) PATH 'price')
     AS T1,
     XMLTABLE('$DESCRIPTION/product[@pid = $p]'
                                  PASSING T1.prodid AS "p"
        COLUMNS
           prodid VARCHAR(15)  PATH '@pid',
           weight VARCHAR(5)   PATH 'description/weight')
     AS T2;

PRODID          WEIGHT QTY         PRICE
--------------- ------ ----------- --------
24TX98          3kg              1    29.95
901V8           0.1kg            2     5.00

  2 record(s) selected.

The join predicate between the two XMLTABLE functions can also be placed in the SQL WHERE clause:


SELECT T2.prodid, T2.weight, T1.qty, T1.price
FROM order, product,
     XMLTABLE('$ORDERDETAILS/order/item'
        COLUMNS
           prodid VARCHAR(15)  PATH 'pid',
           qty    INTEGER      PATH 'quantity',
           price  DECIMAL(6,2) PATH 'price')
     AS T1,
     XMLTABLE('$DESCRIPTION/product'
        COLUMNS
           prodid VARCHAR(15)  PATH '@pid',
           weight VARCHAR(5)   PATH 'description/weight')
     AS T2
WHERE T1.prodid = T2.prodid;       

However, this query applies the join predicate after the XMLTABLE functions have produced their rows, which can be slower than the previous query where the predicate was in the XMLTABLE function itself.

You can find many more examples of the XMLTABLE function, XML joins, and joins between XML and relational columns in Chapter 9 of the DB2 pureXML Cookbook.

Undeniably, XML is the message format of choice for many service-oriented architectures and application integration efforts. Also, many SOA and web service implementations use REST as the protocol for accessing URL-addressable resources and services. REST stands for Representational State Transfer and is built on top of HTTP, which acts as the underlying transport layer.

With a new set of user-defined functions (UDFs) in DB2, it has become very simple to issue REST requests directly from your SQL statements or DB2 stored procedures. This enables DB2 to interact easily with REST-based services and to integrate information from the web or other URL-based resources into the database.

The new REST UDFs allow you to receive and provide information in binary format (BLOB) or textual format (CLOB), which includes but is not limited to XML. The basic REST UDFs are scalar functions that perform the simple HTTP operations GET, POST, PUT, and DELETE. These UDFs are:

  • DB2XML.HTTPGETCLOB
  • DB2XML.HTTPPOSTCLOB
  • DB2XML.HTTPPUTCLOB
  • DB2XML.HTTPDELETECLOB
  • DB2XML.HTTPGETBLOB
  • DB2XML.HTTPPOSTBLOB
  • DB2XML.HTTPPUTBLOB
  • DB2XML.HTTPDELETEBLOB

Each of these functions takes a URL and, optionally, an HTTP header as input parameters. Additionally, the POST and PUT functions have a third parameter for the BLOB or CLOB data that you want to upload.

The REST UDFs are implemented in Java and support HTTP as well as HTTPS with SSL encryption.

When you use these UDFs in SQL statements to receive information, you can:

  • insert the information into DB2 tables
  • use the information as a parameter to search for related data in DB2
  • join the information with existing data in DB2
  • process the information in a stored procedure
  • or use the information in any other way that SQL provides

In particular, if the information is in XML format you can apply any DB2 pureXML functions to it, such as:

  • the XMLQUERY function to extract pieces from the XML message
  • the XMLTABLE function to split the XML into smaller XML pieces
  • the XMLTABLE function to shred the XML into relational tables
  • the XMLVALIDATE function to validate the XML against an XML Schema
  • etc.
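To make this concrete, here is a hedged sketch that combines two of the points above: fetch an XML resource with HTTP GET and extract an element from it. The URL and element names are placeholders, and passing a NULL CLOB to mean “no extra HTTP headers” is an assumption; check the article below for the exact parameter conventions.

```sql
-- Hedged sketch: GET an XML document and pull out one element with XMLQUERY.
-- 'http://example.org/catalog.xml' and /catalog/book/title are hypothetical.
SELECT XMLQUERY('$doc/catalog/book[1]/title'
         PASSING XMLPARSE(DOCUMENT
                   DB2XML.HTTPGETCLOB(
                     'http://example.org/catalog.xml',   -- placeholder URL
                     CAST(NULL AS CLOB(1K)))) AS "doc")  -- no extra headers (assumption)
FROM SYSIBM.SYSDUMMY1;
```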

If you are interested and want to try this for yourself, read the following article:
“Accessing HTTP and RESTful services from DB2: Introducing the REST user-defined functions for DB2”
http://www.ibm.com/developerworks/data/library/techarticle/dm-1105httprestdb2/

This article describes the REST functions for DB2 in more detail and presents several concrete usage examples. The UDFs themselves are available for download at the bottom of this article.

Enjoy your REST !

How much protein is in a steak? Well, a 6-ounce steak can contain about 38 grams of protein.

How much protein does salmon contain? Approximately 34 grams of protein in a 6-ounce piece of salmon.

How much protein can you find in lentils? A cup of cooked lentils has 18 grams of protein.

And how much protein can you find in DB2 pureXML? Tons! Hundreds of gigabytes and soon over 1 TB of proteins!

It’s true, DB2 pureXML is an excellent source for proteins. DB2 can store, compress, and index the proteins so that you always find the right ones quickly when you need some.

What distinguishes DB2 from a steak is that DB2 stores all proteins in XML or in a hybrid XML/relational format! This enables members of the biological research community to analyze and compare the structure and composition of protein molecules. The findings can help them explain diseases, develop new drugs, or understand the interactions between different proteins.

The protein data is publicly available from the Protein Data Bank (PDB), which is a worldwide repository of structural protein data. To facilitate data exchange and flexibility, the data is available in XML.

Typically one XML document describes a single protein molecule. Such a document includes detailed information about all the atoms that the protein consists of, which can be hundreds or sometimes thousands of atoms. And to represent the structure of the protein, the 3-dimensional spatial coordinates of each atom are included as well, resulting in XML documents that can be hundreds of MB in size to describe a single protein.

Searching and analyzing such amounts of complex information is a challenge. To tackle this challenge, researchers in Germany decided to harness the hybrid XML/relational capabilities of DB2 to efficiently store and query the protein data. I was happy to assist them to get the most out of DB2 pureXML.

The article “Managing the Protein Data Bank with DB2 pureXML” describes the database design and optimization that facilitate protein data analysis in a scalable manner. Even if you’re not a biologist (I’m not!), the article is an interesting case study of how a real-world data management problem has been mapped to the hybrid features of the DB2 database system.

If you are familiar with SQL you have probably seen quantified SQL predicates that compare a single value with a set of values. For example, the following SQL query returns all rows from table1 where col1 is greater than all values in col2 of table2.

SELECT *
FROM table1
WHERE col1 > ALL (SELECT col2 FROM table2)

And if you replace the keyword ALL with the keyword SOME, then the query above returns all rows from table1 where col1 is greater than at least one of the values in col2 of table2.

These comparisons are also known as universal quantification (“ALL”) and existential quantification (“SOME”), respectively.

Similar quantified comparisons are possible in XQuery too, but the syntax is slightly different. In XQuery you have the keywords “every … satisfies …” for universal quantification and “some … satisfies …” for existential quantification. Let’s use the following table and two simple sample documents to explore how they work:

create table customer(doc XML);

insert into customer values('
<customer id="1">
    <phone type="cell">408-516-4963</phone>
    <phone type="cell">408-087-1234</phone>
    <phone type="cell">408-111-222</phone>
</customer>
');

insert into customer values('
<customer id="2">
    <phone type="home">408-516-4963</phone>
    <phone type="cell">408-087-1234</phone>
    <phone type="office">408-111-222</phone>
</customer>
');

Now assume that we want to find customers where all their phone numbers are of type “cell”. We can code a corresponding query in XQuery or in SQL/XML notation as follows:

xquery
for $c in db2-fn:xmlcolumn("CUSTOMER.DOC")/customer
where every $p in $c/phone satisfies $p/@type = "cell"
return $c;

select *
from customer
where xmlexists('$DOC/customer[every $p in ./phone satisfies $p/@type = "cell"]');

Both queries return just the first of the two customer documents. The second document is not selected because not all of its phone elements have a @type attribute with the value “cell”.

The quantified expression every $p in $c/phone satisfies $p/@type = "cell" means that $p iterates over all the items in the sequence $c/phone, and for each such $p the predicate $p/@type = "cell" must be true.

If we want to return all documents where at least one of the phone numbers is of type "cell" (existential quantification!), then we can simply replace the keyword "every" with the keyword "some", as in the following query:

select *
from customer
where xmlexists('$DOC/customer[some $p in ./phone satisfies $p/@type = "cell"]');

This query returns both of our sample documents, because both contain at least one phone element whose @type attribute has the value "cell". Since XPath uses existential quantification by default, you can often omit the "some … satisfies …" and obtain the same result with an ordinary path expression, like this:

select *
from customer
where xmlexists('$DOC/customer[phone/@type = "cell"]');

For more information, use the following resources.

DB2 Information Center – Quantified Expressions:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/xqrquanexp.html

XQuery specification of quantified expressions:
http://www.w3.org/TR/xquery/#id-quantified-expressions

Generating XML from relational data is not difficult. The SQL standard contains a set of XML construction functions that take column names or other SQL expressions as input and return XML. The most prominent of these XML construction functions are:

  • XMLELEMENT (constructs an XML element)
  • XMLATTRIBUTES (constructs one or more XML attributes)
  • XMLFOREST (constructs a sequence of XML elements)
  • XMLTEXT (constructs an XML text node)
  • XMLAGG (aggregates data from multiple rows into a single XML fragment)
  • XMLNAMESPACES (constructs a namespace declaration)
  • etc.

Here is a simple example:

create table mydata(c1 int, c2 varchar(25));
insert into mydata values (1, 'This is a test');

select xmlelement(name "myroot",
               xmlelement(name "col1", c1),
               xmlelement(name "col2", c2) )
from mydata;

<myroot><col1>1</col1><col2>This is a test</col2></myroot>

 1 record(s) selected.
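Among the construction functions listed above, XMLAGG is the one that spans rows. As a sketch against the same mydata table (a minimal illustration, not tied to any particular result): with several rows inserted, XMLAGG collects one constructed element per row into a single XML value.

```sql
-- Hedged sketch: aggregate one <col2> element per row of mydata
-- into a single <allrows> element.
select xmlelement(name "allrows",
           xmlagg( xmlelement(name "col2", c2) ))
from mydata;
```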

The XML is typically returned in textual (serialized) format, as a single line without any line breaks. That’s perfectly fine, except that line breaks can sometimes be useful to make the XML easier to read for the human eye.

There are several ways in which you can add newline characters to the XML to make it more readable.

If the XML that you construct is less than 32 KB in size, then you can use the SQL function REPLACE to replace every occurrence of '><' with the hex string 3E0A3C. The hex character 3E is the same as '>', the hex character 3C is the same as '<', and the hex character 0A is a line feed in between them. Before we can apply the REPLACE function to the constructed XML, we need to explicitly convert the XML to the data type VARCHAR, which can be done with the function XMLSERIALIZE.

The following query is extended from the previous example by adding the functions XMLSERIALIZE and REPLACE:

select replace(
       xmlserialize(
          xmlelement(name "myroot",
               xmlelement(name "col1", c1),
               xmlelement(name "col2", c2) ) AS VARCHAR(32000)),
         '><',
         x'3E0A3C')
from mydata;

<myroot>
<col1>1</col1>
<col2>This is a test</col2>
</myroot>

 1 record(s) selected.

The replace function takes three parameters:

  • the constructed serialized XML
  • the string that should be searched and replaced, i.e. '><'
  • the new string that replaces the search string, i.e. x'3E0A3C'

However, this approach works only for smallish XML, up to 32 KB.

If you generate larger XML, you have two options. First, you can perform the replace operation at the client that receives the XML. Second, you can explicitly construct XML text nodes that contain newline characters where you want them. For example:

select xmlelement(name "myroot",
              xmltext(x'0A'),
              xmlelement(name "col1", c1),
              xmlelement(name "col2", c2),
              xmltext(x'0A') )
from mydata;

<myroot>
<col1>1</col1><col2>This is a test</col2>
</myroot>

 1 record(s) selected.


select xmlelement(name "myroot",
              xmltext(x'0A'),
              xmlelement(name "col1", c1),
              xmltext(x'0A'),
              xmlelement(name "col2", c2),
              xmltext(x'0A') )
from mydata;

<myroot>
<col1>1</col1>
<col2>This is a test</col2>
</myroot>

 1 record(s) selected.

If you want a line break after every XML element, this approach can be tedious and makes the SQL statement quite a bit longer. Then again, the line breaks are really only needed for human readability, not for any client application that consumes the generated XML.


In my previous post I discussed the native XML capabilities provided by the IBM WebSphere Application Server V7.0 Feature Pack for XML. This Feature Pack allows you to manipulate XML in your Java application directly with XQuery, XPath, or XSLT.

If you use the WebSphere XML Feature Pack together with DB2 pureXML, one of the obvious questions is: when should I use XPath or XQuery in the application server and when should I push these operations to DB2 and run XQuery or SQL/XML in the database layer?

The best split between XML processing in one tier vs. the other certainly depends on the exact nature of your application and may vary on a case by case basis. Still, here are some general considerations and examples.

Example 1:
Assume your application needs to retrieve an XML document from the database based on one or multiple XML search conditions, but the decision about which information to extract from the document can only be made at runtime after the document has been retrieved. In this situation it is typically best to issue an XQuery or SQL/XML query to DB2 to retrieve the full document, exploiting DB2’s XML indexes to evaluate the search conditions efficiently. The extraction of specific elements is then performed with XPath or XQuery in the application server as part of the application’s business logic.

Example 2:
As a variation of Example 1, consider a hybrid database schema where some data is in relational columns and some in XML columns. Assume you need to retrieve XML documents based on relational and XML predicates, possibly involving joins between the table that holds the XML and other fully relational tables. It is typically recommended to use a database query that contains both relational and XML predicates to minimize the amount of data that is retrieved from the database and sent to the application. The extraction of XML elements would still happen in the application server, especially if it depends on post-retrieval application logic.

Example 3:
The application needs to retrieve specific elements from one or multiple XML documents, and these elements are fixed and known before the database query is submitted. In this case it is typically best to code the search and the extraction in a single XQuery or SQL/XML statement for DB2. Again, the general idea is to reduce data early and not ship more data around than necessary.
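
For example, search and extraction can be combined in a single SQL/XML statement (names are hypothetical):

```sql
-- Filter documents and return only the required element, so no
-- surplus data travels to the application server.
SELECT XMLQUERY('$d/purchaseorder/customer/address'
                PASSING document AS "d")
FROM orders
WHERE XMLEXISTS('$d/purchaseorder[orderDate = "2011-05-20"]'
                PASSING document AS "d");
```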

Example 4:
The application receives an XML document from a web service and needs to examine it with an XPath query before it can decide whether the document should be inserted into the database or passed on to another service. Typically, this operation should be implemented in the application server rather than calling out to the database when that call might not be needed.

Example 5:
The application receives a transient XML document from a web service, and based on specific element values in that document the application needs to retrieve related documents from the DB2 database. Although you could pass the full input document as a parameter to a SQL/XML database query and perform the entire processing in DB2, it can be easier and more efficient to first extract the relevant values from the input document in the application layer and then send a simpler and more targeted query to DB2.

Example 6:
The application needs to retrieve XML documents from the database and apply logic that has already been coded in XQuery modules and XQuery user-defined functions. Such modules are an optional feature of the XQuery language standard; they are supported in the WebSphere XML Feature Pack but not in DB2. The obvious choice is to retrieve the full documents from DB2 (similar to Examples 1 and 2) and then apply the additional XQuery capabilities of the WebSphere XML Feature Pack to these documents.

 

These examples are certainly not an exhaustive list of potential scenarios. Also, they are not meant to be hard rules, but just illustrations of the trade-offs and considerations that can be involved in the decision process when you split XML processing between the application server and the database.

 

For concrete coding examples with the WebSphere XML Feature Pack and DB2 pureXML, see the following articles:

“Programming XML across the multiple tiers: Use XML in the middle tier for performance, fidelity, and development ease” – http://www.ibm.com/developerworks/library/x-xmlfeat1/

“Programming XML across the multiple tiers, Part 2: Write efficient Java EE applications that exploit an XML database server” – http://www.ibm.com/developerworks/xml/library/x-xmlfeat2/

 

Native XML support in a database such as DB2 allows you to store, index, query, validate, and update XML documents efficiently. In particular it’s the database support for XQuery and SQL/XML that enables powerful queries and updates of stored XML data, or XML and relational data combined.

However, XML capabilities in the database alone are often not sufficient to manage XML across the enterprise. Integrated XML support is also required in application development tools, ETL products such as DataStage, reporting tools like Cognos, and other critical parts of the IT landscape, such as the application server.

To that end, the WebSphere XML Feature Pack provides developers with key XML technologies to efficiently manipulate XML in WebSphere Java applications. It offers support for:

  • XPath 2.0
  • XQuery 1.0
  • XSLT 2.0
  • and earlier versions of XPath and XSLT, if needed

In essence, you can write Java applications for deployment in WebSphere Application Server and within your Java code use XPath, XQuery, or XSLT directly to work with XML, including transient XML messages. One of the key advantages is that you can apply XPath and XQuery expressions directly to your XML data instead of coding equivalent operations much more tediously via a low-level SAX or DOM API or other object models. This has several benefits:

  • Simplified XML application development
  • Increased developer productivity
  • XML operations are easier to understand and maintain over time

If you think about it, these are the same reasons why you implement many applications in Java (or similar languages) rather than Assembler: you use the comfort and expressiveness of a higher-level language (XQuery) to avoid the complexity, tediousness, and error-proneness of a lower-level approach.

A second set of benefits emerges when you look at the handling of XML across the database and application server. On the database side, the benefits of storing and processing XML natively instead of shredding it into relational format are well-known:

  • Reduced complexity, because shredding any but the simplest XML documents is a complicated task
  • Increased flexibility, because native XML allows for document variability and schema evolution, whereas shredding relies on a static mapping that breaks easily when new document variations are encountered
  • Improved performance, because the conversion between two different data representations is avoided
  • Guaranteed document fidelity, because the original XML business record stays in its XML format

The very same benefits can be realized in the middle tier by processing XML natively in the application server. With the XQuery, XPath, and XSLT support in the WebSphere XML Feature Pack you can manipulate XML in its original XML format without having to convert it to a different representation, such as an object structure.

With increasing complexity or variability of your XML documents, converting XML to Java objects becomes increasingly complicated, inflexible, and slow. For example, using JAXB to map XML to objects requires Java classes to be generated that constitute a fixed and inflexible mapping.

And finally, treating XML as XML in the database and the application server realizes all of the above-mentioned benefits across the middle tier and the database by eliminating any impedance mismatch and data conversion between the tiers.

Here are some resources where you can learn more about the IBM WebSphere Application Server V7.0 Feature Pack for XML and see some concrete examples of its use.

Product page and download:
http://www.ibm.com/websphere/was/xml

The IBM Red Paper: “Getting Started with the WebSphere Application Server Feature Pack for XML”, includes many examples:
http://www.redbooks.ibm.com/abstracts/redp4654.html?Open

The XML Feature Pack documentation:
http://publib.boulder.ibm.com/infocenter/wasinfo/fep/index.jsp?topic=/com.ibm.websphere.xmlfep.multiplatform.doc/info/ae/ae/welcome_fepxml.html

And a collection of links to further information:
http://webspherecommunity.blogspot.com/2009/11/websphere-application-server-feature.html

Relational databases have supported SQL stored procedures for a long time. The benefits of SQL procedures are well-known and include:

  • Reduced coding labor due to the creation of reusable processing modules
  • Richer processing capabilities in the database by defining custom data processing logic
  • Improved performance and reduced network traffic because stored procedures are executed close to the data, i.e. in the database engine itself

With the introduction of XML support in relational databases, several database products (such as DB2) allow you to access and manipulate not only relational data but also XML in stored procedures. This is made possible through:

(a) the introduction of the XML data type in the SQL type system, and

(b) several XML functions in the SQL language such as XMLQUERY, XMLEXISTS, and XMLTABLE that can contain XPath or XQuery expressions.
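
As a brief illustration of these functions (table and paths are hypothetical): XMLEXISTS acts as a predicate, XMLQUERY extracts values, and XMLTABLE shreds XML items into a relational result:

```sql
-- XMLTABLE turns repeating XML items into rows; XMLEXISTS filters
-- which documents are processed.
SELECT t.name, t.city
FROM orders o,
     XMLTABLE('$d/purchaseorder/customer' PASSING o.document AS "d"
              COLUMNS
                name VARCHAR(30) PATH 'name',
                city VARCHAR(30) PATH 'address/city') AS t
WHERE XMLEXISTS('$d/purchaseorder[status = "shipped"]'
                PASSING o.document AS "d");
```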

The picture below provides an example of some of the basic capabilities for XML handling in stored procedures:

  • Input and output parameters of type XML. In DB2, if an XML document already exists in a parsed format in the database then passing it to the XML parameter of the procedure does not require the document to be reparsed. XML procedures can operate directly on the parsed representation of the XML. Also, if an application calls a stored procedure and passes a textual XML document to the XML parameter, the document gets parsed only once upon entering the procedure regardless of how often the XML parameter is used in the body of the procedure.
  • XML type variables. XML variables can be assigned values of type XML, such as a document or some document fragment, even a single XML element or atomic value. XML variables hold XML in a parsed format, which is important for performance. For example, the second and last SET statement in the picture below uses an XQuery FLWOR expression to read a certain address from the input document and assigns it to the XML variable “address”.
  • XML value extraction. The XMLQUERY function can contain XPath or XQuery expressions to extract values from a document. In the first SET statement in the picture below, the XMLQUERY function on the right-hand side takes the XML input parameter “custDoc” as context and uses an XPath expression to extract the @Cid attribute from the document. The XMLCAST function converts the attribute to the SQL type INTEGER, so that it can be assigned to the SQL variable “id”.
  • XML conditions. If-then-else statements in the SQL procedure language can use XML conditions in the IF-clause to decide what to do next in the control flow of the stored procedure.
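
A minimal sketch of a procedure with these capabilities might look as follows (all table, element, and attribute names are hypothetical, and the logic is only illustrative):

```sql
-- Hypothetical procedure illustrating an XML parameter, XML value
-- extraction, an XML variable, and an XML condition.
-- (Run with an alternate statement terminator, e.g. @, in the CLP.)
CREATE PROCEDURE process_customer(IN custDoc XML)
LANGUAGE SQL
BEGIN
  DECLARE id      INTEGER;
  DECLARE address XML;

  -- XML value extraction: read the @Cid attribute into an SQL variable
  SET id = XMLCAST(XMLQUERY('$CUSTDOC/customerinfo/@Cid') AS INTEGER);

  -- XML variable assignment with an XQuery FLWOR expression
  SET address = XMLQUERY('for $a in $CUSTDOC/customerinfo/addr
                          where $a/@country = "Canada"
                          return $a');

  -- XML condition controlling the flow of the procedure
  IF XMLEXISTS('$CUSTDOC/customerinfo[phone]') THEN
    INSERT INTO customers(cid, info) VALUES (id, custDoc);
  END IF;
END
```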


The examples above are certainly not an exhaustive list of what you can do with XML in stored procedures. For example, while the stored procedure above operates only on a single document that is passed in via the XML parameter, stored procedures can also use XQuery or SQL/XML to access sets of XML documents that are stored in XML columns.

Stored procedures can also compose, prepare, and execute XQuery or SQL/XML statements dynamically. You can define cursors over the result sequence of an XQuery and process each returned item one by one in a loop. You can also construct, split, shred, modify, merge, compare, or validate XML documents in stored procedures.

Of course, all the error and exception handling that is available for stored procedures can also be used with any XML operations. In summary, you can code sophisticated XML processing logic (or hybrid XML/relational processing logic) with the familiar concepts of SQL stored procedures.

Many more examples of XML processing in DB2 Stored Procedures can be found in Chapter 18 of the DB2 pureXML Cookbook.

XML is a flexible and extensible data format and is often used for messages or other document types whose structure can have multiple different variations or evolves over time. As a result, you might not be dealing with a single fixed document schema but with significant schema variability, or schema diversity.

Many XML database systems allow you to store documents for many different schemas (or different versions of the same schema) in a single collection. This is also true for DB2, where a collection is an XML column that is not necessarily tied to a single XML schema. Hence, applications can store, index, query, validate, and update documents for different schemas in a single XML column.

When you validate different documents in an XML column against different XML schemas, DB2 remembers which document was validated against which schema. When you later retrieve a specific document, you can easily determine which schema it was validated against when it was inserted or last updated.

A common question is: how do you effectively index XML documents when there is schema variability or schema diversity?

At a high level, DB2 provides three options:

  • Define individual XML indexes for the specific elements and attributes that need to be indexed for good performance
  • Index all elements and/or all attributes
    • Creating an XML index on the path //text() indexes all text nodes, i.e. essentially all element values
    • Creating an XML index on the path //@* indexes all attributes
  • Use a full-text index that indexes all elements and attributes, using DB2 Text Search
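
In DB2, the “index everything” options can be written roughly as follows (table and column names are hypothetical):

```sql
-- Index all element values (text nodes):
CREATE INDEX idx_alltext ON docs(document)
  GENERATE KEY USING XMLPATTERN '//text()' AS SQL VARCHAR(100);

-- Index all attribute values:
CREATE INDEX idx_allattr ON docs(document)
  GENERATE KEY USING XMLPATTERN '//@*' AS SQL VARCHAR(100);
```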

The good thing is that DB2 allows you to “index everything” but it doesn’t force you to do so – unlike some other database systems. This is important, because indexing all elements and all attributes in all documents requires a lot of space and impacts the performance of insert, update, and delete operations. Therefore, indexing everything should be a conscious choice.

Although full-text indexing with DB2 Text Search can address schema variability, I think the decision to use Text Search should be based predominantly on the type of searches that need to be performed. If the applications commonly perform full-text search for phrases and substrings, or maybe even require fuzzy search, stemming, and other advanced text features, then DB2 Text Search should be used. However, if the data is more structural than textual in nature, and if most predicates use regular =, >, >=, etc. operators, then I would usually prefer a set of regular XML indexes.

A key benefit of XML indexes in DB2 is that they contain index entries only for those documents that they match. For example:

Let’s say we have 10 million XML documents in an XML column, spanning 10 different schemas with approximately 1 million documents per schema. If we define an XML index on an XML element (XPath) that exists in only 1 of the 10 schemas, then this index holds entries for only 1 million documents. If this element is an optional element that occurs in only 20% of the documents for that one schema, then this index holds entries for only 200,000 documents. Thus, this index will be substantially smaller than an index that contains entries for all documents.

As a result, defining a number of specific XML indexes for various elements and attributes is very space efficient! So, if you can identify which nodes need to be indexed for good performance, you can index them in a space-conscious manner.

The index maintenance cost for insert/update/delete operations follows the same guideline as in the relational world: the more indexes you define, the more CPU and I/O is spent updating them. But suppose you have defined 50 XML indexes on an XML column and the document you are inserting contains elements and attributes that match only 5 of those indexes; then only those 5 indexes are updated when the document is inserted.

A large number of XML indexes defined on a single XML column is not a concern for query performance. The DB2 query compiler is very efficient at determining quickly which index can (and should) be used for a given query.

Also note that XML indexes in DB2 can be defined with wildcards, which gives you a lot of options between defining an index for just one element and defining an index for all elements. Some examples:

  • Let’s say you are indexing purchase orders and you know that all or most of the customer address information should be indexed. If you know that the path to the customer address information is /purchaseorder/customer/address but the subtree under the <address> element is subject to variability, you could define an index on /purchaseorder/customer/address//text() to index all element values in the subtree under <address>.
  • If you know that there is only one level of elements under address, you could define an index on /purchaseorder/customer/address/* to index all of these elements, and maybe another index on /purchaseorder/customer/address/*/@* for the attributes of these elements, if any.
  • If you know that you need to index the <zipcode> element under <address>, but the <address> can be in different subtrees for different documents, you could define an index on //address/zipcode or /purchaseorder//address/zipcode.
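
The corresponding index definitions might look like this (table and column names are hypothetical):

```sql
-- All element values in the subtree under <address>:
CREATE INDEX idx_addr ON orders(document)
  GENERATE KEY USING XMLPATTERN '/purchaseorder/customer/address//text()'
  AS SQL VARCHAR(50);

-- <zipcode> elements wherever an <address> appears:
CREATE INDEX idx_zip ON orders(document)
  GENERATE KEY USING XMLPATTERN '//address/zipcode' AS SQL VARCHAR(10);
```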

For more information on XML indexes, see

http://www.ibm.com/developerworks/data/library/techarticle/dm-0611nicola/

http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/topic/com.ibm.db2z10.doc.xml/src/tpc/db2z_indexxml.htm (DB2 for z/OS)

http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/c0024039.html (DB2 for Linux, UNIX, Windows)

A short summary of why and how the New York State Department of Taxation & Finance is using DB2 pureXML is available here:

http://smarterquestions.org/2011/06/a-true-story-about-using-xml-forms-and-db2-to-process-millions-of-tax-returns/

Many of the reasons why XML is useful for forms processing also apply to other forms-based applications, e.g. in healthcare, insurance, government, and other industries. Most cases of forms processing share the following characteristics:

  • Many different types of forms are being used, each one having a different set of fields. (This is schema diversity!)
  • Some forms tend to change over time, usually to support new or changed business processes or regulatory requirements. Simply put, fields on the forms get changed, added, or removed. (This is schema evolution!)
  • When information is filled into any given form, there are typically many optional fields that can (and often will) remain blank. (This is sparsely populated data!)

As it turns out, XML is very well suited to handle schema diversity, schema evolution, and sparse data. Optional XML elements and attributes that may or may not appear in a given document are convenient to handle sparse data, whereas in a relational model sparse data often leads to large numbers of NULL values.

XML also supports schema flexibility, e.g. documents for different schemas (or different versions of the same schema) can be stored, indexed, and queried in a single XML column.

And finally, if most of the application operations touch one form at a time, e.g. insert a form, retrieve a form, update a form, validate a form, and so on, then an XML solution where each form is a single XML object typically also provides significant performance benefits over a normalized relational schema for the same information.

 

 

In DB2 for Linux, UNIX, and Windows, the quickest and easiest way to produce XML from relational tables is to use the XMLROW or XMLGROUP function. The following examples illustrate how this works.

Let’s first define a relational table and insert a couple of rows:

create table employee(id int, name varchar(30), dept char(3));

insert into employee values (1, 'John', 'PX8');
insert into employee values (2, 'Mary', 'KH2');

select id, name, dept from employee;

ID          NAME                           DEPT
----------- ------------------------------ ----
          1 John                           PX8
          2 Mary                           KH2

  2 record(s) selected.

Now, let’s select the same columns with the XMLROW function:

select xmlrow(id, name, dept) from employee;

<row><ID>1</ID><NAME>John</NAME><DEPT>PX8</DEPT></row>

<row><ID>2</ID><NAME>Mary</NAME><DEPT>KH2</DEPT></row>

  2 record(s) selected.

You can see that each row is returned as an XML document, using column names as element names. And you can further customize the XML if you like. For example, you can rename the root element for each row:

select xmlrow(id, name, dept OPTION ROW "EMPL") from employee;

<EMPL><ID>1</ID><NAME>John</NAME><DEPT>PX8</DEPT></EMPL>

<EMPL><ID>2</ID><NAME>Mary</NAME><DEPT>KH2</DEPT></EMPL>

  2 record(s) selected.

You can also rename the elements for individual columns, if you want:

select xmlrow(id AS "EMP_NO", name, dept OPTION ROW "EMPL") from employee;

<EMPL><EMP_NO>1</EMP_NO><NAME>John</NAME><DEPT>PX8</DEPT></EMPL>

<EMPL><EMP_NO>2</EMP_NO><NAME>Mary</NAME><DEPT>KH2</DEPT></EMPL>

  2 record(s) selected.

And if you prefer to combine all selected rows into a single XML document, you would use XMLGROUP instead of XMLROW:

select xmlgroup(id, name, dept OPTION ROW "EMPL") from employee;

<rowset>
  <EMPL><ID>1</ID><NAME>John</NAME><DEPT>PX8</DEPT></EMPL>
  <EMPL><ID>2</ID><NAME>Mary</NAME><DEPT>KH2</DEPT></EMPL>
</rowset>

 1 record(s) selected.

For more options and information on these functions, see:

XMLROW:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0050587.html

XMLGROUP:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.sql.ref.doc/doc/r0050588.html

SQL/XML publishing functions for constructing XML values:
http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.xml.doc/doc/c0023909.html

In my previous blog post I promised that I would provide some examples of how to insert information from multiple relational tables and multiple rows into an XML document.

So, let’s continue the example from my previous post, where we had a table with relational INTEGER and DATE columns as well as an XML column. For simplicity, we use only one XML document in this XML column, but there could be many more:

DROP TABLE documents;
CREATE TABLE documents(document_id INTEGER, created DATE, document XML);

INSERT INTO documents VALUES(2, '2011-05-21',
'<mydoc>
   <header>
     <title>Introduction to XML</title>
     <author id="23">Peter Pan</author>
     <author id="49">Susie Swing</author>
   </header>
   <body>
     <info>What is XML?</info>
   </body>
 </mydoc>');

Let’s add a second table “document_reviews” that contains a rating and review for each document. Documents and reviews are related to each other and can be joined based on the document_id column:


CREATE TABLE document_reviews(document_id INTEGER,
                              rating CHAR(2),
                              review VARCHAR(32000) );

INSERT INTO document_reviews VALUES(2, 'A+', 'This book is...');

Now we want to read the XML document from the “documents” table and insert the corresponding rating information from the “document_reviews” table into the XML. This task is quite easy, using the techniques described in my previous post and simply using a join between the two tables:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify do insert <rating>{$RATING}</rating>
                           into $new/mydoc/header
                 return $new' )
FROM documents, document_reviews
WHERE documents.document_id = document_reviews.document_id;

Note that the XQuery variable $RATING refers to the “rating” column in the document_reviews table. And the result of this query is the original document with the desired <rating> element inserted into the <header>:

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>A+</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

1 record(s) selected.

Now it gets more interesting. What would the query above return if there were *multiple* review rows for this one document? Well, let’s find out! Let’s insert a second review for the same document_id and run the query again.

INSERT INTO document_reviews VALUES(2, 'C', 'I did not like...');

After inserting this additional review, our query above returns the XML document twice, once for each of the two reviews:

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>A+</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>C</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

2 record(s) selected.

The first document in the result set contains a <rating> element with the rating from the first review, and the second document contains a <rating> element according to the second review. The original XML document is duplicated for each rating, due to the nature of the relational join between the two tables.

Although this form of output might be useful in some cases, chances are that you would like all ratings for this document to be included in a single copy of the document. This can be achieved if we convert the join into a correlated subquery that produces a sequence of <rating> elements that are all inserted into a single copy of the document:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify do insert db2-fn:sqlquery("
                             SELECT XMLELEMENT(NAME ""rating"", rating)
                             FROM document_reviews
                             WHERE document_id = parameter(1)" ,
                             $docid            )
                           into $new/mydoc/header
                 return $new'
                 PASSING documents.document_id as "docid" )
FROM documents;

In this query, the modify clause contains the function db2-fn:sqlquery, which takes an SQL query and optional parameters as input. This SQL query reads the rating column from the document_reviews table and constructs a sequence of <rating> elements.

The WHERE clause of this subquery is parameterized, using $docid as the parameter value (line 5). The $docid is passed into the XQuery context from the documents.document_id column (line 9) and ensures that the subquery reads ratings only for the correct document. As a result, a single document with two <rating> elements is returned:

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <rating>A+</rating>
    <rating>C</rating>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

1 record(s) selected.

I think these examples provide a good idea of the powerful capabilities that exist for combining XML and relational data.

The beauty of a “hybrid” database system is that it allows you to manage both XML and relational data at the same time, even side-by-side in the same table if you want. You can choose which data to store in XML and which in relational format. Best practices for determining the “right” mix of XML and relational data in a database schema are an interesting topic, but I’ll reserve that for a separate blog post in the near future.

Today, let’s look at how you can insert information from relational columns into an XML document. Let’s start with a simple table and a couple of rows. Each row holds an XML document in an XML column as well as a document_id and the date when the document was created:

CREATE TABLE documents(document_id INTEGER, created DATE, document XML);

INSERT INTO documents
VALUES(1, '2011-05-20',
'<mydoc>
   <header>
     <title>How to skin a cat</title>
     <author id="17">John Doe</author>
   </header>
   <body>
     <info>There are many ways...</info>
   </body>
 </mydoc>');

INSERT INTO documents
VALUES(2, '2011-05-21',
'<mydoc>
   <header>
     <title>Introduction to XML</title>
     <author id="23">Peter Pan</author>
     <author id="49">Susie Swing</author>
   </header>
   <body>
     <info>What is XML?</info>
   </body>
 </mydoc>');

Now let’s assume we want to retrieve these documents and include the “created” date from the relational column as an XML element <created> within the <header> element. This is not hard. You can simply include an XQuery update expression in your query that retrieves the documents:

SELECT XMLQUERY('copy $new := $doc
                 modify do insert <created>{$crdate}</created>
                           into $new/mydoc/header
                 return $new'
                 PASSING documents.document as "doc",
                         documents.created  as "crdate")
FROM documents;

The XMLQUERY function contains the XQuery update expression (in single quotes) and the PASSING clause that passes column information to the XQuery expression. Here we are passing the XML columns “document” and the DATE column “created”, so that they can be referenced as XQuery variables $doc and $crdate.

DB2 for Linux, UNIX, and Windows also allows an abbreviated syntax, i.e. you can omit the PASSING clause and reference the column names directly (in upper case), as in the following equivalent query:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify do insert <created>{$CREATED}</created>
                           into $new/mydoc/header
                 return $new' )
FROM documents;

The XQuery update expression assigns the XML from the XML column ($DOCUMENT) to the variable $new and then it modifies and returns $new. The “modify” clause inserts the new XML element <created> into the /mydoc/header element, using the value from the relational DATE column that is referenced here as $CREATED.

As a result, both of the queries above return the following output, with the desired <created> elements:

<mydoc>
  <header>
    <title>How to skin a cat</title>
    <author id="17">John Doe</author>
    <created>2011-05-20</created>
  </header>
  <body>
    <info>There are many ways...</info>
  </body>
</mydoc>

<mydoc>
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <created>2011-05-21</created>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

It’s that simple!

Note that these queries do not change the documents that are stored in the “documents” column. They only modify the documents on the fly while reading them out. If you want to permanently change the documents, you would use the exact same XQuery update expression in an SQL UPDATE statement rather than a SELECT statement, like this:

UPDATE documents
SET document =  XMLQUERY('copy $new := $DOCUMENT
                          modify do insert <created>{$CREATED}</created>
                                    into $new/mydoc/header
                          return $new' );

But, let’s stick with queries for now and extend the XQuery update expression to also merge the relational document_id into the XML. Specifically, assume we need to include the document_id as an attribute “docid” in the “mydoc” element. We can simply add another “insert” operation in the modify clause, which is easy because the modify clause can have a comma-separated list of operations, including insert, replace, and delete operations.

To add an attribute you have to specify “insert attribute” followed by the desired attribute name (docid) and the attribute value in curly brackets. The curly brackets can either contain a literal value in double quotes or, as in our case here, a reference, such as our reference to the relational DOCUMENT_ID column:

SELECT XMLQUERY('copy $new := $DOCUMENT
                 modify (do insert <created>{$CREATED}</created>
                            into $new/mydoc/header,
                         do insert attribute docid {$DOCUMENT_ID}
                            into $new/mydoc )
                 return $new' )
FROM documents;

And the result is as desired:

<mydoc docid="1">
  <header>
    <title>How to skin a cat</title>
    <author id="17">John Doe</author>
    <created>2011-05-20</created>
  </header>
  <body>
    <info>There are many ways...</info>
  </body>
</mydoc>

<mydoc docid="2">
  <header>
    <title>Introduction to XML</title>
    <author id="23">Peter Pan</author>
    <author id="49">Susie Swing</author>
    <created>2011-05-21</created>
  </header>
  <body>
    <info>What is XML?</info>
  </body>
</mydoc>

Going one step further, you might wonder how to insert data from relational columns in *other* tables into these XML documents. Well, you can do a join so that the desired values from the other table line up in the same row as your XML document, and then apply the same techniques as above.

It gets slightly more interesting if you want to add information from *multiple* rows into a single document, based on a one-to-many relationship. I can show you how to do this, but I’m about to board a flight to go on vacation, so you’ll find a couple of examples in my next blog post when I’m back, or many more examples in the DB2 pureXML Cookbook!

Intel has recently launched its new “Westmere-EX” CPUs, which are now part of the Intel Xeon E7 processor family. For five years now, Intel has had a strong history of testing and showcasing many of its latest processors with the Transaction Processing over XML (TPoX) benchmark. So it comes as no big surprise that Intel has now also released TPoX benchmark results with its latest 10-core Xeon E7-4870 processor.

This benchmark uses TPoX Version 2.0 with 1TB of raw XML data, running on DB2 9.7 under SUSE Linux 10. DB2 compression reduces the 1TB of raw data to 537GB, which includes all required XML indexes.

The system under test is an Intel Xeon E7-4870 server with 4 CPUs, 40 cores, 80 threads, and a clock speed of 2.4GHz. Further details on the CPUs can be found here. The storage configuration consists of 54 Intel X25-E Solid State Drives (SSDs): 48 SSDs for the database tables and indexes plus 6 SSDs for the DB2 transaction log.

To showcase the maximum possible TPoX throughput on this hardware, Intel ran the TPoX workload driver with 440 concurrent connections (“users”) all of which issue a random stream of read and write transactions without think time. The result is a whopping 17,757 TTPS (TPoX Transactions per Second)!

This result constitutes 29% higher performance than on the previous-generation CPU, the 8-core Nehalem-EX, which achieved 13,745 TTPS. Details on these TPoX benchmark results are available on the TPoX results page, where you’ll find a history of all TPoX results as well as examples of other TPoX usage.

The great thing about these benchmarks is that they continuously push hardware, operating system, and database to the edge, relentlessly exposing any inefficiencies that may exist and giving us the chance to continuously improve the technology. This latest TPoX benchmark is also one of the first TPoX tests that runs entirely on solid state drives!

Now it’s only one week until the International DB2 User Group (IDUG) holds their annual North American conference, this time in Anaheim, CA, on May 2 – 6. The conference agenda looks terrific, with more than 100 technical sessions, full-day seminars, panel discussions, vendor solution presentations, and of course lots of opportunity for networking.

The technical program also includes several sessions on XML:

Session G04: “Relational vs. pureXML: Battle of the Titans”
Konstantin Tadenev (UPS), Tuesday, May 3,  4:30pm

Session E08: “Real-World XML Application Development Patterns for DB2”
Fred Sobotka (FRS Consulting), Wednesday, May 4, 2:45pm

Session E14: “The XMLTABLE Function – A Swiss Army Knife for Application Developers!”
Matthias Nicola (IBM Silicon Valley Lab), Friday, May 6, 9:15am

I’m really looking forward to these sessions.

Konstantin’s session should be particularly interesting, especially for those people who remain doubtful about XML performance. Konstantin and his colleagues have conducted a comparative performance study where they addressed a real-life business problem first with traditional relational database means and then with DB2 pureXML.  The results illustrate considerable throughput improvements and CPU time savings for the pureXML solution!

Fred’s session is certain to be very interesting as he will be sharing his rich experience from a variety of DB2 pureXML projects that he has been involved in. These include projects in e-commerce, government, and Fortune 500 companies where native XML storage, XML shredding, and construction of XML from relational data were required.

I hope to see you in Anaheim!