A few people have recently asked about the ability to ensure referential integrity in XML data. In this case, they are referring to the referential integrity feature that is common to relational databases. I thought I’d take a few moments to share one possible answer with you.

Referential integrity is a relational database feature that ensures consistency is maintained between items that reference one another. For instance, if you have a database table with order information that refers to a database table with product information, referential integrity ensures that each order refers to a valid product. In technical terms, referential integrity ensures that each product ID specified in the order table (where the product ID is a foreign key in the order table), exists in the product table.

Over the years, DBAs have found referential integrity to be a valuable feature in their relational databases. And now, they are asking how to ensure referential integrity with XML data.

The XML standard does not include mechanisms for ensuring referential integrity amongst XML elements. You can use schemas to place constraints on the XML data, but these constraints are not really enforced by the database (except insofar as the database is used to validate via a schema). And the ability to place constraints on XML data doesn’t quite add up to the ability to the ability to ensure the same level of consistency checking offered by referential integrity in a relational database.

However, there is a possible answer. This is one of the cases where using a hybrid relational/XML database proves to be very useful. You extract values from certain XML elements and store those values in relational columns in the same table, and then place constraints on those relational columns to ensure referential integrity. There is a cost associated with this approach, due to increased storage and additional programming logic. However, for many, this increased cost is justified by the ability to ensure referential integrity.

I thought I’d take a few moments and provide a quick report on the Information on Demand conference in Las Vegas. Like the past couple of years, it was a very interesting conference from a native XML storage point-of-view. There were all sorts of interesting sessions, including sessions about:

  • Innovative uses of hybrid relation/XML modeling
  • Native XML storage in SOA environments
  • Native XML storage for electronic medical records
  • Native XML storage for government information sharing
  • Native XML storage for electronic forms
  • And much, much more

For me, one of the highlights of the show was a session delivered by Li Cui and Charlie Wang from UCLA Health System. They described the different implementations of electronic medical records at UCLA Health System, and the benefits they are realizing from the adoption of native XML storage. Their presentation materials were simply outstanding, offering a very clear before and after indication of the impact of native XML storage. In their words, DB2 pureXML “significantly simplifies database design, web services development and robustly provides capability to handle XML data”. We heard about how it now takes hours instead of weeks to set up new types of electronic medical records, as well as a number of other quantifiable benefits. But it was the impact on the patients that left the greatest impression on me. It was great to hear how–thanks to their new system–patient records are now available much sooner, which obviously has a very real and significant impact on patient care.

Across all of the native XML storage sessions, some themes did emerge. One theme is that many organizations are using native XML storage to optimize their Service Oriented Architecture (SOA) environments. XML is the de facto language for information exchange, and there are a variety of use cases where organizations are persisting that XML in native format. These include the storage of transactional data, the implementation of a service bus cache, the logging of user events, and more. Another theme is that many organizations are using XML to augment traditional relational types. There are situations where modeling data using traditional relational types is challenging. One of those situations is where the structure of data changes often. In such cases, storing the fixed data in traditional relational columns and storing the data whose format changes often in XML format is an ideal solution. It takes advantage of the relative strengths of traditional relational types, as well as taking advantage of the flexible nature of XML for data whose format changes often.

I’m looking forward to the next conference… IOD Europe in Berlin next June.

As you saw in my previous post, IBM DB2 has spectacular performance numbers when benchmarking with XML data. As a reminder, DB2 performed 48.5 million transactions from 200 concurrent users over a two hour period on a server 4 CPUs, where each CPU had 6 cores. Admittedly, this is a relatively powerful server. But then again, you would expect to need a powerful server to process 48.5 million transactions in two hours. Today’s post will explain one of the key reasons why DB2 offers such impressive performance when compared to other major database vendors.

Most database vendors perform a similar set of steps when working with XML data. They parse the XML data, translate the XML data into a stream of tokens, logically represent the stream of tokens in a tree-like structure, store the tree-like structure, create indexes, and then manipulate and query the data as necessary. Of course, different vendors implement these steps in different ways, leading to some vendors enjoying a performance advantage in certain regards. For instance, some vendors have a higher reliance on large indexes than others, slowing data inserts, updates, and deletes. These vendors also typically require greater amounts of storage because of the need for these large indexes. However, it is query performance that concerns most people, so let’s focus on that.

Working with XML data poses an additional set of challenges when compared to working with traditional relational data. Many operations with XML data involve navigating around the tree-like structure that represents the XML data. For example, a query may impose a predicate on the values of the City and Age XML tags that appear inside a Customer tag, while returning the value of the Name tag that appears inside the same Customer tag. When you break this query into the atomic operations that are performed on the physical data, the query will involve many navigations of the tree-like structure. The database must navigate to the Customer element and then to the City element within it, apply the predicate, navigate to the Age element for that customer, apply the predicate, and if the predicates are satisfied, navigate to the Name element for that customer. And the database must perform these operations for many, many rows. Appreciating the need to optimize navigations of the tree-like representation of the XML data is one of the keys to understanding why IBM DB2 performs so well.

DB2 implements the tree-like structure in a way that allows you to very quickly navigate around the tree. Because DB2 stores pointers to both parent and child nodes for each element, and because DB2 optimizes the speed with which those navigations are performed, DB2 quickly navigates around the tree-like structures that represent the XML data. So, it is a combination of the physical representation of the tree-like structure, together with special algorithms for navigating around trees, that helps IBM achieve such blistering speed. After all, if you can optimize the atomic operations that underpin this breed of database queries, it only follows that you will enjoy the benefits at query time.

I do not have direct knowledge of the internal workings of other major relational database vendors. My opinions regarding how IBM manages XML data when compared with other major relational database vendors are drawn from information that is freely available on the internet. The opinions expressed in this post are mine and mine alone. They do not necessarily represent the opinions of my employer, IBM.

IBM continues to openly benchmark its DB2 pureXML capabilities.  Last week at the Information On Demand Conference in Las Vegas, Intel and IBM released details of their latest joint benchmark.  The goal of the benchmark is to show the performance levels that you can expect for inserts, updates, deletes, and queries on transactional XML data.  This latest benchmark is with 1TB of XML data using the FIXML standard industry format.  The results are very interesting.  Thanks to the way that DB2 stores XML data, the XML data occupies less than 450GB of disk space when stored in DB2.  If you compare the results of this benchmark against previous benchmarks, you can see that the addition of 50% more processing cores provides 48% more throughput (at lower CPU utilization rates), indicating a nice scale-out story.  Intel actually measured a 4-CPU server, with 6 cores per CPU, processing more than 6700 XML-based TPoX transactions per second.  One of the key aspects of the benchmark is the minimal amount of database tuning needed to obtain these results. For more details, including hardware specifications and database configuration, see the presentation from the conference.