A couple of months ago I wrote about the updated version 9.7.2 of the free DB2 Express-C, which includes all of DB2’s base features, including DB2 pureXML. Now there is also a DB2 Express-C Light version for Windows.

“Light” means that the DB2 user interface is only available in English (other languages cannot be chosen during installation), and that the DB2 Control Center, DB2 Text Search, the Configuration Assistant, and the “First Steps” tutorial components are not included. This reduces the size of the download image by 44%!

The pureXML features are still included in the Light version. If you do want a graphical user interface, you may want to get the free version of Data Studio. Yes, that’s a separate download with its own footprint, but it contains many features (including XML features) that aren’t in the Control Center.

Once you get into the download dialog of the DB2 Express-C web page, you can choose either the full or the light version, as shown below:

Structure-agnostic XML indexes are indexes that you can define without knowledge about the structure and format of the XML documents that will be stored and queried. That is, the XML element and attribute names and their nesting are not known in advance. In such cases, the future queries and their search conditions are often also unknown.

Whether you need structure-agnostic XML indexes or not depends on your XML application. Let’s consider two cases:

1. Known XML Schema (often good for path-specific indexes):

Many XML applications have an XML Schema that defines the expected and permissible XML document format, i.e. you do know the XML element/attribute names and their nesting ahead of time. With some knowledge of the application you can then make an educated guess about which elements or attributes are likely to be used in search conditions or join predicates.

With this knowledge you can define path-specific XML indexes to index just those elements and attributes that are likely to appear in predicates. This is similar to the common practice for relational database tables, where you don’t index all columns but only those that are frequently constrained in queries.

For example, if you expect predicates on the zipcode of customer address information, and you know that the path is /customer/addr/zipcode, then you can define the following index in DB2:


create index zipIdx on customer(xmldoc) generate keys using
xmlpattern '/customer/addr/zipcode' as sql varchar(20);
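
To illustrate how such an index gets used, here is a sketch of a query whose predicate matches the indexed path. The table and column names follow the example above; the zipcode value is made up:

```sql
-- Hypothetical query against the customer table from the example above.
-- The equality predicate on /customer/addr/zipcode matches the indexed
-- path, so DB2 can use zipIdx instead of scanning all documents.
select xmldoc
from customer
where xmlexists('$d/customer/addr[zipcode = "95123"]'
                passing xmldoc as "d");
```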

DB2 also offers an XML index advisor that looks at an XML workload (a set of XML queries) and recommends a set of beneficial XML indexes for specific elements and attributes.

Path-specific indexes have the benefit that you index only what needs to be indexed, which conserves storage space and supports high insert/update/delete performance for XML documents.

Interestingly, Microsoft SQL Server does not support path-specific XML indexes. It forces you to index ALL elements and ALL attributes, even if you require only some of them to be indexed.

2. Unknown XML Schema (structure-agnostic indexes):

If the schema and expected queries are unknown, then you can choose to “index everything”. For example, in DB2 you can define the following two indexes. The first index covers the text nodes of all XML elements, and the second index covers all XML attributes.


create index elemIdx on customer(xmldoc) generate keys using
xmlpattern '//text()' as sql varchar(7985);

create index attrIdx on customer(xmldoc) generate keys using
xmlpattern '//@*' as sql varchar(7985);

Subsequently, query predicates on an element’s text node or on any attribute can use one of these indexes.
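
As a sketch (reusing the hypothetical customer table from the earlier example), a query with a value predicate on an arbitrary element can be answered with the help of elemIdx, even though no path was known when the index was created:

```sql
-- The predicate constrains the text value of some element ("status"
-- here is a made-up element name); the structure-agnostic elemIdx on
-- '//text()' can be considered by the optimizer for this comparison.
select xmldoc
from customer
where xmlexists('$d//status[. = "gold"]'
                passing xmldoc as "d");
```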

These two indexes are limited to element and attribute values of up to 7985 bytes per value, which is more than sufficient for most applications. Alternatively you can also define these indexes as VARCHAR HASHED, to index strings of unlimited length via hashing. Either way, these indexes are always maintained synchronously, i.e. upon or before the commit of any insert, update, or delete operation on the collection of XML documents.
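
For reference, the hashed variant mentioned above differs only in the index definition; the syntax is the same except that VARCHAR HASHED replaces the length-limited VARCHAR type:

```sql
-- Same "index everything" pattern, but values of any length are
-- indexed via a hash of the string. Equality predicates can use this
-- index; range predicates (<, >) cannot, since hashing destroys order.
create index elemHashIdx on customer(xmldoc) generate keys using
xmlpattern '//text()' as sql varchar hashed;
```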

Another option for structure-agnostic XML indexing is the DB2 Full-Text search feature, which I wrote about in February.

Structure-agnostic indexes that “index everything” are somewhat more expensive than selective path-specific indexes in terms of space and insert/update/delete performance. But, that’s the price for complete flexibility and for supporting unknown queries over unknown documents.

For more information on XML indexing in DB2, see the article on index exploitation or Chapter 13 in the DB2 pureXML Cookbook.



In my previous post I promised to write more about the 10TB XML data warehouse benchmark that IBM and Intel performed earlier this year.

Before I get into the details of that benchmarking exercise, let me talk about how and why we got there.

The first wave of XML database applications was (and continues to be) operational systems, often performing message-based transaction processing with XML as the message format. As such operational XML applications were adopted and grew, IBM and Intel collaborated to verify that current hardware (such as Intel servers) and software (such as DB2) could meet the demands of XML transaction processing systems. This started with moderate 50GB and 100GB XML benchmarks in 2006 and led all the way to the industry’s first 1TB XML benchmark in late 2008.

As companies accumulate large amounts of XML data in their operational systems, they realize that they are sitting on a goldmine of information and are eager to use it for reporting and business intelligence purposes. Accustomed to the benefits of mature relational data warehouses, many companies now require and expect the same capabilities for XML data. They want to run complex analytical queries over XML. Therefore we wanted to establish performance and scalability proof points to show that this is indeed possible.

Benchmark goals and methodology

The goal of the 10TB benchmark was to demonstrate linear scalability when an XML data warehouse grows in size. Many data warehouses grow over time as new data keeps being added to the existing data. Running analytical queries over larger amounts of historical data can provide a more accurate understanding of the business and allow more accurate predictions for the future.

The value of linear scalability is the following: you can increase the data volume in your warehouse, add computing resources (CPUs, disks, memory) proportionally to the data, and hence keep the performance of your workload constant. For example, if a query analyzes 2x as much data and is given 2x as much resources, then the elapsed time should be the same.

An equivalent way to think about linear scalability is that you keep the data volume constant and e.g. double the hardware resources to reduce the response times by half.

In this benchmark we ran a decision support workload (consisting of 16 complex analytical SQL/XML queries) on two databases:

(A) 3.33TB of raw XML data and using 1/3 of the available computing resources
(B) 10TB of raw XML data and using all of the computing resources

We expect to see the same performance (response times, throughput) on both databases. On the 10TB database each query processes 3x as much data as on the 3.33TB database. But, with 3x as many resources, this is accomplished in the same amount of time as on the 3.33TB database. That’s linear scalability.

The 10TB benchmark configuration

DB2 9.7 Fixpack 1, using pureXML and compression, was run on a cluster of three Intel Nehalem EX Servers (Xeon 7500). Each of the servers had four 8-core CPUs and 128GB of memory. The operating system was Linux RHEL 5.4. The storage subsystem was an IBM DS8700 system, housing 48 RAID5 arrays of 8 disks each. The DB2 database partitioning feature was used, running 48 database partitions on this cluster, i.e. 16 on each of the three servers.

The 3.33TB database used only one of the three servers in the cluster, 16 database partitions, and only 16 of the 48 RAID5 arrays.

Data Volume and Workload

We used the XML data generator of the TPoX 2.0 benchmark.

Database A: 3.33 TB of raw data, about 1.83 Billion XML documents:

  • 1,666,500,000 orders
  • 166,500,000 customers with their accounts
  • 20,833 securities

Database B: 10TB of raw data, about 5.5 Billion XML documents:

  • 5,000,000,000 orders
  • 500,000,000 customers with their accounts
  • 20,833 securities

Each of the two databases contained three partitioned tables, one for order documents, one for customer-accounts, and one for securities. About 20 XML indexes were defined to support the decision support workload. The 16 SQL/XML queries included full table scans, grouping and aggregation, OLAP functions, joins across two or all three XML tables, and various combinations of XML predicates. The same 16 queries were run on both databases. The selectivity of each query (in terms of percentage of XML documents touched) is the same in both databases.
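
The benchmark DDL itself is not reproduced in this post, but a partitioned XML table along these lines can be sketched as follows (the table name, columns, distribution key, and index path are assumptions for illustration, not the actual benchmark schema):

```sql
-- Hypothetical sketch of one of the three partitioned XML tables.
-- DISTRIBUTE BY HASH spreads rows across the 48 database partitions;
-- the XML documents themselves live in the xmldoc column.
create table orders (
  order_id  bigint not null,
  xmldoc    xml
)
distribute by hash (order_id);

-- One of the ~20 XML indexes supporting the workload (path is made up):
create index ordDateIdx on orders(xmldoc) generate keys using
xmlpattern '/Order/Date' as sql date;
```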

Results

The following chart (click it!) compares the response times of the 16 queries in the 3.33TB database (yellow bars) and the 10TB database (red bars). For each query we can see that the response time is approximately the same on both databases, which confirms the linear scalability that is so important in data warehousing.

I have purposefully omitted absolute numbers from the vertical axis in this chart, because this result and this benchmark were not about maximizing absolute performance. It was about scalability, i.e. relative performance between two different scale factors and configurations. For example, some queries could have been even faster (in both databases) if we had used twice as many DB2 database partitions per server.

A summary article on this and other XML database benchmarks can be found in the IBM Data Management Magazine.