XML in Oracle 11g

September 30, 2008

Here are some observations about the XML storage capabilities in Oracle 11g. These observations were deduced from public sources. Please be aware that I work for IBM who competes directly with Oracle in this regard.

Oracle provides three options for storing XML data:

  • Unstructured, which is essentially Character Large OBject (CLOB) storage. Like any CLOB implementation, you will need to retrieve and parse the XML data before executing XPath and XQuery statements, which has a query-time performance impact.
  • XML-Object-Relational, which shreds the XML data into object-relational tables. There are multiple storage options to choose from (which I will cover in a moment). Oracle recommends this option for data-centric use cases. Of course, with this option, retrieving the original XML data will incur a performance hit as the data is re-composed.
  • XML-Binary, which stores the XML data as a token stream in a Binary Large OBject (BLOB). Oracle recommends this option for document-centric use cases.

If you use XML-Object-Relational, which is also known as Structured or Schema-based storage, you have five different options to choose from for storing repeating elements:

  • Store in CLOBs
  • Store in VARRAY as LOBs
  • Store in VARRAY as nested tables
  • Store in VARRAY as XMLType pointers to BLOBs
  • Store in VARRAY as XMLType pointers to nested tables

If your head is spinning with the different options, I don’t blame you.

Oracle provides a special index called XMLindex, which indexes the internal structure of XML data. Actually, this index is a table. There is an interesting post on the Oracle Discussion Forum. In this post, an Oracle user describes their experiments with XBRL data in Oracle 10g and 11g. In their experiments, case 3 runs on Oracle 11g without XMLIndex in 9.6 seconds, while case 4 runs with XMLIndex and takes 574 seconds. So, in this user’s experience, running with the XMLIndex results in approximately a 50x slowdown.

Note that if you want to create an index for numeric or date values in Oracle 11g, you must use stored procedures, which create separate indexes that you must asynchronously maintain.

Oracle now has an XML update function. It is not compliant with the W3C XQuery Update Facility. Although, Oracle did recently announce the XQuilla XQuery engine, which they claim “will (maybe) free the way for the W3C XQuery Update Facility 1.0 candidate specification / implementation, which is embedded in the XQuilla XQuery engine, for other Oracle products”. So it is possible that a future release of Oracle may support the W3C XQuery Update Facility.

The compendium of XML storage options in Oracle 11g are, in my opinion, essentially based upon existing relational infrastructure. This is very different than the approach that IBM has taken, where they have truly built native support for XML data into their database from the ground up.

Please note that these are solely my personal opinions and not necessarily those of my employer IBM.

Adam Gartenberg is a colleague of mine at IBM. He is a big advocate of online communities, and wants to ensourage online communities to be as active as possible. After all, the more we participate in online communities, the greater the benefit for everyone.

As such, Adam is dubbing October 1st to be Online Community Action Day. The idea is that, on this day, you will make your best effort to contribute in some way to an online community… any online community. For example, you could:

  • If you agree or disagree with a blog post, add a comment.
  • If there is a blog post that made a difference in your job, leave a comment saying “thank you.”
  • If you have a handy tip worth sharing, add a post to a discussion forum.
  • Sign up for an online community like ChannelDB2.

By being active, you will help make online communities better!

One of the great benefits of XML as a format for persisting data is that it is relatively easy to update the schema. XML was designed to be inherently flexible in nature. Adding, removing, and updating elements or attributes in the data and schema are relatively straighforward operations.

In the past, when persisting XML data, many organizations mapped XML data into relational tables and ended up pulling their hair out when later updating schemas. Increasingly, organizations are choosing to store this data natively in XML format.

Of course, you could store XML data in a Character Large OBject (CLOB) or a Binary Large OBject (BLOB) in a relational table and still enjoy the benefits of easier schema management. However, the overhead involved in retrieving data from a CLOB or a BLOB often makes such situations unworkable. For instance, to work with the XML data, you need to retrieve the entire CLOB or BLOB, and then you need to process the contents of the CLOB or BLOB with an XML parser, before being ready to work with the XML data. This is an awkward and inefficient architecture, incurring a significant overhead for each data read operation. Many organizations are turning to data management solutions like DB2 pureXML that allow them to natively store and process their XML data.

Here at IBM, we have come across several instances where organizations have embraced the flexibility of XML schemas and chosen to persist their data in a native XML format:

  • One of the world’s leading telecommunications companies recently overhauled its order entry systems. Designing an order entry system that caters for many thousands of products and services in a variety of geographies is a tremendous challenge, especially when designing a schema that can cater for current and future offerings. It was for this reason that the telecommunications giant decided to store their order data in XML format. Thanks to the flexible nature of XML schemas, they can cater for their existing complex needs, while minimizing the impact of later introducing innovative new products and services.
  • Taxation authorities are faced with taxation rules and taxation forms that change on a yearly basis. Therefore, their data schemas must change each year. Some years, the changes consist of relatively straightforward field additions. However, some years the changes consist of larger reorganizations, which are much more difficult to manage. For taxation authorities, being able to easily manage schema changes from year-to-year is a compelling reason to move to persisting data in an XML format. IBM is currently working with multiple taxation authorities around the world to improve their tax collection systems. If you want to read about one such taxation authority’s experiences, check out New York State tax agency uses pureXML to simplify filing of more than 2 million returns already.
  • A Japanese software company developed a system that stores and manages diverse information for education establishments. Because each educational instituation has different storage needs, and because those storage needs evolve over time, this company discovered that using the relational model was both cumbersome and expensive. You can read more about this company and see some great quotes about their switch from relational to XML at Software Research Associates Tohoku chooses IBM DB2 9 with pureXML for UniVision+EV system.
  • A leading Chinese energy and utilities corporation developed a flexible data analysis and reporting system that could handle data from extremely diverse facilities across its more than 100 constituent companies. Their approach is to store common information in relational format and to store information from diverse sources–that have different schemas–in XML format. By using such a hybrid relational/XML approach, they are able to take advantage of flexible XML schemas to easily accommodate additions, updates, and changes. For more information about this situation, see China Huadian Corporation chooses IBM DB2 9 with pureXML to integrate and analyze corporate property information.

As you can see, all these companies take advantage of the flexible nature of XML schemas and native XML storage to make systems easier to manage and update, often implementing solutions that were previously difficult or impossible to do.

XML in SQL Server 2008

September 24, 2008

This is a follow-up to my look at XML in SQL Server 2005.

With the recent release of SQL Server 2008, Microsoft made updates to the XML support in SQL Server. In particular, they made improvements to the XML Schema Definitions (XSD) that they support, they added support for the let clause in XQuery FLWOR expressions, and they added support for certain XML data manipulation insertions.

It is good to see support for the XQuery let clause, although I actually removed a reference to their lack of support for the let clause from my previous post because I felt that it was not a big deal. The expanded XML schema and XML manipulation support will prove useful for users. However, as far as I know (and I am relying on the Microsoft documentation here), the primary issues remain:

  • SQL Server does not allow multiple versions of a schema in the same schema collection.
  • SQL Server does not support indexing individual elements and attributes.
  • The SQL/XML implementation includes non-standard extensions.
  • SQL Server does not support standalone XQuery.

And you still need to carefully evaluate the performance of the following for your environment:

  • Queries based on path expressions.
  • Queries against large XML documents.
  • Creating and updating indexes.

The following sources were consulted when compiling this post:

  • What’s New for XML in SQL Server 2008, SQL Server Technical Article, August 2008.

XML in SQL Server 2005

September 19, 2008

I’d like to make one thing perfectly clear before I begin this post… the following are solely my personal opinions and not necessarily those of my employer IBM.

In the past, I have mentioned that each vendor has a very different implementation of “native XML storage”. Here are some observations about the XML storage capabilities in Microsoft SQL Server 2005. Of course, Microsoft released SQL Server 2008 in August. I will review the new release in a later post.

You should be aware that I have no inside information about SQL Server and that all of these observations were deduced from public sources. You should also be aware that I work for IBM who competes directly with Microsoft in this regard. But, nonetheless, I think you will find the following information useful.

SQL Server parses XML data upon insertion and transforms it into a binary token string, which is then stored in a BLOB. This is parsed storage, but it is a stream rather than in tree format. The stream contains information about the hierarchical relationships between the the elements.

SQL Server provides a primary and secondary index to optimize query performance. The “primary XML index” in SQL Server is actually a table. You could conceivably consider this implementation to be a clever form of shredding. The primary index is a table on which the secondary indexes are defined.

When it comes to issuing queries, SQL Server supports the two industry-standard query languages for XML data: SQL/XML and XQuery. However, you should be aware that:

  • The SQL Server implementation of SQL/XML includes non-standard extensions.
  • SQL Server does not support standalone XQuery. And, in fact, SQL Server translates XQuery commands into SQL before execution.

Also, if you will have queries based on path expressions or queries against large XML documents, you should very carefully evaluate SQL Server query performance.

When it comes to support for XML schemas, SQL Server does not allow multiple versions of a schema in the same schema collection. Also, SQL Server does not allow you to alter a schema. You should be aware that these schema flexibility and schema evolution restrictions can create headaches as you work with real-world XML schemas.

Finally a few more words about indexing. Did you know that SQL Server does not support indexing individual elements and attributes? It always indexes all elements and all attributes. This leads me to recommend that you carefully evaluate the performance and logging overhead of creating and updating indexes in SQL Server for your particular use case.

The following sources were consulted when compiling this post:

  • XML Best Practices for Microsoft SQL Server, Microsoft Software Developer Network paper, April 2004.
  • XML Indexes in SQL Server 2005, Microsoft Software Developer Network paper, August 2005
  • XML Support in Microsoft SQL Server 2005, Microsoft Software Developer Network paper, December 2005
  • Documentation for MS SQL Server 2005, beta 2

In my previous post, I mentioned the great sessions at the upcoming Information on Demand conference. I thought you might be interested in soem details. Here is a small selection of the session titles and abstracts, as they appear on the conference Web site, and in no particular order:

1712 Introduction to XML and DB2 pureXML for Dummies

    Is XML in your future? Come to this friendly and informative session and learn the basics of XML, and learn why you need to know about IBM DB2® pureXML™. DB2 pureXML makes your XML projects easier and improves your application performance. All attendees will receive a free “DB2 pureXML for Dummies” booklet.

1438 Learn how Verizon Streamlined their Order System

    Verizon Business delivers advanced IP, data, voice and wireless solutions in 75 countries. Processing and tracking of orders is critical. Until recently, Verizon had multiple order entry systems, with no common place to store order information. Learn how Verizon created a single change-resistant and cost-effective order management system for all orders, regardless of order entry. Also learn how they designed and implemented the system to allow new products and services to be added on the fly, thus improving business agility and reducing time to market for new offerings.

1659 Implementing an Effective Electronic Government Solution – NY State Tax

    Hear how NY State Department of Taxation and Finance implemented a streamlined process for electronic submission of taxes. Learn about the agency’s conversion to electronic forms for tax submission and the use of IBM DB2® pureXML to efficiently store and manage the electronic form data. Find out about the realized benefits from the first year of processing and hear about the lessons learned and what future enhancements are being considered. Finally, hear a discussion of how what was learned can be applied to other government agencies.

1660 Using XML for Effective Corss-agency Shared Services in Public Safety

    Learn how effective inter-agency sharing can be accomplished through XML. The session will introduce the value of IBM DB2® pureXML for leveraging shared information in public security. It will review how Shandong Public Security takes advantage of DB2 pureXML technology to help business users (policemen) access this data. It will also demonstrate how they combine the advantage of both DB2 pureXML and IBM Info 2.0 to help business users discover direct and indirect links in data, and deliver the valuable intelligence to policemen’s daily activities.

1661 Streamline Government Processing Through Electronic Forms and DB2 pureXML

    Forms are everywhere; learn how converting to electronic forms can improve citizen access, streamline processes and provide better record-keeping. Hear about the value of IBM DB2® pureXML for storing, managing and analyzing the data created by electronic forms tools such as IBM Lotus® Forms. See a demo of a simple eForms application implemented with Lotus Forms connected to DB2 pureXML and the resulting simplicity of the database and query infrastructure required.

1677 Improving Health Care in China with a DB2 pureXML EMR Solution

    This session will articulate the value of IBM DB2® pureXML in an electronic medical record (EMR) solution, and how it improves health care in China, with a population of 1.4 billion. The IBM specialist, joined by an EMR specialist, will share their experiences with both DB2 pureXML and the medical specialist Cache Database. They’ll compare the two technologies, explain why pureXML was chosen, and describe how they migrated their Cache-based solution to DB2 pureXML. The speakers will discuss the business value of their new EMR solution and its business implications for the emerging health-care markets.

1197 DB2 pureXML Production Experiences at UCLA

    Last year we worked with IBM Watson Lab to prototype the IBM DB2® V9 pureXML™ for our Patient Oriented Document System (PODS), the UCLA enterprise-wide patient record repository. In 2007 IOD we shared the prototyping approaches and the potential benefits of pureXML for managing the metadata. In this presentation we will share our production experiences using pureXML to manage metadata for the new PODS4 as well as the actual benefits realized for the upgrade from XML Extender to pureXML. To appreciate the significance of pureXML’s impact on the PODS repository, we will briefly show PODS as a critical component of the UCLA Document Management System in the overall extended service oriented architecture (xSOA) that’s in progress.

1622 Top 10 Best Practices for DB2 pureXML

    IBM DB2® software offers IBM pureXML™ support with efficient XML storage, XML indexing, and query languages SQL/XML and XQuery. These are powerful but novel concepts in DB2. Although many existing guidelines for managing data in DB2 are also valid for XML, additional considerations can help you ensure that XML data is managed easily and efficiently. Based on experience with DB2 pureXML customer projects, this session presents the top 10 best practices for flexible and high-performance XML management with DB2 pureXML.

1678 DB2 pureXML Customers – Trends and Successes

    Come hear a series of real customer success stories involving IBM DB2® pureXML™ technology. Industries represented include financial, health, retail, telecom and government. For each customer, we will discuss its problem, the business value of DB2 pureXML and the technical solution.

Don’t forget there are also meet-the-expert sessions, birds-of-a-feather sessions, and demo-til-you-drop sessions.

Do you want to hear native XML database users speak about their experiences?  Do you want to ask them questions?  If so, there is an event that may interest you.

IBM’s Information on Demand conference is being held during the last week of October in Las Vegas.  This conference covers all aspects of information management, which of course is much broader than the management of XML data.

There are 29 sessions dedicated to native XML data management, including:

  • 8 sessions where DB2 pureXML customers or consultants talk about their experiences
  • 8 technical sessions, where you learn about the technical details of DB2 pureXML
  • 8 hands-on lab sessions, where you get to use the software to complete mini projects 

And you will be particularly interested in:

  • 2 Birds-of-a-Feather sessions, where DB2 pureXML users get together to talk about their experiences
  • Meet-the-Expert sessions, where you schedule 1×1 time with a DB2 pureXML expert to ask them whatever questions you want

And this is only the native XML data management part of the conference.  The conference has much more to offer, including great opportunities to learn about all aspects of information management, to network with fellow professionals, and of course to be entertained. 

One last thing… if you do go, keep an eye out for a free “DB2 pureXML for Dummies” booklet that will be handed out at the conference.  It will be handed out at the DB2 booth in the Expo floor and at my speaking session.

A quick note to let you know about a couple of online communities for DB2 pureXML:

  • LinkedIn Group
    Are you a member of LinkedIn?  If you are, there is a LinkedIn group dedicated to DB2 pureXML.  This is an great opportunity to network with other users from around the world on the IBM DB2 pureXML Network.
  • User Forum
    Do you have technical questions about DB2 pureXML?  If you do, then there is a new forum on IBM developerWorks.  Make sure to go to the DB2 pureXML Forum to ask your questions, post your tips, and read answers to other people’s questions.