Schema Evolution

August 22, 2008

Yesterday, I briefly discussed schema flexibility.  Today, I’m going to talk briefly about schema evolution.  Schema evolution refers to the ability to easily move to a new version of an XML schema.  When evaluating vendors, make sure the database management system you choose meets your current and future schema evolution needs.

Consider a situation where an organization stores messages that adhere to one of the major XML standards, like HL7 or FpML. Industry standards are evolving, with new versions of those standards being made available over time. Moving to a new version of an XML standard usually means also moving to a new–and hopefully compatible–XML schema.

If the new version of the schema is compatible with the old version of the schema, you want to make sure that your database management system moves to the new schema with a minimal amount of disruption.  At the very least, they should support the ability to move to this new XML schema without:

  • Needing to re-validate all of your existing XML documents [… which would be a pain!)
  • Needing to change your existing XML documents [… which would be an even more painful experience]

And if the new version of the schema is not compatible with the old version of the schema, you want to make sure that your database management system supports schema flexibility to handle the situation.  This situation is sometimes referred to as uncompatible schema evolution.

Again, this is a topic that is often overlooked when evaluating XML storage needs, and one that has proven to be quite troublesome when overlooked.

Schema Flexibility

August 21, 2008

Often people overlook schema flexibility when evaluating their XML storage needs.  Before getting locked into an XML database decision, you should understand your XML schema needs and how the different XML databases support those needs.  After all, you don’t want to have a nasty surprise when implementing your application. 

An XML schema defines the structure of XML data. For example, an XML schema describes the actual XML elements and attributes that can appear in XML data, their data types, as well as describing where they can appear and how often they can appear. Often when working with XML data, you need to ensure that the XML data adheres to an XML schema. This is frequently referred to as checking that XML data is valid, or validating the XML data. Validating XML data is a good idea because it ensures that you will not encounter issues when working with the data.

Schema flexibility refers to the ability to cater for a wide range of XML schema needs; needs that go beyond simply associating a single XML schema with a database column.

Consider the situation encountered by tax authorities.  Although this scenario applies to any environment where data formats are subject to change.

An increasing amount of tax information is being filed using online forms.  Online tax forms are a more efficient and cost effective way to gather tax information.  XML is a natural choice as the format for such information because of its suitability for data exchange with other tax authorities.

Each year, there is the likelihood of minor changes to the tax forms. If the tax forms change, then it is likely that the XML schema will also change.  In such situations, simply using the new schema to validate all XML data in that column is not sufficient. Instead, you need to validate each record against the schema that was in existence when that record was created. In other words, you need the flexibility of having the cells in a database column validate against different schemas.

Another aspect of schema flexibility is the ability to enter XML data into a database without validating the XML data during the database insert.  You may want to validate the XML data as part of another step in your application.  To cater for such situations, you should ensure that your XML database management system does not force you to validate the data upon entry.  You would be surprised how often this situation arises.

These are only a couple of considerations to keep in mind.  The main point is to make sure to understand your XML validation needs, and to make sure that the XML database you choose meets those needs.

Tomorrow I’ll blog about XML schema evolution…

XML-based standards have emerged in many industries.  For instance, there is ACORD in insurance, FIXML in financial services, NIEM in government, and so on.

Are you evaluating options for persisting standards-compliant XML data?  If so, you should know about a great resource.  As you know, you can freely download IBM DB2, which is a data server for both relational and XML data.  Well, IBM has also made available working demos for a number of XML standards, including ACORD, FIXML, FpML, MISMO, NIEM, OTA, TAX1120, TWIST, UNIFI, and more.  The demos show end-to-end XML data exchange, together with data retrieval via RESTful Web services, Atom feeds, and XForms.

You can see the demos for yourself at:
http://services.alphaworks.ibm.com/DB2pureXMLDemo/Demo.html

And you can download sample data and demo scripts at:
http://www.alphaworks.ibm.com/tech/purexml/download

If you are looking for an introduction to the topics of storing and retrieving XML data, this free on-demand webcast is hopefully a good place to start:

On demand Webcast: A primer for storing and retrieving XML data

I have mentioned quite a few times how native XML storage makes life easier for database administrators and application developers. I thought I’d take a few moments to go into a little more detail.

By storing the XML directly in the database, you avoid the need to shred the XML into a relational schema. Shredding is also know as decomposition. Before shredding, you need to design a relational schema for the data. This can quite often be a laborious process. Sometimes this can be automated with off-the-shelf tools. However, you should keep in mind that the resulting tables will almost certainly need to be carefully examined, and possibly optimized. After designing the relational schema, you then need to set up the environment that actually maps the XML to the relational schema. And finally, you will need to develop and test code for using the data, which is typically quite complex because of the need for unwieldy SQL statements with multiple JOIN statements.

It sounds like there is a significant overhead for shredding XML data into a relational schema. But this is only a part of the story. You also need to consider what happens when the XML schema changes. And XML schema changes are an unfortunate reality for many of us. When the schema changes, it can play havoc with your relational schema, your mapping process, and the code for your applications that use the data. It is in dealing with these updates that many organizations are realizing the greatest gains from adopting native XML storage.

If you want a real-world example of the potential impact of dealing with relational tables for XML data, consider the FpML industry standard. With native XML storage, dealing with FpML messages is as straightforward as storing and retrieving the message. However, if you use shredding to store FpML messages, in some implementations you need to work with more than 475 separate database tables.

Some organizations have measured the impact of native XML storage on the workloads of database administrators and application developers. Here are some of the findings that I am aware of:

  • A large European financial services company use DB2 pureXML (pureXML is IBM DB2’s native XML storage feature) to reduce the number of lines of code for writing to and reading from their database by 65%. This reduces the amount of code they must develop, test, and maintain and allows their developers to be more productive.
  • This same company reduced the amount of time needed to develop services by 75%, thus allowing them to take on additional development projects and improve the services they provide.
  • They also determined that schema changes are now a breeze. In the past with shredding, adding a field took a day of work (development and test), and a week of real time because of the processes involved with database changes. Now all that is needed is to change the pointer to the schema in a DB2 XML configuration file, which takes 5 minutes.
  • A leading chemical company determined that by using the IBM database tools and pureXML technology they are saving 50-75% on development costs, while improving developer productivity by 25-50%.
  • And finally, a major American financial services company studied their environment and determined that by using the native XML storage in DB2 pureXML they reduce their resource requirements by 30%.

In any business, these are improvements that cannot be ignored.

XML is unchallenged as a data exchange format. Most industries have developed standards that use XML as the data format. The list of standards is seemingly endless: ACORD, FIXML, FpML, MISMO, HL7, NIEM, OTA, SVG, TAX1120, UBL, XBRL, and so on.

Your immediate inclination is likely to store such information in its native XML format. After all, when the information is exchanged, it is expressed in XML format. Therefore it seems natural that it would also be stored in a native XML format. Often this is the case. However, it is not always the case. Sometimes it makes more sense to store the information in relational form. This blog post discusses when to use the two approaches.

The key question that you need to ask yourself is whether the information is best expressed in a relational format or in an XML format. Because XML is so popular as a platform-neutral and transparent data exchange format, data is often sent as XML even when XML is not the ideal format for the information. For instance, information is sometimes force-fitted into a hierarchical data structure even when it has no hierarchical characteristics. If this is the case, examine the data to see if it should in fact be stored in a relational format. The general rule is that you should consider a relational structure when the data is better described in tabular format. This is the characteristic that most readily defines the relational schema.

There are also situations when you may be compelled to use a relational schema. For example, if the application that consume this information work only with relational data, then you need to take this into consideration.

You should store information in XML format when:

  • The information is best represented in a hierarchal format.

    When you store hierarchical information in a relational database, you often require a large number of tables. Creating and maintaining such systems is complex and costly. Multiple joins result in complicated SQL statements. In such cases, you should instead use native XML storage.

  • The schema is subject to change.

    Because relational tables are highly-structured, changing the schema is both difficult and costly. Updating a schema for natively stored XML, on the other hand, is a relatively painless exercise. If you expect the schema to change, you should use native XML storage.

  • There are many null values in the relational tables.

    Relational database tables set aside storage for each column. If your information does not populate all of these columns, you can have a large number of null values in your database. This, in turn, affects both storage requirements and performance. Again, you should consider native XML storage in such situations.