If you perform data-oriented queries on XML document, then your queries typically contain predicates on dates, numbers, names, zip codes, and other short values. For such queries you define regular XML indexes to speed up the predicate evaluation. You can read about that in the article Exploit XML indexes for XML query performance in DB2 or in Chapter 13 of the DB2 pureXML Cookbook.

In contrast, there are also situations where you may want to perform text-oriented XML queries. In this case your queries may have to check whether a certain XML element, or a whole document, contains a certain search term. You can also use wildcards to perform substring or prefix searches.

Below you find a simple and commented example of how to get started with DB2 Text Search for XML Data. DB2 Text Search is powered by the Omnifind search engine, which is based on Lucene. I have developed the examples below on DB2 Express-C 9.7 for Windows, which is freely available.

First we create a database and a table with a couple of XML documents:

-- At the OS prompt, issue the following commands to create and connect to a database:
db2 create database mytest
db2 connect to mytest

-- Let's tart the DB2 Command Line Processor to
--  create and populate a sample table with two XML documents:
db2 -t

-- Issue the "create table" and "insert" SQL statements:
create table order(order_id integer not null primary key, doc XML);

insert into order values (1000,
'<Order OrderDate="2006-03-01" Status="Shipped">
   <item price="9.99">
      <partid>SNS9471B001</partid>
      <name>Snow Shovel, Basic 22 inch</name>
      <quantity>3</quantity>
   </item>
   <ShippingInstructions>Rush Delivery Requested</ShippingInstructions>
</Order>');

insert into order values (2000,
'<Order OrderDate="2006-04-15" Status="Unshipped">
   <item price="19.99">
      <partid>SSD9483B007</partid>
      <name>Snow Shovel, Deluxe 24 inch</name>
      <quantity>5</quantity>
   </item>
   <item price = "3.49">
      <partid>ICR3117F013</partid>
      <name>Ice Scraper, Windshield 4 inch</name>
      <quantity>5</quantity>
   </item>
</Order>');

Now we need to start the DB2 text search service and enable our test database for text search.

--At the OS command line, the following command starts the text search service:
db2ts start for text

-- For convenience, set an enviroment variable to indicate the
-- database to which text search administration commands are applied.
-- Depending on your OS, use the SET or the EXPORT command:
SET DB2DBDFT=mytest

-- The next step is to enable the database 'mytest" for text search,
-- requires the DB2 instance owner to have DBADM authority. The way
-- my DB2 instance is installed on Windows, I need to grant admin
-- rights to "system" and "localsystem":
db2 grant dbadm on database to system
db2 grant dbadm on database to localsystem

-- Now enable our database "mytest" for text search. This creates tables
-- and triggers for housekeeping information.
db2ts enable database for text

-- Create a text index on the XML column of the order table. The
-- CREATE command only defines the text index but does not populate
-- it. Therefore we also issue the UPDATE command:
db2ts "CREATE INDEX ordertextindex FOR TEXT ON order(doc)"
db2ts "UPDATE INDEX ordertextindex FOR TEXT"

-- You can also configure peridic automatic updates of the text
-- index (omitted here).

Now we are ready to try some text search queries, using SQL/XML or XQuery. Let’s start with SQL/XML. The specific syntax of the search expressions within the contains function allows DB2 to use the text index. Compare each query and its result to the two sample XML documents that we inserted above.

-- (1) Identify all orders that contain special shipping instructions.
-- This a structural search, because it looks for the existence of
-- the element <ShippingInstructions> regardless of its value.

SELECT order_id,
       xmlcast(xmlquery('$DOC/Order/@OrderDate') as Date) as odate
FROM order
WHERE contains(doc, '@xpath:''/Order/ShippingInstructions''')=1;

ORDER_ID    ODATE
----------- ----------
       1000 03/01/2006

  1 record(s) selected.


-- (2) Identify all orders for any kind of shovels, i.e. orders where
-- any item name contains the word "shovel".

SELECT order_id, xmlcast(xmlquery('$DOC/Order/@OrderDate') as Date) as odate
FROM order
WHERE contains(doc, '@xpath:''/Order/item/name[. contains("Shovel")]''')=1;

ORDER_ID    ODATE
----------- ----------
       1000 03/01/2006
       2000 04/15/2006

  2 record(s) selected.


-- Use a wildcard (*) to identify all orders that contain an item
-- whose "partid" starts with "SNS9":

SELECT order_id,
       xmlcast(xmlquery('$DOC/Order/@OrderDate') as Date) as odate
FROM order
WHERE contains(doc, '@xpath:''/Order/item/partid[. contains("SNS9*")]''')=1;

ORDER_ID    ODATE
----------- ----------
       1000 03/01/2006

  1 record(s) selected.


-- Identify all orders that contain an item whose "partid"
-- ends in "F013":

SELECT order_id,
       xmlcast(xmlquery('$DOC/Order/@OrderDate') as Date) as odate
FROM order
WHERE contains(doc, '@xpath:''/Order/item/partid[. contains("*F013")]''')=1;

ORDER_ID    ODATE
----------- ----------
       2000 04/15/2006

  1 record(s) selected.

-- On XML attributes, the text index also supports numeric searches, e.g:
-- Find all orders that have an item with a price greater than 10:

SELECT order_id,
       xmlcast(xmlquery('$DOC/Order/@OrderDate') as Date) as odate
FROM order
WHERE contains(doc, '@xpath:''/Order/item[@price > 10]''')=1;

ORDER_ID    ODATE
----------- ----------
       2000 04/15/2006

  1 record(s) selected.

-- But, if numeric searches are frequent, then you better use a regular
-- XML index of type DOUBLE.

The same kind of search conditions can also be performed in XQuery, using the function db2-fn:xmlcolumn-contains:


-- Return the order documents that include any item whose name
-- contains the word "shovel":
xquery
for $i in db2-fn:xmlcolumn-contains('ORDER.DOC',
                '@xpath:''/Order/item/name[. contains("Shovel")]''' )
return $i

-- Return the individual items whose partid starts with "ICR3". The text
-- search predicate, and hence the text index, is used to quickly
-- identify the orders that contain at least one matching item. In the
-- return clause we then use the XQuery starts-with() function to not
-- return ALL items from the matching orders, but only those items
-- whose partid starts with "ICR3". This is because a single order
-- document can contain matching and non-matching items for our search.
xquery
for $i in db2-fn:xmlcolumn-contains('ORDER.DOC',
                '@xpath:''/Order/item/partid[. contains("ICR3*")]''' )
return $i/Order/item[starts-with(partid,"ICR3")];

DB2 Text Search also provides many advanced text search features such as:

  • wildcard search
  • scoring and ranking of search results
  • fuzzy search
  • weighting or boosting of search terms
  • stemming (search for any stemmed form of a word)
  • proximity search
  • synonym search
  • linguistic search
  • etc.

Check out the following resources:

If you are new to using XML in DB2, or if you have colleagues that are looking for an introduction to DB2 pureXML, make note of the following (free) opportunities.

DB2 pureXML Webinar

On February 24 there will be a live Webinar titled “XML and DB2 pureXML for Beginners“. It starts at 1pm US Eastern Time (5pm GMT) and takes about 1 hour. The presenter is my colleague Bryan Patterson, and I will be also be there to answer questions. At the following URL you can find more information and a link to register:
http://www.ibm.com/developerworks/data/events/imbriefings.html

The actual registration page currently says “Duration 2h30min” but that’s incorrect – it’ll be just about an hour or so.

DB2 pureXML Bootcamps

If you want to learn DB2 pureXML in more detail, including XPath, XQuery, SQL/XML, XML indexes, XML compression, and so on, consider attending one of IBM’s free DB2 pureXML bootcamps. A bootcamp is basically a 2-day (or 3-day) course consisting of presentations and hands-on labs where attendees can gain or deepen their DB2 pureXML skills. It’s for application developers, architects, DBAs, and other interested folks.

For the next weeks and months, DB2 pureXML bootcamps are scheduled in Chicago, Toronto, New York (Manhattan), and a few other cities. The latest schedule and registration information can be found at the following URL:
http://www.ibm.com/developerworks/wikis/display/im/DB2+9.7+XML+Bootcamp

If you’d like to see a bootcamp scheduled in a city near you, please send me an email with suggestions (mnicola@us.ibm.com). No promises, but we’ll see what we can do!

In my previous post I talked about XML in the data warehouse. While using XML in a data warehouse may seem like a very novel idea to some database professionals, it is already a reality for others.

One example can be found at BJC Healthcare, one of the largest non-profit health care organizations in the USA. To give you an idea of the size of their operation: they run 13 hospitals, multiple community health facilities, and they have more than 26,000 employees and a net revenue of $3.2 billion. Their challenges and goals included improving the data collection required for grant applications, identifying suitable patients for medical studies, and improving patient treatment. The IT solution had to protect patient privacy and provide adequate performance and scalability to handle the increasing amounts of data.

Given the diverse and evolving nature of medical data, BJC decided to use XML as the format for patient medical records, lab results, and other clinical information. This decision provides two key benefits to their application. First, XML provides the flexibility that is required to handle variable data and future schema evolution. Second, the use of XML has allowed BJC to design a simple and intuitive database schema with less than 10 tables. Storing the same information in a fully relational database schema would have required over 100 tables and many queries would have to join 20 or more tables, which is complex and can often be inefficient.

At the beginning of the project, BJC evaluated DB2 as well as other databases that support XML data. BJC chose DB2 for a variety of reasons, including ease of use of the pureXML features in DB2 as well as a very short time to get up and running with a prototype.

The DB2 pureXML database at BJC currently holds about 2.5 terabytes worth of medical records in XML format. The browser-based application interface allows users to define complex analytical search requests that are automatically translated into XQuery and SQL/XML statements and submitted to DB2.

Read the full BJC case study for more details.