Storing XML data in a relational database brings reliability, manageability, and other RDBMS strengths. But with clumsy storage methods (like shredding and CLOBs), application performance was once an open question. Two years after IBM introduced pureXML in DB2 9, that question has been answered.
When IBM DB2 9 debuted in 2006, its pureXML technology was widely recognized as a tour de force. But acceptance of the idea of a hybrid relational/XML DBMS wasn't necessarily a slam-dunk. The plusses were obvious: Integrating an XML engine with an established RDBMS would make the XML portion easier to incorporate into ongoing data operations and would enable application of many of the data management features of the relational system, such as security and archiving, to the XML data. But how would it perform? Considering the past performance of object-relational databases, hybrid systems in general have a mixed record.
Two years later, pureXML is having a performance impact not just in theory but in the real world.
You Want What From XML?
During the five years of development that went into pureXML and DB2 9, IBM took a long hard look at the nature of XML — its obvious flexibility and equally obvious verbosity — to find areas that needed significant improvement. IBM settled on a couple of performance factors: the storage model and query optimization.
XML data is represented by a hierarchical tree structure and typically has multiple layers and many, many nodes, making it a difficult format to optimize or index. It also cries out for some form of compression. Then there are XML queries, which can become extraordinarily complex very quickly.
One common way to store XML data is as a character large object (CLOB), which, like its cousin the binary large object (BLOB), is usually stored whole and with minimal preprocessing for indexing or query optimization. Another approach is to "shred" the XML into columns of relational data; this approach requires elaborate parsing techniques and typically uses a lot of storage space.
DB2 pureXML stores the data in a pre-parsed hierarchical format that reflects the underlying nested structure of the data. The format allows for sophisticated indexing, and compresses the data going to physical storage. When a query is executed, pureXML converts XQuery and SQL/XML queries into a unified form that is optimized in several ways (cross-language, query rewrite, optimal index, and cost-based). XML compression, query optimization, and blending of relational and XML handling (enhanced in DB2 9.5 for insert and update performance) result in improved database processing — 10X to 20X faster than similar operations in DB2 V8 or other DBMSs for that matter, according to IBM testing and real world examples (see the sidebar, "The Sweet Smell of Success").
A More Suitable Model for XML
The performance characteristics of pureXML are also a function of suitability. Key point: XML data doesn't replace relational data. It's hard to beat a relational database for standard financials, but an RDBMS has a tough time with published material, such as whole books, magazines, or journals. It also doesn't do well with applications where the data can be complex and hierarchical and still contain large amounts of unstructured information. Life sciences provide many examples, such as the current effort to move genomic data from various proprietary formats into XML. In some cases, the XML format provides optimized pathways to data that are very difficult to store and extract in a traditional RDBMS, which gives an XML database an obvious performance advantage from the start.
There's also the nontrivial effect of an XML DBMS talking directly to XML applications, through Web services or whatever method. Because XML is the "lingua franca" of the Internet, its use in inter-application communication is ubiquitous, and it makes sense to keep the whole chain (application — message — database) in XML format.
There are a lot of ways to tout the technical importance of IBM's pureXML (68 new patents!), but the real credibility of the IBM relational/XML model is how it performs in mainstream enterprise applications.
Real World XML Performance
Xicheng District Health Bureau of Beijing serves about one million people, a customer list that results in a large database filled with sensitive data. Health system requirements, combining whole documents, such as doctor's reports, with data points from numerous sources, create a situation that lends itself to the flexibility of XML data. As Zhu Shu Hong, CIO of the Bureau points out, "DB2 9 with pureXML is fit for not only the core database storage model, but also the raw data collection forms and data exchange." It's the complexity of representing an individual's health records that lets XML shine, says Mr. Zhu; XML provides a flexible query capability and quick response for real-time on-demand information, which is critical during some medical procedures. The Xicheng District Health Bureau paired its large DB2 9 database with IBM service-oriented architecture (SOA) technology, in particular, implementing a "service bus" for Web services that expose data for applications.
Next Page >>
Comments? Questions?
Give us your feedback or ask a question of the author.