Tuning Database XQuery Statements (1)

A colleague asked me to have a look at a statement that didn't really perform. So I had a look and afterwards tried to explain how my mind worked while solving the problem at hand. I am posting it here as well, hoping it might help others.

Context!

As somebody once mentioned, “every database is unique”, and mine probably is as well, but I can lay out the general steps. First, it's probably good to know that the SQL/XQuery in question referenced a table with more than 10 million XML documents in a single column.

NOTE

Be aware, it's all about context.

So what is the context for this query:

  • Oracle Enterprise Edition 11.2.0.3.0
  • XMLType column with 10+ million rows, 500+ GB in size
  • XMLType column defined via XMLType Object Relational storage
  • XMLType column defined with an XML Schema that, via a “Super Schema” (XSD) setup, defines multiple different content structures
  • Highly structured XML content (therefore Object Relational storage)
  • Rarely updated XML content (being / containing metadata content about the actual XML documents)

What this means “regarding context” is that, if it were an 11.2.0.1 database, containing a different XML parser, the Optimizer costs and/or XVM costs would be different. Likewise, if it were an 11.2.0.4 database, containing different XML parser optimizations, the Optimizer costs and/or XVM costs would also be different, that is, faster…

NOTE

If (still) applicable, pick the latest possible database version

XQuery database optimization will also differ, depending on the storage in use. In this case it is XMLType Object Relational, but in principle you have the following options (the persistent column storage variants are sketched in DDL after the list)…

  • XMLType – In memory (XVM)
  • XMLType – Basicfile CLOB (deprecated!)
  • XMLType – Securefile Binary XML
  • XMLType – Object Relational
  • CLOB (stupid, yes, but still often used…)
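
Purely as an illustration, here is roughly what the persistent column storage variants look like in DDL. This is a sketch only: the table names and the XML Schema URL are hypothetical, and for the Object Relational variant the XML Schema must already be registered in the database.

--
-- Sketches only (hypothetical table names / schema URL); pick one storage model per column
CREATE TABLE doc_bin (id NUMBER, doc XMLTYPE)
  XMLTYPE COLUMN doc STORE AS SECUREFILE BINARY XML;

CREATE TABLE doc_clob (id NUMBER, doc XMLTYPE)
  XMLTYPE COLUMN doc STORE AS BASICFILE CLOB;            -- deprecated

CREATE TABLE doc_or (id NUMBER, doc XMLTYPE)
  XMLTYPE COLUMN doc STORE AS OBJECT RELATIONAL
    XMLSCHEMA "http://www.somewhere.nl/metadata.xsd" ELEMENT "opslag-metadata";
--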

What also counts is whether the XML content, via its XMLType storage solution, is XML Schema based or XML Schema-less. If an XML Schema is in place, XML validation will occur and, being an extra step in the process, it will also cost extra performance. You might now think, “Okay, I want to be flexible, so no XML Schema validation”, but consider whether you really want this. Imagine a column with 10, maybe 100 million XML documents. How do you check that the XML structure is what you, and the business, would expect it to be… and if it is not, how do you retrieve its contents via XPath in a performance-efficient way…
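
If you do go the XML Schema-less route, one way to verify after the fact that stored documents still have the structure the business expects is an explicit validation pass. A minimal sketch, assuming a (hypothetical) schema URL; on a 10+ million row table this is obviously an expensive operation in itself.

--
-- Count documents that do NOT validate against the expected XML Schema
-- (sketch only; the schema URL is hypothetical)
SELECT COUNT(*) AS invalid_docs
  FROM ic_metadata m
 WHERE XMLIsValid(m.metadata, 'http://www.somewhere.nl/metadata.xsd') = 0;
--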

Regarding performance and choosing the proper XMLType storage for your business case: this has been described by the XML DB Development team in a very good whitepaper.

Using CLOB columns is usually only “acceptable” for storing XML in the use case of maintaining document fidelity. But, just as with database high availability business requirements, it is most of the time not actually 99.99999% and/or not actually needed by the business. If nothing else, people will start to notice that it comes with great expense…

With XML stored that way, updates or content driven query statements will become very performance intensive.

NOTE

Choose the correct XMLType storage / datatype.

  • If you pick the wrong storage (datatype) for storing XML, you will get into (performance) trouble.

In principle the above is nothing new. If you choose, for example, to store a date value in a varchar2 column, it also becomes very difficult for the database to optimize all date related SQL statements. This reasoning also applies to XML. The fact that you have multiple choices makes, on the one hand, your final decision a bit more difficult (to implement), but on the other hand, also provides you with more options to address your specific needs. But be aware (at least that is my opinion)…
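
To make the varchar2 analogy concrete, a minimal sketch with a hypothetical table:

--
-- Hypothetical table storing the same value as text and as a real DATE
CREATE TABLE orders_demo
( order_date_vc VARCHAR2(10)   -- date stored as text
, order_date    DATE           -- date stored as a date
);

-- The optimizer can only do a plain string comparison here
SELECT COUNT(*) FROM orders_demo WHERE order_date_vc = '05-01-1964';

-- Here it understands date semantics: range scans, correct ordering, pruning
SELECT COUNT(*) FROM orders_demo
 WHERE order_date >= DATE '1964-01-01'
   AND order_date <  DATE '1965-01-01';
--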

NOTE

The more “flexible” the chosen XMLType datatype, the slower the performance if it comes to querying specific XML content

Not to mention “In-Memory Column Store” database functionality: of course, after choosing the proper storage you will have to provide/create indexes to support your statements and DML. Based on customer experiences, Oracle created a second whitepaper that provides a lot of information on the “how, what and where” of dealing with and creating XML indexes.

On this website there is also an XML Content page with multiple posts about this subject, which will provide you with some shortcuts.

NOTE

Support your queries with the proper XML Indexes, e.g.

  • Structured XMLIndex
  • Unstructured XMLIndex
  • In-Memory Column Store supported indexes
  • XML Indexes on XML partitioned content

In the case of XMLType Object Relational storage, multiple indexes are created by Oracle by default to maintain the references between the structures (nested tables, varrays, etc.) that are needed to uphold the XML structure defined by the used (registered) XML Schema. When using, for example, XMLType Securefile Binary XML storage, you will have to create the needed XMLIndexes to support semi-structured or structured data yourself, if needed with additional Oracle (XML datatype aware) Text indexes for unstructured XML content.
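
For the Securefile Binary XML case, such a structured XMLIndex could look roughly like the sketch below; the table, index and index-table names are hypothetical, and this does not apply to the Object Relational column used in this post.

--
-- Sketch: structured XMLIndex on a (hypothetical) Binary XML variant of the table
CREATE INDEX metadata_sxi ON ic_metadata_bin (metadata)
  INDEXTYPE IS XDB.XMLINDEX
  PARAMETERS ('XMLTable metadata_sxi_tab
                 XMLNAMESPACES (DEFAULT ''http://www.somewhere.nl/metadata''),
                 ''/opslag-metadata/dossierred''
                 COLUMNS
                   uitgave_code    VARCHAR2(4000) PATH ''uitgave/@code''
                 , publicatiedatum DATE           PATH ''publicatiedatum''
                 , publicatienr    NUMBER(38)     PATH ''publicatienr''');
--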

In my environment, we make use of an XMLType Object Relational defined column, restricted by a database registered XML Schema and, where needed, extended with extra supporting indexes on element or attribute values.

The information given above was needed to give you some context on how to deal with what follows and its results, and how it may differ in your situation. The example given is a solution, not the solution. With so many ways of actually making use of XQuery, XPath and SQL, there might be multiple ways (or a mix) of dealing with the problem. It really “all depends”… BUT hopefully you can apply/follow some of the same outlines to create a (performance) satisfying result.

Follow me on my mind scribbles trying to tune the statement…

The XQuery Challenge

So in short, our context for this tuning adventure:

  • Oracle database version 11.2.0.3.0
  • XMLType Object Relational storage
  • Multiple XML indexes in place on node level

Let’s have a look at the statement:

--
-- cost 860477
--
SELECT MAX(to_number(extractvalue(m.metadata,'/opslag-metadata/dossierred/publicatienr'))) AS MAX
  FROM ic_metadata m
 WHERE extractvalue(m.metadata,'/opslag-metadata/dossierred/uitgave/@code') = '92'
   AND extract(YEAR FROM extractvalue(m.metadata,'/opslag-metadata/dossierred/publicatiedatum')) = 1964
;

As hinted before, there are multiple “storage metadata” definitions due to this “Super/Uber XML Schema” setup. One of those is “dossierred”. Of the in total (currently) 10,875,688 XML documents in table IC_METADATA, column METADATA (XMLType OR), 403,977 are defined as “dossierred”. The statement tries to find the maximum “publicatienr” (number) in those “dossierred” XML documents where the “code” attribute matches string ’92’ (defining specific “dossierred” content), but only where the “publicatiedatum” (date) falls in the year 1964.

The plan table output showed the following (updated statistics had been gathered beforehand)…

As most of you already expected, a conversion operation on the left side of the equality sign instead of on the right is not a good idea and forced the database to do a full table scan on IC_METADATA. For the XML DB savvy amongst you, you will also notice the use of index organized table (IOT) info. This is due to using the IOT option in DBMS_XMLSCHEMA.REGISTER_SCHEMA (parameter OPTIONS / “REGISTER_NT_AS_IOT”). This is nowadays no longer the default, as it once was.

Check, if needed, the Oracle XMLDB Developers Guide for your database version in use.

REGISTER_NT_AS_IOT – Store nested tables created during schema registration as index organized tables. The default is to store nested tables as heap tables

If you also want this, be aware that you cannot create XML partitioned content on such XMLType Object Relational structured content…
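
For reference, a minimal sketch of what such a schema registration could look like; the schema URL and XDB repository path are hypothetical, and the OPTIONS parameter is where REGISTER_NT_AS_IOT would be passed.

--
BEGIN
  DBMS_XMLSCHEMA.registerSchema(
    schemaurl => 'http://www.somewhere.nl/metadata.xsd',          -- hypothetical URL
    schemadoc => xdbURIType('/home/xsd/metadata.xsd').getClob(),  -- hypothetical repository path
    gentypes  => TRUE,
    gentables => TRUE,
    options   => DBMS_XMLSCHEMA.REGISTER_NT_AS_IOT                -- nested tables as IOTs
  );
END;
/
--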

NOTE

Don’t use deprecated functionality (if not really really necessary)!

  • make use of (performance) improvements made in versions / patch levels
  • make use of current XML W3C standards
  • make use of latest implemented XML standards

So, because of the deprecated SQL/XML functions used, I started to rewrite the statement with XMLTABLE and XMLEXISTS.

--
EXPLAIN PLAN FOR
SELECT MAX(to_number(publicatienr)) AS MAX
  FROM ic_metadata m 
     , XMLTABLE(xmlnamespaces(DEFAULT 'http://www.somewhere.nl/metadata'), 
                  '$p/opslag-metadata/dossierred' 
                  PASSING m.metadata as "p"
                  COLUMNS 
                    publicatiedatum  DATE           PATH 'publicatiedatum'
                  , uitgave_code     VARCHAR2(4000) PATH 'uitgave/@code'
                  , publicatienr     VARCHAR2(4000) PATH 'publicatienr'
                 ) x1 
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p/opslag-metadata/dossierred'
                  PASSING m.metadata as "p" 
                  )
   AND x1.uitgave_code = '92'
   AND x1.publicatiedatum >= to_date('01-01-1964', 'DD-MM-YYYY') 
   AND x1.publicatiedatum <= to_date('31-12-1964', 'DD-MM-YYYY');
--
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
-- 

After fiddling for a while with the initial statement, I came up with the above. I need XMLEXISTS to filter on XML Schema based "dossierred" sections only, and I want to use XMLTABLE to get the values for "code", "publicatienr" and "publicatiedatum".

Let me explain the "initial thought processing" that already went into it...

  1. XMLEXISTS has been used to filter on those XML Schema based "dossierred" XML instance sections.
  2. XMLTABLE has been used to get the needed element and attribute values
  3. I tried to avoid the XMLType "PASSING" clause as much as possible
  4. I tried to be as specific as I could when traversing with XPath (do not use // XPath traversal, etc.)

The first two are probably obvious, but why points 3 and 4, you might ask...

Regarding 3...

Every time you use "XMLTYPE" and Oracle is not able to avoid unnecessary use of it, via rewriting or other means of optimizing the statement, the database will check the XML content for well-formedness. The bigger the (intermediate) XML content, the bigger the checking/parsing effort needed. I guess you can imagine that, if the XML instance or intermediate result is big, let's say 10 MB in size, this also comes with a nasty CPU drain and PGA memory consumption. So if not needed, avoid it.
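
As an illustration of what point 3 tries to avoid, here is a sketch in which the XMLTABLE hands back an XMLTYPE fragment column instead of scalar values; every returned fragment is again an XMLType that the database has to manage (and potentially parse/serialize), costing CPU and PGA.

--
-- Sketch: an XMLTYPE fragment column (avoid this when only scalar values are needed)
SELECT x1.dossier_frag                 -- intermediate XMLType: extra parsing, CPU and PGA
  FROM ic_metadata m
     , XMLTABLE(xmlnamespaces(DEFAULT 'http://www.somewhere.nl/metadata'),
                '$p/opslag-metadata/dossierred'
                PASSING m.metadata AS "p"
                COLUMNS
                  dossier_frag XMLTYPE PATH '.'
               ) x1;
--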

NOTE
Avoid XML parsing of XMLTYPE

  • Parsing of XML, even checking for well-formedness, is expensive; try to avoid it

Regarding 4...

Be as specific as you can be in telling the XML parser engine the exact XPath path. This will keep the (intermediate) XML instance result small (and CPU / PGA usage low) and performing. If you were to use, for example, the "//" XPath notation, you are more or less telling the XML parser "I don't know where you have to search, try a like kind of operation to find the values needed". In SQL this would be the equivalent of "LIKE '%....%'", resulting in a full table scan. In XML land, this forces the full XML document to be searched...
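
To make that concrete, two sketch variants of the same filter: the first tells the engine exactly where to look, the second ("//") forces it to search the whole document, the XML equivalent of LIKE '%...%'.

--
-- Specific path: the engine knows exactly which structure (and index) to use
SELECT COUNT(*)
  FROM ic_metadata m
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p/opslag-metadata/dossierred/uitgave[@code="92"]'
                  PASSING m.metadata AS "p");

-- "//" path: search anywhere, forcing the full XML document to be inspected
SELECT COUNT(*)
  FROM ic_metadata m
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p//uitgave[@code="92"]'
                  PASSING m.metadata AS "p");
--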

NOTE
If you need to parse XML results, keep it small

  • Keep intermediate results small to be efficient regarding CPU and PGA usage

The plan table output showed the following for the mentioned statement…

Awesome, you might say? Cost went down from initially 860477 to 796 so that is very promising regarding performance.

There is this annoying "Unoptimized XML construct detected" remark, though. The filter operation in step 2 of the plan showed an extensive, ugly cast. So I checked the datatype specification of the COLUMNS section in the XMLTABLE function. In the case of XMLType Object Relational storage, the column datatype specification is already known, due to the database objects created in the database as defined by the used/registered XML Schema.

The trick I use here is to temporarily remove the datatype specifications in the XMLTABLE COLUMNS section and use the remainder in a "CTAS" (CREATE TABLE AS SELECT) statement. This way, the database will create the appropriate columns and datatype specifications, which are exactly the ones needed in the XMLTABLE COLUMNS section.

--
CREATE TABLE TEST
AS
SELECT x1.publicatiedatum
     , x1.uitgave_code
     , x1.publicatienr
  FROM ic_metadata m 
     , XMLTable  (xmlnamespaces(DEFAULT 'http://www.somewhere.nl/metadata'), 
                  '$p/opslag-metadata/dossierred' 
                  PASSING m.metadata as "p"
                  COLUMNS 
                    publicatiedatum PATH 'publicatiedatum'
                  , uitgave_code    PATH 'uitgave/@code'
                  , publicatienr    PATH 'publicatienr'
                 ) x1 
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p/opslag-metadata/dossierred'
                  PASSING m.metadata as "p" 
                  )
 AND 1=2                  
;
--
DESCRIBE TEST
--
Name            Null Type                
--------------- ---- ------------------- 
PUBLICATIEDATUM      DATE                
UITGAVE_CODE         VARCHAR2(4000 CHAR) 
PUBLICATIENR         VARCHAR2(128 CHAR) 
--

Notice in the above that none of the elements or attributes in the COLUMNS section have a DATE, VARCHAR2 or NUMBER specification anymore…

NOTE
Use the correct datatype (specification) in XMLTABLE (and others)

  • Use the datatype needed / (XML)CAST to the proper specification
  • Use the CTAS COLUMNS/PATH trick to find the datatype specification
  • In SQL*Plus (only) you can check datatypes via “describe” on XMLType Object Relational content

Re-running DBMS_XPLAN now shows a cost of 796 for the statement below.

--
EXPLAIN PLAN FOR
SELECT MAX(to_number(x1.publicatienr)) AS MAX
  FROM ic_metadata m 
     , XMLTable  (xmlnamespaces(DEFAULT 'http://www.somewhere.nl/metadata'), 
                  '$p/opslag-metadata/dossierred' 
                  PASSING m.metadata as "p"
                  COLUMNS 
                    publicatiedatum  DATE           PATH 'publicatiedatum'
                  , uitgave_code     VARCHAR2(4000) PATH 'uitgave/@code'
                  , publicatienr     VARCHAR2(128)  PATH 'publicatienr'
                 ) x1 
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p/opslag-metadata/dossierred'
                  PASSING m.metadata as "p" 
                  )
   AND x1.uitgave_code = '92'
   AND x1.publicatiedatum >= to_date('01-01-1964', 'DD-MM-YYYY') 
   AND x1.publicatiedatum <= to_date('31-12-1964', 'DD-MM-YYYY') 
;
--

So what would happen, regarding cost, if we cast to a number for "publicatienr" in the XMLTABLE COLUMNS section instead of in the SELECT list? We have to do this anyway to determine the maximum...

--
EXPLAIN PLAN FOR
SELECT MAX(x1.publicatienr) AS MAX
  FROM ic_metadata m 
     , XMLTable  (xmlnamespaces(DEFAULT 'http://www.somewhere.nl/metadata'), 
                  '$p/opslag-metadata/dossierred' 
                  PASSING m.metadata as "p"
                  COLUMNS 
                    publicatiedatum  DATE           PATH 'publicatiedatum'
                  , uitgave_code     VARCHAR2(4000) PATH 'uitgave/@code'
                  , publicatienr     NUMBER(38)     PATH 'publicatienr'
                 ) x1 
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p/opslag-metadata/dossierred'
                  PASSING m.metadata as "p" 
                  )
   AND x1.uitgave_code = '92'
   AND x1.publicatiedatum >= to_date('01-01-1964', 'DD-MM-YYYY') 
   AND x1.publicatiedatum <= to_date('31-12-1964', 'DD-MM-YYYY') 
;
--

Alas, the overall cost figure stays the same (796), so it looks like the cast done in the XMLTABLE COLUMNS section is just as cost effective as when done in the SELECT list.

Can we improve by being more specific in the XMLEXISTS filter section? Let's try and filter on code='92' via an XQuery bind variable in the XPath predicate.

--
EXPLAIN PLAN FOR
SELECT MAX(x1.publicatienr) AS MAX
  FROM ic_metadata m 
     , XMLTable  (xmlnamespaces(DEFAULT 'http://www.somewhere.nl/metadata'), 
                  '$p/opslag-metadata/dossierred' 
                  PASSING m.metadata as "p"
                  COLUMNS 
                    publicatiedatum  DATE       PATH 'publicatiedatum'
                  , publicatienr     NUMBER(38) PATH 'publicatienr'
                 ) x1 
 WHERE XMLEXISTS ('declare default element namespace "http://www.somewhere.nl/metadata"; (::)
                  $p/opslag-metadata/dossierred[uitgave/@code=$b1]'
                  PASSING m.metadata as "p"
                        , '92'       as "b1"
                  )
   AND x1.publicatiedatum >= to_date('01-01-1964', 'DD-MM-YYYY') 
   AND x1.publicatiedatum <= to_date('31-12-1964', 'DD-MM-YYYY') 
;
--

NOTE
The better the filter criteria in XMLEXISTS, the better the performance

  • It keeps intermediate results small in size
  • It hopefully can use/access the indexes needed
  • …and avoid full XML content searching

It might not be much, but the cost figure improved a bit and now shows 794 instead of 796. This is due to the fact that Oracle can now make use of both secondary indexes, the one on “publicatie” and the one on “uitgave”. Full table access / index access on IC_METADATA is now almost entirely avoided.

Still to come, in the next post

  • Alternative XQuery approaches
  • Cheating via XMLType(O.R.) rewriting

HTH

Written by: Marco Gralike