Quering RSS Feeds The XMLDB Way

Actually this is old stuff (2006), but it got lost in a comment section. I think this can still be useful to some and I also post this here for prosperity.

Somewhere in 2006 my colleague Lucas Jellema wrote a post on the AMIS Technology site about querying rss feeds from the database. My colleague Anton Scheffer and I commented on that article with our XMLDB functionality mindset. Later on in 2007 Lucas wrote another useful post called “Querying RSS feeds in SQL…“. Peter Wolf commented that he had also written a very nice blog post about using XMLDB functionality while Integrating Yahoo Pipes into APEX.

I think this is still useful stuff to a lot of us, although you should keep in mind that the table(xmlsequence(extract())) construct will be, in time, out lived by the XMLTABLE function. Also XMLTABLE supports XPath V2, the table(xmlsequence(extract())) doesn’t. The XMLTABLE function is available from Oracle database version 10.2.

I reconstructed my statement in the comment section from the post in 2006 and “solved” the small puzzle that I couldn’t solve at that time. Of course, especially via XQuery, there are more good solutions, but the following is a first step.

Anton’s example that is still very useful from database version 9.2.0.3 up to 10.1 (if not only remembering the “old” Oracle 9.2 syntax).

SQL> SELECT VALUE( i ).EXTRACT( '/item/dc:creator/text()'
  2                           , 'xmlns:dc="http://purl.org/dc/elements/1.1/"' ).getstringval() creator 
  3  FROM   TABLE( XMLSequence( HTTPURITYPE( 'http://technology.amis.nl/blog?feed=rss2' ).getXML().EXTRACT('//item') ) ) i 
  4  ;
 
CREATOR
--------------------------------------------------
Marco Gralike
Lucas Jellema
Lucas Jellema
Aino Andriessen
Lucas Jellema
Lucas Jellema
Lucas Jellema
Lucas Jellema
Lucas Jellema
Michiel Jonkers
 
10 ROWS selected.

Although my feeling is here that you should go for a extract(xmltype) construct instead of xmltype.extract (the latter is actively discouraged by Oracle).

The advantage of my statement in 2006 is that, performance wise, it is smart to solve as much problems as possible in the XPath section before producing XML fragments, that have to be handled via the EXTRACT function. Most of the time XML fragments will be dealt with in memory and are very resource intensive regarding CPU and memory.

SQL> SELECT * 
  2  FROM xmltable('//item' 
  3                 passing HTTPURITYPE('http://www.liberidu.com/blog/?feed=rss2').getXML() 
  4                 COLUMNS title            varchar2(4000) path '/item/title/text()', 
  5                         link             varchar2(4000) path '/item/link/text()', 
  6                         publication_date varchar2(4000) path '/item/pubDate/text()', 
  7                         creator          varchar2(4000) path '/item/*[namespace-uri()="http://purl.org/dc/elements/1.1/" 
  8                                                                      and local-name()="creator"]/text()', 
  9                         description      varchar2(4000) path '/item/description/text()',
 10                         category         XMLTYPE path '/item/category/text()' 
 11  );
 
TITLE
--------------------------------------------------------------------------------
LINK
--------------------------------------------------------------------------------
PUBLICATION_DATE
--------------------------------------------------------------------------------
CREATOR
--------------------------------------------------------------------------------
DESCRIPTION
--------------------------------------------------------------------------------
CATEGORY
--------------------------------------------------------------------------------
Quering RSS Feeds The XMLDB Way
http://feeds.feedburner.com/~r/Bloggralikecom/~3/319836447/
Wed, 25 Jun 2008 16:47:19 +0000
Marco Gralike
Actually this IS OLD stuff (2006), but it got lost IN a comment SECTION. I think
 this can still be useful TO SOME AND I also post this here FOR prosperity.
Somewhere IN 2006 my colleague Lucas Jellema wrote a post ON the AMIS Technology
 site about querying rss feeds FROM the DATABASE. My colleague [...]<img src="ht
tp://feeds.feedburner.com/~r/Bloggralikecom/~4/319836447" height="1" width="1"/>
< ![CDATA[11g]]>< ![CDATA[General]]>< ![CDATA[Howto]]>< ![CDATA[Oracle]]>< ![CDATA[RD
BMS]]>< ![CDATA[XMLDB]]>< ![CDATA[XSQL, XPath, XQuery]]>< ![CDATA[httpuritype]]>< ![
CDATA[rss]]>< ![CDATA[xmlnamespaces]]>< ![CDATA[xmltable]]>< ![CDATA[Xpath]]>< ![CDA
TA[XQuery]]>
 
Wordpress LightBox Plugin Trouble - Part II
http://feeds.feedburner.com/~r/Bloggralikecom/~3/319645225/
Wed, 25 Jun 2008 11:42:53 +0000
Marco Gralike
I just couldn&#8217;t find it. I searched a lot ON the internet but couldn&#8217
;t find a solution FOR my problems described IN my post &#8220;Wordpress Plugin
Trouble&#8220;. So hopefully am rid OF it now, while switching back TO the OLD v
ersion off Lightbox JS v2.2. Hopefully this won&#8217;t cause any issues anymore
 IN Windows Internet [...]
< ![CDATA[General]]>
 
...
.etc.

You can clean it a little bit up by addressing the namespace reference issue via the following syntax:

SQL> SELECT *
  2  FROM xmltable(XMLNAMESPACES('http://purl.org/dc/elements/1.1/' AS "MGR"),
  3                 '//item'
  4                 passing HTTPURITYPE('http://www.liberidu.com/blog/?feed=rss2').getXML()
  5                 COLUMNS title            varchar2(50) path '/item/title/text()',
  6                         link             varchar2(50) path '/item/link/text()',
  7                         publication_date varchar2(50) path '/item/pubDate/text()',
  8                         creator          varchar2(50) path '/item/MGR:creator/text()',
  9                         description      varchar2(250) path '/item/description/text()',
 10                         category         XMLTYPE      path '/item/category/text()'
 11  );
 
TITLE
--------------------------------------------------
LINK
--------------------------------------------------
PUBLICATION_DATE
--------------------------------------------------
CREATOR
--------------------------------------------------
DESCRIPTION
--------------------------------------------------------------------------------
CATEGORY
--------------------------------------------------------------------------------
Quering RSS Feeds The XMLDB Way
http://feeds.feedburner.com/~r/Bloggralikecom/~3/3
Wed, 25 Jun 2008 16:47:19 +0000
Marco Gralike
Actually this IS OLD stuff (2006), but it got lost IN a comment SECTION. I think
 this can still be useful TO SOME AND I also post this here FOR prosperity.
Somewhere IN 2006 my colleague Lucas Jellema wrote a post ON the AMIS Technology
 site about q
< ![CDATA[11g]]>< ![CDATA[General]]>< ![CDATA[Howto]]>< ![CDATA[Oracle]]>< ![CDATA[RD
BMS]]>< ![CDATA[XMLDB]]>< ![CDATA[XSQL, XPath, XQuery]]>< ![CDATA[httpuritype]]>< ![
CDATA[rss]]>< ![CDATA[xmlnamespaces]]>< ![CDATA[xmltable]]>< ![CDATA[Xpath]]>< ![CDA
TA[XQuery]]>
 
Wordpress LightBox Plugin Trouble - Part II
http://feeds.feedburner.com/~r/Bloggralikecom/~3/3
Wed, 25 Jun 2008 11:42:53 +0000
Marco Gralike
I just couldn&#8217;t find it. I searched a lot ON the internet but couldn&#8217
;t find a solution FOR my problems described IN my post &#8220;Wordpress Plugin
Trouble&#8220;. So hopefully am rid OF it now, while switching back TO the OLD v
ersion off
< ![CDATA[General]]>
 
Farwell TO One OF the Best Goalkeepers We Ever Had
http://feeds.feedburner.com/~r/Bloggralikecom/~3/3
Sat, 21 Jun 2008 22:31:09 +0000
Marco Gralike
Edwin van der Sar LAST championship game has been played. Again AS &#8220;Man OF
 the Match&#8221; regarding the Dutch team; brilliant AND needed saves ALL OVER
the place during the game against Russia. His FIRST championship MATCH was IN 19
95 AND unt
< ![CDATA[General]]>

HTH

😎

Marco Gralike Written by:

7 Comments

  1. June 26

    Hi Marco,

    I used the extract method too and tried to rewrite to what you suggest, but get an error: SQL command not properly ended. I’m trying on 10.1. Is it supposed to work on there?

    Thanks,
    Dimitri

  2. June 26

    XMLTABLE is a new xml operator you can use from Oracle 10g Release 2 and upwards. My advise would be to use it from 10.2.0.3 and upwards, because the xquery functionality has been significantly improved in this patch level.

  3. January 27

    Hi Marco!
    It seems that lately the Oracle Forums RSS software has something inside that is causing the HTTPURITYPE.getXML method to fail, allegedly reporting HTTP 500 error. The curious bit is in that accessing the same RSS source with a normal RSS reader, the operation succeeds without any problems, so i started wondering what could be the possible cause of this. This started happening about around mid January and is database version independent as i’ve tried with different Oracle versions and it was confirmed by another guy who is apparently attempting to do the same elsewhere.
    I wonder if someone introduced some filtering at the level of the user agent in the forum software since that date.

    You can easily check this out by attempting to execute the following query:

    select httpuritype(‘http://forums.oracle.com/forums/rss/rssmessages.jspa?forumID=137’).getCLOB() x
    from dual;

    Any ideas?
    I hate to think that i shall need to rewrite this built-in functionality just to spoof the user agent.
    Note that also the utl_http.request_pieces is affected by the same problem.

    Flavio

  4. February 4

    Quick update: i replaced Oracle built-in getClob method with mine, that i rewrote from the ground up, and guess what? it works great!

    Flavio

  5. February 4

    Maybe worth a small blog post? So I also can reference it here?

  6. June 29

    Had similar problem.

    By adding a RSS news entry of 56 KB, select performance dropped from 2 seconds to more than one minute. Total RSS feed was then 240 KB.

    Obviously Oracle switches somewhere between varchar2 and clob implementation. And the CLOB implementation is not as efficient with some operations such as concatenation (‘||’).

    Changing ‘/item’ to:

    ‘//item[string-length(./content:encoded/text()) < 8000]'

    Fixed performance.

  7. July 3

    You should never use a // though, if you like performance that is, plus I wonder if content less than 8000 characters would always suffice…

Comments are closed.