HOWTO: Load Really Big XML Files

Sometimes you will want to load data from huge XML files into the database. So how do you achieve this?

There are more then one ways to achieve this, but most of the time a “SAX parser” is used. The term on Wikipedia for SAX is:

A Simple API for XML (SAX) is a serial access parser API for XML. SAX provides a mechanism for reading data from an XML document. It is a popular alternative to the Document Object Model (DOM).

The disadvantage sometimes of using DOM is that it uses to much resources in the sense of CPU and memory and for really huge files this method simply will not work in terms of performance. Parsing an XML document with DOM acquires the whole document to be loaded into memory before processing can be started. Via SAX only a small memory footprint is needed.

Read up on the Wikipedia pages what the advantages and disadvantages are while using SAX. So if you need to load a big XML document into an Oracle database like me, trying to upload the Mediawiki XML dumpfile, DOM validation will simply not work because of the limitations of my hardware.

A Mediawiki XML document has the following simplified structure:

< ?xml version='1.0' encoding='UTF-8'?>

  
    Wikipedia
    http://en.wikipedia.org/wiki/Main_Page
    MediaWiki 1.13alpha
    .
      
      
    
  
  
    AmericanSamoa
    6
    
      133452270
      2007-05-25T17:12:06Z
      
        Gurch
        241822
      
      
      Revert some text
      #REDIRECT a lot of text
    
  
  
    AppliedEthics
    8
    
      133452279
      2007-05-25T17:12:09Z
      
        Gurch
        241822
      
      
      Revert some text
      #REDIRECT a lot of text 
    
  
  
  
  
  
  
  
  
  
   7 million page elements 
  
  

The content of the complextype “page”, is the XML fragment that I am interested in. At current state a dump XML file, like the one from Mediawiki that is, approximately 17 Gigabytes in size (ASCII, uncompressed), containing roundabout 7 million “page” elements.

Loading XML data

So what do you need?

You will need:

  • A load utility based on a SAX parser
  • An XML document with the data you want to load
  • An XML schema that represents the structure of the XML document

For the Mediawiki data this can be downloaded, for instance you don’t have to go the very big English version, as shown here:

Creating a SAX Parser

While trying to replay Marcelo Ochea’s example about “Uploading Wikipedia Dumps to Oracle databases“, I was facing the problem of trying to create a utility based on SAX. The old documentation and source code once demonstrated on OTN, as described in my post “Oracle OTN SAX Loader Example“, is based on “very old” source and Java libraries. For example, while trying to compile the code, error messages displayed that some methods were deprecated and not contained anymore in the Java libraries. I have not enough Java programming experience (yet) to circumvent this, nor was my intention to spend to much time to achieve my goal: loading XML data into an Oracle database.

My second attempt was based on the Java code as described and delivered by Mark Drake on the OTN XMLDB forum FAQ thread. This Java code is currently up to date while using it with the current Java libraries. It is described by Mark and divided into the following sections. I derived the names from the main classes, not from the title’s Mark used:

  • BaseApplication.java
  • ConnectionProvider.java
  • DatabaseWriter.java
  • FileWriter.java
  • LoaderStatistics.java
  • ProcessingCompleteException.java
  • SaxProcessor.java
  • SourceProcessor.java

and a configuration file called:

  • connection.xml

Be aware that “hardcoded” connection default settings are set in “ConnectionProvider.java”.

I know how (see “Oracle OTN SAX Loader Example“) I can compile the command line, the Java development kit and Ultraedit as my main tooling, but this time I was persistent to try to learn how this can be achieved in Oracle’s JDeveloper environment.

I am using JDeveloper more and more, mainly for XML and XML Schema design and I hope/believe the XML parser is in sync with the one while using Oracle XMLDB functionality, but never had tried to compile a Java application. So being a newbie on this, I tried to make it work and in the end I succeeded. For all those out there in the same position, this is how it works.

I used Oracle JDeveloper 10g Studio Release 3 (10.1.3- Build 3673) for the following example.

  1. Download Oracle JDeveloper via OTN and start it up
  2. Click on “File” in the menu and the on “New”


  3. Click picture to enlarge

  4. In the category “General” be sure that the item “Application” is selected and click the “OK” button.
  5. A “Create Application” window will pop up where you can fill in details.
    • I named my Application “SAXLoader” and want the source code to be in
    • the directory “C:\SAXLoader for Oracle”. After this I clicked on the “OK” button.



    Click picture to enlarge

  6. The next window popped up for details about “Project Name” and “Directory Name”
    • Filled in for Project Name: “SAX Loader”
    • The Directory name automatically changed and that was good enough for me so left it that way (“C:\SAXLoader for Oracle\SAX Loader”)
    • Now click on the “OK” button again.



    Click picture to enlarge

  7. Now mark and right click the “SAX loader” project. You will be presented with an option for “New” in the menu.


  8. Click picture to enlarge

  9. While clicking on “New” a new window will present itself with options. Pick the category “General” and in “Item” the option “Java Class”.


  10. Click picture to enlarge

  11. A “Create Java Class” window pops up. Here I entered the “Name” field with the same name as the main class name in the source code. So I called the first class “BaseApplication” as it is mentioned in the BaseApplication(.java) source. I didn’t change anything else and clicked the “OK” button again.


  12. Click picture to enlarge

  13. In the “BaseApplication.java” sub window, I selected all, by JDeveloper, generated standard class code and replaced it with the BaseApplication code from OTN, by Mark Drake and saved this situation via “Crtl-S” (the shortcut for “Save” it)


  14. Click picture to enlarge



    Click picture to enlarge

  15. I repeated steps 7 till 9 for all other the other Java code.


  16. Click picture to enlarge

  17. I added the connection.xml configuration file by clicking (step 7 and 8 ) “File” instead off “Java Class” and clicked on the “OK” button.


  18. Click picture to enlarge

  19. A “Create File” window opened and I entered “connection.xml” as the “File Name” and clicked on the “OK” button.
  20. JDeveloper now created an connection.xml file (and started to complain about no “root” element and not “wellformed” xml). This is correct of course because the file doesn’t contain any XML yet. In the “connection.xml” window on the right I copy / pasted the configuration file details and altered the parameters to my needs (different database, hostname, etc) as the ones Mark described). And saved the file.


  21. Click picture to enlarge

  22. Because we still some needed Java libraries, JDeveloper starts to complain about missing, for instance, OCI and SAXParser code so we have to add those Libraries. So right click on the “SAX Loader” project and the click on “Project Properties”. A new “Project Properties” Window will pop up were we choose “Liberaries” in the left pane and click on the right pane “Add Library” button to add the library for “Oracle XML Parser v2”.


  23. Click picture to enlarge



    Click picture to enlarge



    Click picture to enlarge

  24. In the end I added the libraries “Oracle XML Parser v2” and “Oracle JDBC”.


  25. Click picture to enlarge

  26. If you “Rebuild” or “Make” your SAX Loader project it now “only” complains about naming conventions (“Warning name X does not match source file name”). You could fix that, but I didn’t (yet).
  27. The first time you actually run the application it will ask you to “Choose the Default Run Target”. I have set it to the “SaxProcessor.class” file.


  28. Click picture to enlarge

During the first “make” it came up with “Successful compilation: 0 errors, 0 warnings.”. So I guess I am set to go.

😎

Preparing a Load Run

If you have a look at the configuration “connection.xml” file that controls some process parameters then you will have in the end something like the following:

Content of connection.xml

< ?xml version='1.0' encoding='windows-1252'?>

  Thin
  10.252.252.102
  1521
  ONTW
  ONTW
  DEDICATED
  wiki
  wiki
  C:\wiki\enwiki-latest-pages-articles.xml
  page
  WIKI_STAGE
WIKI_STAGE_ERRORS xsi http://www.mediawiki.org/xml/export-0.3/ 500 4

You will have the following options (see also the description on the OTN XMLDB Forum by Mark Drake):

  • Driver can be Thin or OCI
  • Schema and Password are the database schema name and password for the JDBC connection
  • SouceXML is the name of the file to be processed
  • Element is the name of the element which forms the root element of each fragment which is to be loaded
  • Table is the name of the table that the fragment should be inserted into
  • SchemaInstancePrefix is the prefix used for the XML schema instance namespace
  • noNamespaceSchemaLocation or schemaLocation is the SchemaURL for the target table
  • CommitCharge is how many records a thread should insert before issuing a commit
  • ThreadCount is the number of parallel writer threads that should be started.

As I noticed during some trial and error runs, some defaults are defined in “ConnectionProvider.java”, for instance where the “connection.xml” file resides. So if you run into errors, fix the path and rebuild the Java SAX Loader program.

Depending on the database version, you will have the choice to load XML data in:

  • CLOB
  • A CLOB based XMLType
  • An Object Relational based XMLType
  • An Binary XML based XMLType ( > 11gR1)

I created for testing purposes a dedicated tablespace called “MEDIAWIKI_STAGE” on my Oracle 11gR1 database (Oracle Enterprise Linux V5 32 bit VMware) with the following characteristics:

SQL> select dbms_metadata.get_ddl('TABLESPACE','MEDIAWIKI_STAGE')
2  from dual;

DBMS_METADATA.GET_DDL('TABLESPACE','MEDIAWIKI_STAGE')
-----------------------------------------------------------------
CREATE TABLESPACE "MEDIAWIKI_STAGE" DATAFILE
'/u02/oracle/oradata/ONTW/mediawiki_stage01.dbf'SIZE 104857600
AUTOEXTEND ON NEXT 104857600 MAXSIZE 10240M
LOGGING ONLINE PERMANENT BLOCKSIZE 8192
EXTENT MANAGEMENT LOCAL
AUTOALLOCATE SEGMENT SPACE MANAGEMENT AUTO

ALTER DATABASE DATAFILE
'/u02/oracle/oradata/ONTW/mediawiki_stage01.dbf'
RESIZE 24838209536

1 row selected.

The table WIKI_STAGE and WIKI_STAGE_ERRORS were created like the following statement:

SQL> CREATE TABLE "WIKI"."WIKI_STAGE" OF "SYS"."XMLTYPE"
  2  XMLTYPE STORE AS SECUREFILE BINARY XML
  3  TABLESPACE MEDIAWIKI_STAGE;

Table created.

SQL> CREATE TABLE "WIKI"."WIKI_STAGE_ERRORS" OF "SYS"."XMLTYPE"
  2  XMLTYPE STORE AS SECUREFILE BINARY XML
  3  TABLESPACE MEDIAWIKI_STAGE;

Table created.

So the only thing to do now is set the files (XML file and XML Schema) and you are set to go to run the Java SAX loader application and load you data into the Oracle database.



Click picture to enlarge

It is working. So now I am able to do some serious loading business.

😉

HTH

Marco

Related Posts

Marco Gralike Written by:

21 Comments

  1. July 12

    Marco,

    Great post. I’ve been thinking about doing something like this myself. Now I have a link to share when people ask me how to load large XML files.

    Thanks,

    LewisC

  2. July 12

    Your welcome, Lewis.

    😉

    I needed this piece for two reasons:

    1. To demonstrate to myself I could do this task in JDeveloper, as well as my normal routine to do it on the command line
    2. Because I need this as a start for some series about XMLDB performance and test scenario’s

    All the “groundwork” (and thanks) is off course done to Marchelo and Mark, but the problem for me is that those guys are far better Java programmers then me and they probably though that this is common knowledge (which it isn’t, at least not for me). It is almost embarrassing for me to see how much time I spend to figure out all the JDeveloper details.

    It was a good exercise for me in the end and I hope other people can use it.

  3. Michael
    July 21

    Nice post. Any comparison of loading with respect to performance and storage as compared to just a 10g xmltype column?

  4. July 21

    I wanted to, but I had to destroy my 10.2.0.3 and 11.1.0.6 software installations because I needed to switch from the standard to the enterprise edition.

    For now I pin pointed to the 11gR1 instance (binary xml). The moment I will start comparing CLOB agains Binary (next step, because they are closest in some of their use and behavior) I will involve the 10.2.0.3 (probably 10.2.0.4) software again as well.

  5. July 24

    Marco,
    great posting, there is one thing i am not clear with though: how big was the test file you used for testing the process (shown in the last picture i guess)?
    17Gb or less?

    Another problem that is haunting me lately is the definition of “big XML file”.
    What i am still missing to understand in this definition is the relationship between the size of the XML file and the available resources of the database instance. How are they related to each other? I mean, are we talking of a logarithmic scale, an exponential scale, a linear factor?
    I was doing some tests using traditional DOM validation with not incredibly large XML documents and object-relational storage model and i am getting really nasty errors with just a few megabytes (ORA-03113 in particular), so i was really questioning the approach, hence the interest for the SAX approach, although i find it overly complicated if you imagine that i would like to release a 1-click installation package…

    I am still investigating what are the real physical constraints in the traditional DOM approach because i’d like to have a clear picture before taking any decisions.

    Regards,
    Flavio

  6. July 24

    The XML file being processed is 17 Gb as shown on NTFS / via Windows Explorer.

    What is “big”…

    An XML document can be big, is seen big in the XML world if it exceeds 1.5 Mb. Or at least that is my gut feeling.

    It also depends a little bit the complexity of the structure and the way complextypes, or schemata via, for instance “include”‘s, etc, have circular dependencies.

    The XML document can contain a structure that could represent a piece off data that has been stored in a SAP Oracle Schema (with 1000th off tables etc), with 1 record in all these tables, being in total “the XML document”.

    So what is “big”…?

    Another problem is that, at current state, if you want a piece of specified data or a fragment then via the Document Object Module, the whole XML structure has to be build into memory before this structure can be used to search for the specified fragment or data piece wanted.

    So regarding DOM, you will have to take into account what the method is to do it, via Java (Oracle JVM Kernel) or via a more optimized method build in the Oracle C kernel. So “java_pool” can suddenly be an important factor to be aware off, besides PGA and other structures…

    Regarding the relational factors (logarithmic scale, an exponential scale, a linear factor), this is still one on my to do list.

    You could test with a statement done via XQuery. In 10.2.0.2 it has still dependencies with the JVM bit / loaded classes. In 10.2.0.3 XQuery has been further optimized and is dependent on C code.

    You should be able to read up on it when searching for “xquery” / “mdrake” / “10.2” via the OTN XMLDB Forum.

  7. suresh
    January 28

    Hi,

    We have xml files generated 2 to 4 each min with each file having 200000 records. This needs to be loaded into orcle 10g/11g into individual columns represented by the tags. All records in the xml file have similar tags with the ROWSET at the start and end.

    Regarding SAX utility, do we have to go through the steps you mentioned using the gui interface? As we will be on unix os, will it be different (should we download on unix etc…)? Also, can we schedule these loads just like we do on cron?

  8. January 28

    I am not absolutely sure what your question is here, but my guess (if I interpret your question correctly) would by to have a look at updateable “XMLType Views” (using an extra trigger for update)

    I am not a very good java programmer but the code used can probably made “executable” very easily by someone that knows java.

  9. Des McCrory
    September 12

    Hi Marco,

    A big thanks to you, and Mark Drake. After a week or so of frustration in building a solution, getting an “out of process memory” and back to the drawing board to find an alternative I had a solution and document loaded in one hour on a Friday night, and with one eye on the football on TV :-).

    I was about to try and build my own Sax Loader (and in PL/SQL too lol) when I came across your blog and the link to Marks code.

    Like you it also gave me an opportunity to play with JDeveloper, with some confidence this time thankyou and test my limited Java skills (I use the standard DBA response “I haven’t time to read the Java manual yet).

    Now I can go and give the good news to the client :-D.

    Kind regards,
    Des

  10. Bill Conniff
    October 15

    New technologies are being developed to addess the size and performance issues of xml. Xponent software’s XMLMax loads any size xml into a treeview using at most 20MB of memory and can do XSL transformations within the same memory limit. The CAX xml parser is a pull parser that can look backward through all parsed xml, thereby enabling any xml transformation with a fast pull parser and without memory constraints.

  11. October 15

    @Bill, sounds interesting, but is the XML parser content aware…?

  12. Davide C.
    January 19

    Hi Marco,
    this is a very nice post.
    For my information, how long it has to load the file?

  13. January 19

    For what it is worth (not that much actually if you start thinking about it) I started the program on a now I guess 7 year old 1Gb 1 CPU desktop to load the 17 Gb Wikidump data accross the network in multiple different setup Binary XML tables. The different storage setups took approx. between 20 and 23 hours. The Binary XMLType tables were not XML Schema driven at that time, that is, based on a registered XML Schema in the XDB Repository.

  14. Aaron C.
    March 4

    Marco,
    Thanks for all you and Mark Drake have done trying to catch the rest of us up on XMLDB. If given the goal to take an incoming large xml file similar in size to the mediawiki file above and shred it into ultimately relational, not object-relational, what would your recommendation be in terms of approach on 11g? I see most seem to favor a SAX approach and/or chunking the large file into smaller files. Is anything else even viable from an XMLDB perspective performance-wise? Given the final goal is relational is there an XMLDB approach that could compete with shredding into files outside of Oracle and then loading via sqlldr/external tables?

  15. doul
    May 3

    Hi,

    When I run the standalone application “saxloader”, whether using the sample file “Accounts.xml” or my own files, I get this error:

    ****************************************************************************************************************************
    Thread Writer_000001 started at 3 mai 2010 22:14:37 CEST
    Thread Writer_000001 using SQL : insert into ACCOUNTS_TABLE values (xmltype(?).createSchemaBasedXML(‘http://otn.oracle.com/xmldb/account.xsd’))
    BaseApplication.getConnection(): Database Connection Established
    Thread Writer_000002 started at 3 mai 2010 22:14:37 CEST
    Thread Writer_000002 using SQL : insert into ACCOUNTS_TABLE values (xmltype(?).createSchemaBasedXML(‘http://otn.oracle.com/xmldb/account.xsd’))
    Unexpected Exception raised in thread Writer_000002
    java.sql.SQLException: ORA-30676:socket read or write failed
    ORA-06512: à “SYS.XMLTYPE”, ligne 272
    ORA-06512: à ligne 1

    at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
    at oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
    at oracle.jdbc.ttc7.Oall7.receive(Oall7.java:573)
    at oracle.jdbc.ttc7.TTC7Protocol.doOall7(TTC7Protocol.java:1891)
    at oracle.jdbc.ttc7.TTC7Protocol.parseExecuteFetch(TTC7Protocol.java:1093)
    at oracle.jdbc.driver.OracleStatement.executeNonQuery(OracleStatement.java:2047)
    at oracle.jdbc.driver.OracleStatement.doExecuteOther(OracleStatement.java:1940)
    at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:2709)
    at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:589)
    at oracle.otnsamples.xmldb.saxloader.examples.DocumentWriter.uploadDocument(DocumentWriter.java:336)
    at oracle.otnsamples.xmldb.saxloader.examples.DocumentWriter.run(DocumentWriter.java:286)

    Unexpected Exception raised in thread Writer_000001
    java.sql.SQLException: ORA-30676: échec de lecture ou d’écriture du socket
    ORA-06512: à “SYS.XMLTYPE”, ligne 272
    ORA-06512: à ligne 1

    at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:134)
    at oracle.jdbc.ttc7.TTIoer.processError(TTIoer.java:289)
    at oracle.jdbc.ttc7.Oall7.receive(Oall7.java:573)
    at oracle.jdbc.ttc7.TTC7Protocol.doOall7(TTC7Protocol.java:1891)
    at oracle.jdbc.ttc7.TTC7Protocol.parseExecuteFetch(TTC7Protocol.java:1093)
    at oracle.jdbc.driver.OracleStatement.executeNonQuery(OracleStatement.java:2047)
    at oracle.jdbc.driver.OracleStatement.doExecuteOther(OracleStatement.java:1940)
    at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:2709)
    at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:589)
    at oracle.otnsamples.xmldb.saxloader.examples.DocumentWriter.uploadDocument(DocumentWriter.java:336)
    at oracle.otnsamples.xmldb.saxloader.examples.DocumentWriter.run(DocumentWriter.java:286)

    Thread Writer_000001 completed at 3 mai 2010 22:14:37 CEST
    Thread Writer_000002 completed at 3 mai 2010 22:14:37 CEST
    Active Children = 0
    Documents Queued = 5. Docuuments De-Queued = 2. Documents Written = 2.
    Writer_000001,3 mai 2010 22:14:37 CEST,3 mai 2010 22:14:37 CEST,0,0
    Writer_000002,3 mai 2010 22:14:37 CEST,3 mai 2010 22:14:37 CEST,0,0
    Process exited with exit code 0.
    *************************************************************************************************************************************

    So I wonder what the problem is.

    Could you help please?

    Regards.

  16. doul
    May 3

    Hi everybody,

    thank you Marco for your explanation…I did it eventually…now my XML fragments are right in the database.

    Doul

  17. OK. Would have guessed that the XML Schema was not yet registered in the XDB Repository ’http://otn.oracle.com/xmldb/account.xsd’ via DBMS_XMLSCHEMA

  18. David horton
    April 8

    Rejected by admin – Commercial spam

Comments are closed.