HOWTO: Load Really Big XML Files

Sometimes you will want to load data from huge XML files into the database. So how do you achieve this?

There are more then one ways to achieve this, but most of the time a “SAX parser” is used. The term on Wikipedia for SAX is:

A Simple API for XML (SAX) is a serial access parser API for XML. SAX provides a mechanism for reading data from an XML document. It is a popular alternative to the Document Object Model (DOM).

The disadvantage sometimes of using DOM is that it uses to much resources in the sense of CPU and memory and for really huge files this method simply will not work in terms of performance. Parsing an XML document with DOM acquires the whole document to be loaded into memory before processing can be started. Via SAX only a small memory footprint is needed.

Read up on the Wikipedia pages what the advantages and disadvantages are while using SAX. So if you need to load a big XML document into an Oracle database like me, trying to upload the Mediawiki XML dumpfile, DOM validation will simply not work because of the limitations of my hardware.

A Mediawiki XML document has the following simplified structure:

< ?xml version='1.0' encoding='UTF-8'?>
<mediawiki>
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.13alpha</generator>
    <case>.</case>
      <namespaces>
      <namespace key='0'/>
    </namespaces>
  </siteinfo>
  <page>
    <title>AmericanSamoa</title>
    <id>6</id>
    <revision>
      <id>133452270</id>
      <timestamp>2007-05-25T17:12:06Z</timestamp>
      <contributor>
        <username>Gurch</username>
        <id>241822</id>
      </contributor>
      <minor />
      <comment>Revert some text</comment>
      <text xml:space='preserve'>#REDIRECT a lot of text</text>
    </revision>
  </page>
  <page>
    <title>AppliedEthics</title>
    <id>8</id>
    <revision>
      <id>133452279</id>
      <timestamp>2007-05-25T17:12:09Z</timestamp>
      <contributor>
        <username>Gurch</username>
        <id>241822</id>
      </contributor>
      <minor />
      <comment>Revert some text</comment>
      <text xml:space='preserve'>#REDIRECT a lot of text </text>
    </revision>
  </page>
  <page>
  </page>
  <page>
  </page>
  <page>
  </page>
  <page>
  </page>
   7 million page elements 
  <page>
  </page>
</mediawiki>

The content of the complextype “page”, is the XML fragment that I am interested in. At current state a dump XML file, like the one from Mediawiki that is, approximately 17 Gigabytes in size (ASCII, uncompressed), containing roundabout 7 million “page” elements.

Loading XML data

So what do you need?

You will need:

  • A load utility based on a SAX parser
  • An XML document with the data you want to load
  • An XML schema that represents the structure of the XML document

For the Mediawiki data this can be downloaded, for instance you don’t have to go the very big English version, as shown here:

Creating a SAX Parser

While trying to replay Marcelo Ochea’s example about “Uploading Wikipedia Dumps to Oracle databases“, I was facing the problem of trying to create a utility based on SAX. The old documentation and source code once demonstrated on OTN, as described in my post “Oracle OTN SAX Loader Example“, is based on “very old” source and Java libraries. For example, while trying to compile the code, error messages displayed that some methods were deprecated and not contained anymore in the Java libraries. I have not enough Java programming experience (yet) to circumvent this, nor was my intention to spend to much time to achieve my goal: loading XML data into an Oracle database.

My second attempt was based on the Java code as described and delivered by Mark Drake on the OTN XMLDB forum FAQ thread. This Java code is currently up to date while using it with the current Java libraries. It is described by Mark and divided into the following sections. I derived the names from the main classes, not from the title’s Mark used:

  • BaseApplication.java
  • ConnectionProvider.java
  • DatabaseWriter.java
  • FileWriter.java
  • LoaderStatistics.java
  • ProcessingCompleteException.java
  • SaxProcessor.java
  • SourceProcessor.java

and a configuration file called:

  • connection.xml

Be aware that “hardcoded” connection default settings are set in “ConnectionProvider.java”.

I know how (see “Oracle OTN SAX Loader Example“) I can compile the command line, the Java development kit and Ultraedit as my main tooling, but this time I was persistent to try to learn how this can be achieved in Oracle’s JDeveloper environment.

I am using JDeveloper more and more, mainly for XML and XML Schema design and I hope/believe the XML parser is in sync with the one while using Oracle XMLDB functionality, but never had tried to compile a Java application. So being a newbie on this, I tried to make it work and in the end I succeeded. For all those out there in the same position, this is how it works.

I used Oracle JDeveloper 10g Studio Release 3 (10.1.3- Build 3673) for the following example.

  1. Download Oracle JDeveloper via OTN and start it up
  2. Click on “File” in the menu and the on “New”


  3. Click picture to enlarge

  4. In the category “General” be sure that the item “Application” is selected and click the “OK” button.
  5. A “Create Application” window will pop up where you can fill in details.
    • I named my Application “SAXLoader” and want the source code to be in
    • the directory “C:\SAXLoader for Oracle”. After this I clicked on the “OK” button.



    Click picture to enlarge

  6. The next window popped up for details about “Project Name” and “Directory Name”
    • Filled in for Project Name: “SAX Loader”
    • The Directory name automatically changed and that was good enough for me so left it that way (“C:\SAXLoader for Oracle\SAX Loader”)
    • Now click on the “OK” button again.



    Click picture to enlarge

  7. Now mark and right click the “SAX loader” project. You will be presented with an option for “New” in the menu.


  8. Click picture to enlarge

  9. While clicking on “New” a new window will present itself with options. Pick the category “General” and in “Item” the option “Java Class”.


  10. Click picture to enlarge

  11. A “Create Java Class” window pops up. Here I entered the “Name” field with the same name as the main class name in the source code. So I called the first class “BaseApplication” as it is mentioned in the BaseApplication(.java) source. I didn’t change anything else and clicked the “OK” button again.


  12. Click picture to enlarge

  13. In the “BaseApplication.java” sub window, I selected all, by JDeveloper, generated standard class code and replaced it with the BaseApplication code from OTN, by Mark Drake and saved this situation via “Crtl-S” (the shortcut for “Save” it)


  14. Click picture to enlarge



    Click picture to enlarge

  15. I repeated steps 7 till 9 for all other the other Java code.


  16. Click picture to enlarge

  17. I added the connection.xml configuration file by clicking (step 7 and 8 ) “File” instead off “Java Class” and clicked on the “OK” button.


  18. Click picture to enlarge

  19. A “Create File” window opened and I entered “connection.xml” as the “File Name” and clicked on the “OK” button.
  20. JDeveloper now created an connection.xml file (and started to complain about no “root” element and not “wellformed” xml). This is correct of course because the file doesn’t contain any XML yet. In the “connection.xml” window on the right I copy / pasted the configuration file details and altered the parameters to my needs (different database, hostname, etc) as the ones Mark described). And saved the file.


  21. Click picture to enlarge

  22. Because we still some needed Java libraries, JDeveloper starts to complain about missing, for instance, OCI and SAXParser code so we have to add those Libraries. So right click on the “SAX Loader” project and the click on “Project Properties”. A new “Project Properties” Window will pop up were we choose “Liberaries” in the left pane and click on the right pane “Add Library” button to add the library for “Oracle XML Parser v2″.


  23. Click picture to enlarge



    Click picture to enlarge



    Click picture to enlarge

  24. In the end I added the libraries “Oracle XML Parser v2″ and “Oracle JDBC”.


  25. Click picture to enlarge

  26. If you “Rebuild” or “Make” your SAX Loader project it now “only” complains about naming conventions (“Warning name X does not match source file name”). You could fix that, but I didn’t (yet).
  27. The first time you actually run the application it will ask you to “Choose the Default Run Target”. I have set it to the “SaxProcessor.class” file.


  28. Click picture to enlarge

During the first “make” it came up with “Successful compilation: 0 errors, 0 warnings.”. So I guess I am set to go.

8-)

Preparing a Load Run

If you have a look at the configuration “connection.xml” file that controls some process parameters then you will have in the end something like the following:

Content of connection.xml

< ?xml version='1.0' encoding='windows-1252'?>
<connection>
  <driver>Thin</driver>
  <hostname>10.252.252.102</hostname>
  <port>1521</port>
  <servicename>ONTW</servicename>
  <sid>ONTW</sid>
  <servermode>DEDICATED</servermode>
  <schema>wiki</schema>
  <password>wiki</password>
  <sourcexml>C:\wiki\enwiki-latest-pages-articles.xml</sourcexml>
  <element>page</element>
  <table>WIKI_STAGE</table>
  <errortable>WIKI_STAGE_ERRORS</errortable>
  <schemainstanceprefix>xsi</schemainstanceprefix>
  <schemalocation />
  <nonamespaceschemalocation>
    http://www.mediawiki.org/xml/export-0.3/
  </nonamespaceschemalocation>
  <commitcharge>500</commitcharge>
  <threadcount>4</threadcount>
</connection>

You will have the following options (see also the description on the OTN XMLDB Forum by Mark Drake):

  • Driver can be Thin or OCI
  • Schema and Password are the database schema name and password for the JDBC connection
  • SouceXML is the name of the file to be processed
  • Element is the name of the element which forms the root element of each fragment which is to be loaded
  • Table is the name of the table that the fragment should be inserted into
  • SchemaInstancePrefix is the prefix used for the XML schema instance namespace
  • noNamespaceSchemaLocation or schemaLocation is the SchemaURL for the target table
  • CommitCharge is how many records a thread should insert before issuing a commit
  • ThreadCount is the number of parallel writer threads that should be started.

As I noticed during some trial and error runs, some defaults are defined in “ConnectionProvider.java”, for instance where the “connection.xml” file resides. So if you run into errors, fix the path and rebuild the Java SAX Loader program.

Depending on the database version, you will have the choice to load XML data in:

  • CLOB
  • A CLOB based XMLType
  • An Object Relational based XMLType
  • An Binary XML based XMLType ( > 11gR1)

I created for testing purposes a dedicated tablespace called “MEDIAWIKI_STAGE” on my Oracle 11gR1 database (Oracle Enterprise Linux V5 32 bit VMware) with the following characteristics:

SQL> SELECT dbms_metadata.get_ddl('TABLESPACE','MEDIAWIKI_STAGE')
2  FROM dual;
 
DBMS_METADATA.GET_DDL('TABLESPACE','MEDIAWIKI_STAGE')
-----------------------------------------------------------------
CREATE TABLESPACE "MEDIAWIKI_STAGE" DATAFILE
'/u02/oracle/oradata/ONTW/mediawiki_stage01.dbf'SIZE 104857600
AUTOEXTEND ON NEXT 104857600 MAXSIZE 10240M
LOGGING ONLINE PERMANENT BLOCKSIZE 8192
EXTENT MANAGEMENT LOCAL
AUTOALLOCATE SEGMENT SPACE MANAGEMENT AUTO
 
ALTER DATABASE DATAFILE
'/u02/oracle/oradata/ONTW/mediawiki_stage01.dbf'
RESIZE 24838209536
 
1 ROW selected.

The table WIKI_STAGE and WIKI_STAGE_ERRORS were created like the following statement:

SQL> CREATE TABLE "WIKI"."WIKI_STAGE" OF "SYS"."XMLTYPE"
  2  XMLTYPE STORE AS SECUREFILE BINARY XML
  3  TABLESPACE MEDIAWIKI_STAGE;
 
TABLE created.
 
SQL> CREATE TABLE "WIKI"."WIKI_STAGE_ERRORS" OF "SYS"."XMLTYPE"
  2  XMLTYPE STORE AS SECUREFILE BINARY XML
  3  TABLESPACE MEDIAWIKI_STAGE;
 
TABLE created.

So the only thing to do now is set the files (XML file and XML Schema) and you are set to go to run the Java SAX loader application and load you data into the Oracle database.



Click picture to enlarge

It is working. So now I am able to do some serious loading business.

;-)

HTH

Marco

Related Posts