Why XML Does and Doesn’t Fit the Real World

In its current state, my world is a container, and it is useless. I want to store content in it, but it doesn’t comply with my container…

Today I visited Cary Millsap’s blog and read his latest article, “Messed Up App of the Day”, and I could sympathize with his remarks. While looking for other articles I might have missed, I saw Cary’s link to “Joel on Software” and thought, “hey, I almost forgot about that cool website, so let’s have a look”. Joel made an impression on me a long time ago with an article called “How Microsoft Lost The API War” (2004). I browsed around Joel’s site for a bit and found a really, really great article, “Martian Headsets”, that I could relate to, because I am a pragmatist trying to apply idealistic methods. But, as said in the first line of this blog post, it just doesn’t work.

Being a “Martian”, I also like all my Martian colors of red… and I also like my “Qxyzrhjjjjukltk”. Cary has the same problem with the URLs of his “Messed Up App of the Day” blog posts (you will understand if you read Joel’s post about the “Martian Headsets”). He created a practical solution to a problem introduced by an idealistic standard.

The cool thing about XML is its natural human nature: it is free format.
The horrible thing about XML is its natural human nature: it is free format.

If I apply the structured rules I know, the relational world view of handling data, then a lot of the problems of handling it will go away (at least most of them). But… if I do… most of its strengths will vanish.

Let me try to explain…

“Measure Once, Cut Twice” and “Containerization”

Inspired by Cary Millsap’s presentation “Measure Once, Cut Twice” at Hotsos 2008, I got an idea. The following is my attempt to explain my problem with storing data. I can’t explain it via Cary’s woodcraft / scribing analogy, so my attempt is based on the logistics world of intermodal freight transport, where I started my first position in IT. The method that ruled my world was “Containerization”.

Let’s assume.

Let’s assume that a database is like an unmoving vessel, like the following super container ship filled with containers, aka tables. These containers hold the goods we are interested in: our data.

If we could stack our data more or less optimally, then an ideal situation could look like the following.

Loading and offloading are done via stack-handling optimizers; the following picture shows some examples.

It wouldn’t be so horrible if they had all stuck to one standard from the beginning, as Joel wrote in his post about “Martian Headsets”. Instead there are more than 40 (maybe even more) “standard” containers nowadays (20, 40, 45 foot), causing a lot of problems when stacking onshore or offshore, or when loading a transport medium like a train. To tackle these problems, they also try to retrofit them with an ISO standard (just like “the web standard”).

But one container, because of its format, doesn’t fit every need for transporting goods. So there are different container models for different needs.

Types

Various container types are available for different needs:[8]

  • General purpose dry van for boxes, cartons, cases, sacks, bales, pallets, drums in standard, high or half height
  • High cube palletwide containers for europallet compatibility
  • Temperature controlled from −25 °C to +25 °C reefer
  • Open top bulktainers for bulk minerals, heavy machinery
  • Open side for loading oversize pallet
  • Flushfolding flat-rack containers for heavy and bulky semi-finished goods, out of gauge cargo
  • Platform or bolster for barrels and drums, crates, cable drums, out of gauge cargo, machinery, and processed timber
  • Ventilated containers for organic products requiring ventilation
  • Tank containers for bulk liquids and dangerous goods
  • Rolling floor for difficult to handle cargo
  • Gas bottle
  • Generator
  • Collapsible ISO
  • Swapbody

Just like structuring data did for the relational world, containerization changed the world. Stack handling gets smarter and smarter, and more and more fully automated, just like the Optimizer within our Oracle database.

How to containerize a tree…

If you compare XML data to a tree, how would you fit that tree in a container?

If it were able to fit in an X foot container (XMLType CLOB based storage), then you would introduce a lot of empty (white)space that you would have to process to get to the leaf, the piece of data that you needed. If you shredded it to pieces, as with Object Relational or Binary XML storage, then you would introduce a lot of overhead during processing: during the shredding and during the reassembling of the original structure. Reassembling complete branches (with leaves and all) or only one leaf also makes a huge processing difference. And what about access paths for getting two or more leaves situated in different branches, or a leaf and a partial branch?
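To make that trade-off a bit more tangible, here is a minimal sketch in Python (not in the database, and not how Oracle implements any of its XMLType storage models). It assumes a made-up sample document and a made-up row layout of (node_id, parent_id, tag, attributes, text); the function names shred, reassemble and leaf_texts are hypothetical helpers invented for this illustration. The "whole tree in one container" variant is simply the text itself, whitespace and all; the "shredded" variant is a list of rows that has to be reassembled before you get a tree back.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the indentation is the "empty (white)space"
# that travels along when the whole tree is stored as one piece of text.
DOC = """<tree>
  <branch name="a">
    <leaf>one</leaf>
    <leaf>two</leaf>
  </branch>
  <branch name="b">
    <leaf>three</leaf>
  </branch>
</tree>"""


def shred(xml_text):
    """Flatten the tree into (node_id, parent_id, tag, attributes, text) rows."""
    rows = []

    def visit(element, parent_id):
        node_id = len(rows)  # the row position doubles as the node id
        text = (element.text or "").strip()  # whitespace-only text is dropped
        rows.append((node_id, parent_id, element.tag, dict(element.attrib), text))
        for child in element:
            visit(child, node_id)

    visit(ET.fromstring(xml_text), None)
    return rows


def reassemble(rows):
    """Rebuild the whole tree; relies on parents preceding their children."""
    elements = {}
    root = None
    for node_id, parent_id, tag, attributes, text in rows:
        if parent_id is None:
            element = ET.Element(tag, attributes)
            root = element
        else:
            element = ET.SubElement(elements[parent_id], tag, attributes)
        element.text = text or None
        elements[node_id] = element
    return root


def leaf_texts(rows):
    """Fetch only the leaves: a scan over the rows, no tree rebuild needed."""
    return [text for _, _, tag, _, text in rows if tag == "leaf"]


rows = shred(DOC)
rebuilt = ET.tostring(reassemble(rows), encoding="unicode")
print(len(DOC), len(rebuilt))  # the rebuilt text is shorter: the whitespace is gone
print(leaf_texts(rows))
```

Note that in this sketch the rebuilt document is not byte-for-byte identical to what went in (the indentation is lost), and that getting a single leaf is far cheaper than rebuilding whole branches. Both points are exactly the kind of cost the container analogy is about.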

As far as I know, there aren’t any flexible, dynamically adjustable, free-format containers yet.

Access methods are not perfect yet either. Some new attempts are being made, for instance via methods like Holistic Twig Joins, but it looks like these methods still struggle with a lot of processing problems. New methods for handling XML, like the Twig Stack method, are being explored. It looks like everyone nowadays is trying to solve this universal problem of “unstructured” data, whether it is XML, data in the BI world, or the data handled by search engine companies like Google. Searching for the ideal retrieval method for unstructured data is currently a hot topic.

My belief is that these kinds of problems are “universal”, whether the goods are stationary, as in a database (container terminal), or en route from port to port (via boats, trains or automobiles). One of the most important things about a database is that the data you put in should be identical to the data you get out (in the same context), but XML and other forms of unstructured data go even further: how do you find the data you put in? In our current internet world, we need working solutions for correctly stacking and retrieving unstructured data.

If XML were idealistic, then the problem would probably be solvable. The problem with XML is that we pragmatists like it so much… As Joel would probably say…

If you have a way to buy stock in XML flame wars,
now would be a good time to do that…

😎

Marco

Written by: Marco Gralike

4 Comments

  1. April 23

    An acquaintance of mine, Neil Gunther, once sent me a great link regarding a more graphical approach: http://techyum.com/2008/01/literature_as_cyclone_1.html

    The biggest issue, IMHO, is probably not how to store it anyway, but finding a way to retrieve data fast and efficiently.

    Semantics is one of the issues we will encounter. A graph doesn’t represent the content of a book.

    Another problem (and I tested the idea by asking my wife) is this: if I type “42” into Google, I will get the association I have with “42” (The Hitchhiker’s Guide to the Galaxy). My wife doesn’t have that association with “42”, although we did share the same one for “to be or not to be…”.

    Via Neil’s URL I also found “a four letter representation” which is really cool: http://toxi.co.uk/p5/base26/

    BTW cool post on your website.
