Hotsos 2013 – From Unstructured to Structured…

It has not been that long since I last attended Hotsos, although that is how it feels. In 2011 I flew to Hotsos to see, among others, presentations from Maria Colgan, but I ended up being sick the whole week, learning to enjoy American TV in my hotel room. In 2012 I skipped Hotsos (its 10th anniversary) thinking my schedule was too full with international presentations, but alas, that agenda cleared up unexpectedly, so in the end I missed out on some big conferences as a presenter and/or attendee.

Hotsos?

You don’t know the Hotsos Symposium? The Hotsos Symposium is, in my honest opinion, one of the most interesting symposiums/conferences out there when your goal is learning all about (Oracle) performance. This yearly event takes place in Irving, Texas, in a hotel at an isolated location near a highway, some miles from downtown Dallas. When I read that Maria Colgan would do the Hotsos Training Day (an extra symposium option), I knew I had to attend, without even seeing the rest of the symposium agenda. Maria is a naturally gifted speaker, and with her comfortable way of addressing (difficult) problems and solutions, she nowadays easily attracts the same huge audiences as people like Oracle promoter and enthusiast Thomas Kyte. I don’t like promoting the difference between developers and DBA people (there isn’t one – believe me – or at least there shouldn’t be one), but Maria is for DBAs what Tom is for developers (or that is how it feels sometimes).

Anyway… On the eve of my flight, I am looking forward to the symposium. This year I will be going with my colleague Remco (van Rijn). I hope that he will enjoy the symposium as much as I do: returning packed with (practical) ideas and lessons learned that apply to our AMIS customer challenges. I will be presenting once again during this year’s Hotsos symposium, alongside some friends, people I admire and a bunch of Dutch people who met the harsh criteria to even be “allowed” to present. Mark your (Dutch) agendas for when we Dutch guys will be doing our “Hotsos Revisited – 2013” (free) mini conference.

Creating Structure in Unstructured Data?

For this year’s Hotsos symposium I once again set my own expectations a little bit higher by creating a new presentation. With all the buzz going on regarding “Big Data” and its focus on getting unstructured data under control and making use of all this information, I zeroed in on making sense of Wikipedia data. The Wikipedia data is freely available as an XML data dump file on the internet. I picked the biggest data set possible, that is, the full English Wikipedia XML dump file. In December 2012, it contained over 10 million pages of information. Every “string” (=Wikipedia page) contains structured and unstructured data elements and varies in size from a few KB up to more than a few MB. All these pages are delivered in 1 big XML file.

My challenge in all of this was to get this data under control: loading it efficiently into an Oracle database, indexing the bits and pieces I was interested in and making it searchable for use. “Searchable for use” would focus on the structured (fixed) bits and pieces of info, but also on the completely unstructured data, the bits and pieces of info we are interested in most of the time while searching Wikipedia for knowledge. What made the challenge extra interesting was trying to do all this on my own personal computer: a computer with limited resources, running the virtual machine environment I actually used, with a database that had “only” 4 GB of RAM at its disposal. All in all, a daunting task when trying to consume a single (XML) string (file) of 42.5 GB of data. Since I started building up to my new presentation in early December 2012, I have from time to time asked myself why I set my goals so high, but then again… that’s what Hotsos is all about.
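To give a feel for why a 42.5 GB file and 4 GB of memory do not have to clash, here is a minimal sketch of reading such a dump one page at a time. It is only an illustration in Python, not the Oracle-based approach used for the presentation. The `page`, `title` and `text` element names follow the general layout of a Wikipedia dump, while the helper names, the sample data and its namespace URI are made up for this example.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}page' down to 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page> element, one page at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == "page":
            title, text = None, ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "text":
                    text = child.text or ""
            yield title, text
            root.clear()  # drop finished pages so memory use stays flat


# Tiny stand-in for the real dump; the namespace URI is hypothetical.
SAMPLE = b"""<mediawiki xmlns="http://example.org/hypothetical-export">
  <page><title>Oracle</title><revision><text>Some wiki text</text></revision></page>
  <page><title>Hotsos</title><revision><text>More wiki text</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

With a real dump, `source` would be the path to the XML file instead of the in-memory sample. The point of the design is that only one page is held in memory at any moment, whatever the size of the file.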

I have gathered a lot of data for my presentation since those cold December nights, while attempting to create “Structure in Unstructured Data” and fiddling with the huge amount of Wikipedia information. An hour of presentation time is, all in all, too short to fit the lessons learned during my evening hours over the last 3 months. On Monday 12:00 AM (EST) I will know, based on the feedback of the Hotsos audience, whether my attempts to handle the data paid off and whether they liked the content. I will be happy if the people who attended feel it was “an hour well spent”. After that I can relax and start enjoying Hotsos and the presentations of all those other very knowledgeable people.