XML Mapping to and from Objects

Today's post is about one of the little libraries I'm developing as part of Iron Lute. In the previous posts, I've laid out what I believe were successes; today, for a change of pace, I highlight what is so far a failure.

Please note that today's post is really more of an XML post than an outliner post; if you're interested in programming with XML, stay tuned. If you're only interested in outlining, you should probably move on. I've tried in the previous posts to be accessible to interested laymen, but this one may be useful only to programmers. Consider yourselves fairly warned.

XML Marshalling

Assume that we actually know how to use Object Orientation to our advantage. Suppose we want to transfer data via XML, importing from and exporting to XML documents without restriction.

As anyone who has worked with this knows, objects do not automatically map out to XML. There are libraries that help take XML into objects, like DOM, but you do not get to choose the objects, or if you do, you are still constrained to the XML format. In the other direction, you get things like this module for Python that will implement the "pickle" module in XML, but good luck using such language-specific marshalling in any other environment.

In OO terms, both styles of library end up causing excessive coupling, either coupling the XML to the object model (to the detriment of others who would like to read the XML), or coupling the object model to the XML (which almost inevitably results in serious compromises to the power, flexibility, and usefulness of the object model).

If you want to retain flexibility, you generally need to write a separate marshalling layer for your objects, which will manage the translation in both directions. Then your XML format and data structure may both be optimized for their respective uses with no compromises, but at the cost of maintaining this extra layer.

In Iron Lute, which will deal with both XML and non-XML formats, I've implemented this as a "filetype" abstraction: a Python module that keeps track of some metadata and some classes which are responsible for knowing how to convert the target format to and from the outline nodes in the memory format I've discussed earlier.

Quite a lot of those file types are of course XML or XML-like. I began to consciously recognize something that had been percolating in my subconscious already: in that middle layer, there is a lot of redundancy. It is a peculiar kind of "paired" redundancy, where the same pairs of operations are repeated, over and over again.

For one common example, it is quite typical to have something like the HTML element <ul>, which may contain <li>'s. Going from the XML into the Object Model, you have code that switches into some sort of "ul" mode and reads the subsequent li's into a container object. Going the other way, you iterate over the container object and spit out each of the li's into the XML stream.
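Here is a minimal sketch of that paired redundancy. It uses the standard library's ElementTree rather than a SAX-style reader purely for brevity, and the function names read_ul and write_ul are made up for illustration; they are not Iron Lute code.

```python
import xml.etree.ElementTree as ET

# Hypothetical hand-written pair: the two functions mirror each other,
# which is the "paired" redundancy described above.

def read_ul(ul_element):
    """XML -> objects: collect the text of each <li> child into a list."""
    return [li.text or "" for li in ul_element.findall("li")]

def write_ul(items):
    """Objects -> XML: emit one <li> per item under a <ul> element."""
    ul = ET.Element("ul")
    for item in items:
        li = ET.SubElement(ul, "li")
        li.text = item
    return ul

source = "<ul><li>first</li><li>second</li></ul>"
items = read_ul(ET.fromstring(source))
round_tripped = ET.tostring(write_ul(items), encoding="unicode")
```

Every container-like element in every file type needs roughly this same pair, with only the tag names changed.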

Because of this similarity, I began the process of factoring it out. Of all the aspects of Iron Lute, this is by far the most experimental. Outliners have been done before; in the end the only truly novel bullet-point feature Iron Lute may end up with is that it is "quality open source Python". To my knowledge, the only other attempt at this sort of XML library is a Java library called JAXB, and truthfully I'm not certain that that library is aiming for 100% independence of the XML and custom objects like I am; rather it still seems to defer a lot of the structure of the data objects onto the XML format itself.

Theory: Why Is This A Good Idea?

XMLMapping (the current name of the library, which is horrid) is built on the idea that you may have a pre-existing XML format, and a pre-existing Object Model, and you may not want or even be able to bend one to suit the other.

In theoretical terms, this means that XMLMapping completely decouples the XML format from the data format. When you put it that way, I'm surprised that there hasn't been more work in this area.

The ideal I'm shooting for is to be able to blur the distinction between the XML representation and the object, with no compromises on either the XML representation or the object design. This flexibility opens a lot of doors, like being able to throw objects into a Jabber stream, or directly over a socket, with very little difference in code.

Theoretically, you get the benefits of a native OO representation and a native XML representation at the price of writing a description of a translation layer, which at least in theory should be fairly redundant between various mappings, and thus I should be able to provide significant chunks of the transform as library code that can be pieced together into what you need. (For instance, the aforementioned Container behavior is very popular, and easy to abstract into a component, even though that generalized component is moderately complex.)
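To make the idea of a reusable Container component concrete, here is a rough sketch in which a single description drives both directions. The class name, its constructor arguments, and the from_xml/to_xml methods are all assumptions for illustration; this is not XMLMapping's actual interface, and it leans on ElementTree instead of SAX to stay short.

```python
import xml.etree.ElementTree as ET

class Container:
    """Hypothetical mapping component: one declaration, two directions."""

    def __init__(self, outer_tag, item_tag):
        self.outer_tag = outer_tag
        self.item_tag = item_tag

    def from_xml(self, element):
        """Read the text of each item element into a plain Python list."""
        if element.tag != self.outer_tag:
            raise ValueError("expected <%s>, got <%s>"
                             % (self.outer_tag, element.tag))
        return [child.text or "" for child in element.findall(self.item_tag)]

    def to_xml(self, items):
        """Write a list back out as outer_tag wrapping one item_tag per entry."""
        outer = ET.Element(self.outer_tag)
        for item in items:
            ET.SubElement(outer, self.item_tag).text = item
        return outer

# The mapping is declared once and reused for reading and writing.
bullet_list = Container("ul", "li")
items = bullet_list.from_xml(ET.fromstring("<ul><li>a</li><li>b</li></ul>"))
xml_text = ET.tostring(bullet_list.to_xml(items), encoding="unicode")
```

The point of the sketch is only that the tag names are stated once; a real component would also have to handle nested mappings and non-string items.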

Another way of looking at this is that it is a library for creating modular and extensible SAX-based XML parsers and a generalized XML output system, all at once.

Practice: The Failure

In practice, there are two major problems:

  1. Even the simplest transforms require a lot of information. It's "declarative" information, not programming per se, but it's still a lot of typing in Python.
  2. Getting the library code correct is challenging.

The first can (and I believe should) be solved by creating a "mini-language" that allows the use of non-Python syntax to concisely and efficiently express the transform. The second is only solvable by continuing to pound on the problem. ;-)

The second one has been the killer. Basically, you need to have some sort of concept of "piece of XML" that maps to "piece of an object". Perhaps surprisingly, it is the first part that has been giving me fits. A "piece of an object" can be relatively easily specified by a function or pair of functions, so if I want to, for example, map the "header" member of some object to the XML, I can specify the 'header' member quite simply as:

lambda obj: obj.header

and for convenience it is trivial to define well-named helper functions that make this look ok:

def member(memberName):
    return lambda o: getattr(o, memberName)
member('header')
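The "pair of functions" variant might look something like the sketch below, where one function reads the member (used when writing XML) and the other sets it (used when reading XML). The name member_pair and the Document class are hypothetical, introduced only for this example.

```python
def member_pair(member_name):
    """Return a (getter, setter) pair for the named attribute."""
    def get(obj):
        return getattr(obj, member_name)

    def set_(obj, value):
        setattr(obj, member_name, value)

    return get, set_

class Document:
    """Hypothetical stand-in for some object with a 'header' member."""
    def __init__(self, header=None):
        self.header = header

get_header, set_header = member_pair("header")
doc = Document("Draft")
set_header(doc, "Final")        # direction used when reading XML in
current = get_header(doc)       # direction used when writing XML out
```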

It is defining a "piece of XML" that has been hard. This "piece of XML" concept needs to work in two contexts:

  1. When XML is being read in, the XMLMapping interpreter needs to process each "piece of XML" in sequence.
  2. When XML is being generated, each mapping element outputs a "piece of XML".

The tension between these two uses is what is driving me nuts. For the second concern, the answer is obvious: just output chunks of XML as needed, precisely as any conventional XML output system does. The downside is that such chunks are not useful on the reading side; as one small example, </integer> in isolation is useless. To understand that chunk of XML, you need the whole thing: <integer>54</integer> still needs some context to be understood, but it is certainly plausible that the Python expression 54 adequately captures the meaning of that chunk.
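A small sketch of why the reading side is awkward, using the standard xml.sax module: the parser hands over <integer>54</integer> as separate start, text, and end events, and only after seeing all of them can the value 54 be produced. The EventRecorder class is made up for this illustration.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Record each SAX callback so the fragmentation is visible."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # A parser may deliver text in more than one call.
        self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

recorder = EventRecorder()
xml.sax.parseString(b"<integer>54</integer>", recorder)

# Only once the whole element has been seen can it become a Python value.
text = "".join(content for kind, content in recorder.events if kind == "text")
value = int(text)
```

On the output side, by contrast, each of those three pieces can be written independently with no knowledge of the others.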

(I know there are some evident solutions to the problem as described above, but remember that the preceding paragraph is in some sense a transliteration of a problem in the code; an English solution to the English formulation of the problem is not helpful. All of the simple English solutions I've come up with don't translate into code, since they all require human discretion to work correctly.)

I initially started by deciding that a "tag" is the root unit of XML, but I have encountered several difficulties. First, consider the OPML format: the body has only one tag name, outline, but we almost always want to switch based on the nodetype, which is actually an attribute. So the tag name alone isn't sufficient; we need to be able to match on the whole tag, attributes included.
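One way to picture matching on the whole tag is a predicate built from a tag name plus required attribute values. The sketch below is an assumption about how such a matcher could look, not code from the library; it uses ElementTree for brevity, and the sample document assumes the node type lives in an attribute named type.

```python
import xml.etree.ElementTree as ET

def matches(tag, **required_attrs):
    """Build a predicate that accepts an element by tag name and attributes."""
    def predicate(element):
        return element.tag == tag and all(
            element.get(key) == value
            for key, value in required_attrs.items()
        )
    return predicate

# Hypothetical matchers: same tag name, different attribute requirements.
is_link = matches("outline", type="link")
is_any_outline = matches("outline")

body = ET.fromstring(
    "<body>"
    '<outline text="Plain node"/>'
    '<outline text="A link" type="link" url="http://example.com/"/>'
    "</body>"
)
link_texts = [el.get("text") for el in body if is_link(el)]
outline_count = len([el for el in body if is_any_outline(el)])
```

Switching on tag contents, as opposed to attributes, would need a predicate that can look at child text as well, which a start-tag event alone does not provide.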

One can equally well imagine wanting to switch off of contents of a tag, too.

I've also encountered a wide variety of other potential issues as I try to use this library. For instance, I found myself wanting to have a "null wrapper"; an Iron Lute document is currently just a stream of nodes and links, so to bundle it into an XML document I just need to wrap them in a root element. The code as written couldn't handle a "do nothing" element; it wanted to parse and understand every single tag. Coding around this required several lines of icky, fiddly code.

One can imagine other circumstances where you might want to ignore entire chunks of a document. For instance, someday one might want to specify something by an XPath expression and parse just that chunk of XML.

I'm still optimistic I can make this work usably, and if I do I still think it will be quite impressive, and all the more satisfying for the challenge it was to create. I can see why previous attempts haven't been as ambitious though ;-)

Well, you know what they say: you learn a lot more from failure than success.

In development news, I'm getting close to at least a prototype of the Iron Lute file output format being usable. Some other touch-ups on the XML mapping code here need to be made, which fortunately will carry over directly into any future version of the library anyhow, so that's not time lost. After that we get the GUI going, then add just a couple more pieces of developer support that I don't want to release without, and I'll have myself an alpha release.

Unfortunately, due to unavoidable life issues development may be slowing down on Iron Lute for a long period of time; my development time needs to go elsewhere. I'll work what I can in, but it's no longer my highest development priority. The new #1 priority is much less fun, but, well, life doesn't always go as we plan.