What is an Outline, Part 4

2004-02-10

In my last post, I dissociated the concept of "outlineness" from a graph, and showed at least the skeleton of a data structure that allows the power of graphs while preserving the nature of an outline.

In this post, we will fix a flaw in the model built up to that point, which is that there is no way to obtain a list of parents, given a node, only a list of children. For various reasons, this is necessary to building an outliner, so this flaw must be fixed.

Terminology note: A "link" connects one node to another. In a graph, it's an "edge", but in my mind, it behaves more like a link, so that's the name it gets. Right now, we can only follow links from parents to children; we want to be able to follow them back, too.

Again the consequences are subtle, so I'll spread the Pros and Cons of various decisions out with the discussion. First:

Issue: How will we make it possible to follow a node's parent links while sacrificing as little as possible?

Once again, we will start out with...

The Obvious Solution: Record Incoming Links In Nodes

The obvious solution is to simply add a list of nodes that have incoming links to a given node, and update it as it changes.

One obvious Pro is that it is simple. And for once, the obvious simple solution has the virtue of not being completely wrong. However, it does have two Cons that are worth thinking about to see if we can do something about them:

Con: Suppose you have a node that has the same node twice as a child: an A node with two children, both the same B. What does the incoming list look like? Well, if you have A's child list look like this: [B, B], and B's incoming link list look like this: [A, A], you have a problem: You can't disambiguate which link is which, since both are A->B. This causes some uncomfortable problems in code that tends to assume the links are distinct, such as link removal code. (For instance, if I have a child list of [B, C, B], and I am instructed to remove the link to B, which one is it? It matters.)
One viable solution is simply to ban this, since it's hard to think of a situation where this could be a good thing. It is often the case that these sorts of edge cases are best handled by banning them. But let's consider the whole problem before committing to a solution...
Con: Done naively, this now means that whenever somebody's child list is manipulated, they must also manipulate the parent list for the appropriate node. Since this is going to be the same sort of thing every time we do it, it'll be best to centralize it.
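
To make both Cons concrete, here is a minimal sketch of the naive approach. The Node class and the add_child and remove_child helpers are hypothetical names invented for this illustration; they are not Iron Lute code.

```python
class Node:
    """Hypothetical naive node: plain lists of children and parents."""

    def __init__(self, name):
        self.name = name
        self.children = []  # outgoing: nodes this node points to
        self.parents = []   # incoming: nodes that point to this node

    def __repr__(self):
        return self.name


def add_child(parent, child):
    # Second Con: both lists must be updated together, every single time.
    parent.children.append(child)
    child.parents.append(parent)


def remove_child(parent, child):
    # First Con: list.remove() deletes the FIRST match, so the caller
    # cannot say which of several identical A->B links should go.
    parent.children.remove(child)
    child.parents.remove(parent)


a, b, c = Node("A"), Node("B"), Node("C")
for child in (b, c, b):
    add_child(a, child)

print(a.children)  # [B, C, B]
print(b.parents)   # [A, A]

remove_child(a, b)  # "remove the link to B"... but which one?
print(a.children)   # [C, B]
```

The last line shows the trouble: the first B was removed whether or not that was the one the user meant, and nothing in the two lists lets us say otherwise.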

Technical Python note: There is a third problem, which is that strong bi-directional links weaken the garbage collector; it can handle them, but it's not as good as ref-count garbage collection. We'd prefer to do away with the strong child->parent references, but forcing everyone to maintain the weak links themselves adds even more complexity (and potential errors by programmers who really shouldn't need to understand weak links to use this data structure correctly; this should be pushed under the hood).
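
As a rough sketch of what "weak child->parent references" could look like, the following uses the standard weakref module. The Node class, its add_child method, and the _parent_refs attribute are assumptions made for the example, not the actual implementation.

```python
import gc
import weakref


class Node:
    """Hypothetical node: strong links to children, weak links to parents."""

    def __init__(self, name):
        self.name = name
        self.children = []      # strong references keep children alive
        self._parent_refs = []  # weak references do not keep parents alive

    def add_child(self, child):
        self.children.append(child)
        child._parent_refs.append(weakref.ref(self))

    @property
    def parents(self):
        # Dereference each weak reference and skip parents that are gone.
        return [p for p in (ref() for ref in self._parent_refs) if p is not None]


a, b = Node("A"), Node("B")
a.add_child(b)
print([p.name for p in b.parents])  # ['A']

del a         # drop the only strong reference to A
gc.collect()  # only matters on interpreters without reference counting
print(b.parents)  # []
```

Note how the caller never touches a weak reference directly; the parents property hides them, which is the "under the hood" part.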

We need to abstract this handling into some kind of easy functionality, preferably transparently integrated as appropriate (so ideally, nobody has to deal with this other than me). We could just create some functions, but we actually do benefit from creating a Link class, because it turns out to be really nice to refer to instances of the Link class ("links").

"Links" As A First-Order Concept

A "Link" can be thought of as a staple, where the nodes are little pieces of paper. Each link can be attached to two nodes, a "source" and a "destination", where the source is the parent and the destination is the child. (It is not technically correct to refer to a Link's "parent" or "child", because a link does not have either of those things. It has a source and a destination, so those are the terms I use. A link defines a parent-child relationship; it does not have one itself.)

We can actually instantiate a link without knowing either the source or the destination right away. You can envision that as a staple that hasn't been used yet, so both ends are unbent. As we connect it to nodes, we wrap the ends around the nodes. Of course, since this is a software staple, we can also "unbend" the links and reconnect them to different nodes without penalty. In my implementation, if A and B are nodes that we're going to connect like A->B, you can do the following:

l = Link()
l.source = A
l.dest = B

(As a shortcut, you could also do l = Link(A, B), but I wanted to emphasize how you could manipulate the ends of the link independently.)

As a practical matter, the link only actually connects when you give it both a valid source and destination, so technically, in the code example, nothing happens to either A or B until the last line executes. It will also disconnect from both nodes immediately if you invalidate one of the ends. (In Python terms, you rebind either source or dest to None, although source and dest are actually properties so that's not quite the right terminology either... you can del them too if you prefer.)
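
The connect-only-when-complete behavior can be sketched with ordinary Python properties. This is a simplified guess at the shape of such a class, not the real one: the _rebind helper is invented, and the outgoing list lives directly on a hypothetical Node rather than on a data source.

```python
class Node:
    """Hypothetical node holding the link objects that touch it."""

    def __init__(self, name):
        self.name = name
        self.outgoing = []  # links for which this node is the source
        self.incoming = []  # links for which this node is the destination


class Link:
    """Sketch of a link that is attached only while both ends are set."""

    def __init__(self, source=None, dest=None):
        self._source = None
        self._dest = None
        self.source = source
        self.dest = dest

    def _rebind(self, attr, node):
        # Detach from the old pair of nodes if the link was complete...
        if self._source is not None and self._dest is not None:
            self._source.outgoing.remove(self)
            self._dest.incoming.remove(self)
        setattr(self, attr, node)
        # ...and attach to the new pair once both ends are valid.
        if self._source is not None and self._dest is not None:
            self._source.outgoing.append(self)
            self._dest.incoming.append(self)

    @property
    def source(self):
        return self._source

    @source.setter
    def source(self, node):
        self._rebind("_source", node)

    @source.deleter
    def source(self):
        self._rebind("_source", None)

    @property
    def dest(self):
        return self._dest

    @dest.setter
    def dest(self, node):
        self._rebind("_dest", node)

    @dest.deleter
    def dest(self):
        self._rebind("_dest", None)


a, b = Node("A"), Node("B")
link = Link()
link.source = a
print(len(a.outgoing))  # 0, because the link has no destination yet
link.dest = b
print(a.outgoing[0] is link, b.incoming[0] is link)  # True True
del link.source
print(a.outgoing, b.incoming)  # [] []
```

One simplification to be aware of: re-attaching always appends, so this sketch does not preserve a link's position among its siblings when an end is rebound.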

We bundle into the Link class the intelligence necessary to update the child and parent lists of the node appropriately. Instead of links to the nodes, we actually store the relevant link objects. Of course, we still present easy access to the programmer by following the link as appropriate, so if they write code with A[0], they still get B without having to manually write A.data.outgoing[0].dest. (data is the data source for A; remember that? outgoing is the list of outgoing links. [0] gets the actual link, and finally, dest returns the actual node the link points to.)
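
Here is a small sketch of that convenience access. The attribute names data, outgoing, and dest come from the description above; the class bodies and the connect helper are assumptions standing in for the real classes.

```python
class Link:
    """Minimal stand-in for the link object: just two ends."""

    def __init__(self, source, dest):
        self.source = source
        self.dest = dest


class DataSource:
    """Hypothetical data source: owns the ordered list of outgoing links."""

    def __init__(self):
        self.outgoing = []


class Node:
    def __init__(self, name):
        self.name = name
        self.data = DataSource()
        self.incoming = []

    def __getitem__(self, index):
        # Follow the stored link so callers see child nodes, not links.
        return self.data.outgoing[index].dest


def connect(source, dest):
    # Stand-in for what the real Link class does once both ends are set.
    link = Link(source, dest)
    source.data.outgoing.append(link)
    dest.incoming.append(link)
    return link


a, b = Node("A"), Node("B")
connect(a, b)
print(a[0] is b)                        # True
print(a[0] is a.data.outgoing[0].dest)  # True
```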

This is a minor OO violation, in that there are data structures in the Node class (the incoming list of links) and the DataSource class (the children list) that the Link class is responsible for handling. Factoring this out isn't worth the pain right now, although it will be later; in the current-as-of-this-writing version of this data structure, there actually is a special IncomingLink data structure that mediates between the link and the node for the incoming links.

While I have no use for it at the moment, we can actually now load other data onto the link as appropriate. Generally, in documents, this is not useful, but there are some data structures that this may be useful with. We're not paying anything for this, or letting it affect the design, so it's not a YAGNI violation.

I think there may be later uses for this link as well, but as I've not thought through this part of the design yet I can't say much about it. Luckily, it's a cleanly separated part of the data model, so I can replace how links work later and it will have minimal (ideally, no) impact on the rest of Iron Lute.

(For instance, I visualize a web template system that tracks links made to other internal documents and thus tracks which parts of which documents are dependent on others, re-rendering things only when necessary. If we replaced all Links in such documents with Links that also reported to the central repository when they link to external documents, then we could easily implement this repository with very little code, and the Iron Lute GUI would need no changes at all.)

Interconnecting Outlines

Up to this point, I have gone quite far in resolving the "What Is An Outline?" question in the conventional case. It's an outline tree structure overlaid on top of a graph structure. In a conventional outliner, the underlying data structure is never allowed to be anything but a tree, which makes it even easier. If we only opened one document at a time and never allowed them to interlink directly, we'd effectively be done now.

But earlier, I mentioned that one of the fundamental design goals is to have multiple documents truly interacting in the outliner. When I'm in one document and I choose "Save", I need to save only that document, not all the nodes I have in memory.

Terminology and Assumptions

For our representational convenience, let us define a node T that is a "transcluder"; it has a data source that loads an outline from the web. The document it loads will use numbers for the nodes, as in: 1->2->3, which allows us to easily distinguish those other nodes. The number document may also contain a #T, which transcludes back to the original document.

So imagine that we have two documents, A->B->T and 1->2->#T. When we load the first, we only display A. When we expand B and then T, we then load the other document. Finally, when we expand #T, we get the original document transcluded, so the display now shows A->B->T->1->2->#T->A. This has a cycle if we have both a T and a #T node.
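
A toy sketch of that expansion follows. The DOCUMENTS and TRANSCLUDERS tables and the expand function are made up for this example, and they stand in for documents that would really be fetched on demand. The max_nodes limit is there because the cycle means a naive "expand everything" would never finish.

```python
# Hypothetical in-memory stand-ins for documents loaded on demand.
DOCUMENTS = {
    "letters": ["A", "B", "T"],   # T transcludes the number document
    "numbers": ["1", "2", "#T"],  # #T transcludes the original document
}
TRANSCLUDERS = {"T": "numbers", "#T": "letters"}


def expand(document, max_nodes):
    """Walk the chain of nodes, loading transcluded documents as needed."""
    path = []
    while len(path) < max_nodes:
        for name in DOCUMENTS[document]:
            path.append(name)
            if len(path) == max_nodes:
                return path
        # The last node of each toy document is a transcluder.
        document = TRANSCLUDERS[path[-1]]
    return path


print("->".join(expand("letters", 7)))  # A->B->T->1->2->#T->A
```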

For the moment, let's suppose we have write access to both the documents and that Iron Lute knows how to save to even the remote document, so we have total freedom to manipulate both documents. (How this is accomplished cleanly internally is the subject of a later posting.)

What is an Outline Document?

It's time to ask ourselves what it means to have an "outline document", with the emphasis placed on the word "document". Why do we care about what constitutes a "document"? Well, abstractly, we're just dealing with a seething mass of nodes and links, existing with no particular relation to each other beyond their links. This is the traditional conception of a 'graph'. However, this program exists in the real world, where users like to do things like hit the "Save" button and close the program, expecting that their data has been saved.

In a strict tree outline, it's easy to declare the root node of the tree as corresponding to the "document". Saving a representation of that particular node, along with all of its children, will result in the entire document being saved; starting from any other node is guaranteed to miss something.

For our purposes, the first attribute is worth saving, while the second attribute is to be discarded. There must be a path to the root node, but it may not be unique and it may require careful choice of which parent links are followed. We can specify that from a "root node" of a document, all children that we consider to "belong to" that document should be reachable by following the children. In a full graph, therefore, there may be multiple nodes that are a candidate for being the "document".

We're going to use what document a node is in all over the place in Iron Lute. For instance, if the node is part of an HTML document, we will present the user with HTML-based commands for manipulating the text, whereas if we're in a LaTeX document, we'll present LaTeX-based commands. Given a node, we need to be able to determine a unique document that that node belongs to for these purposes. Also, for cases like OPML transclusion, we need to be able to distinguish between a "link node" and its "transcluded document"; since the link node will belong to the original document, but the children will belong to the newly transcluded document, we know not to insert the transcluded nodes into a saved representation of the original document.

At the moment, the data we have to work with is:

There exist some nodes that are defined to be the Document nodes. For instance, if we load in an OPML file, the "root node" of the OPML is defined as the Document for that OPML file.
Each node has a list of incoming links.
Each node has a list of outgoing links.

And let me give a hypothetical that seems to largely capture the spirit of the problems we need to address:

Suppose you have three documents, A->B->C, D->E->F, and G->H->I. The first is completely out of your control and is dynamically changing. You load them all into your outliner, and you copy C and paste it as a link under both F and I. C now has three incoming links, one from each document. Now, suppose the remote user deletes C from the original document. What should happen?

You have three basic choices:

Delete the node from all documents. (Or "invalidate the link", which has much the same effect for any reasonable definition of "invalidate" I can think of.)
Leave the C as one node for both remaining documents (i.e., a change to one continues to be a change to the other), and pick a new "home document" for the node.
Allow the original node to be destroyed, and put two new copies of C into the two remaining documents.

It is not obvious what to do from here; there are pros and cons to each. However, one thing is clear: In the general case, only the first one is practical. To consider an analogous case, consider linking to a remote web page on the web. If a link is broken, it is broken, and you can do nothing to fix the brokenness on the target end, since you only control the source end.

If you control all the documents, you may be able to do something more clever. This says to me that there may be room to change which behavior is used depending on circumstances, but that we must be prepared to implement and handle the first case if necessary. Sometimes we may not even be aware the link is broken until transclusion time. Therefore, that is what we will start with, adding other behaviors later as they are called for in the UI.

This is sad, in that we'd like to maintain integrity of the node structure, but if we learn anything from the World Wide Web and Xanadu, the gyrations necessary to try to do so are not worth it in the general case.

With this discussion in mind, let's get more detailed about what it means to determine what document a node belongs to.

What Document Is A Node In?

Suppose I give you a node, which as described above has some incoming links, some outgoing links, and some data. How can you tell me what document this belongs to?

I'd like to try to do this without extra information, to save memory and to prevent situations where the extra information gets out of sync with the rest of the structure of the outline.

Can you do it?

Almost. The reason why not is so subtle it took me a couple of weeks to notice it.

"Primary" Links?

An equivalent definition of a directed tree (given that the graph is fully connected) is that each node has precisely one parent, except one root node that has 0. Any (fully-connected) graph that meets this constraint is a tree; the proof is trivial enough to be assigned as a very simple homework problem in a graph theory course. (It is sometimes even used as the definition; there are several different definitions of "tree" that look different on paper but are equivalent mathematically.)

We have an ordered list of incoming links; we can give the first one special prominence and call it the "primary link". It is also easy to show that if you enforce a couple of simple restrictions, you can always reach a Document node by following the primary parent. Thus, we can define a node's document as "The document I find if I follow the primary links until I reach the first document."
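
That definition is short enough to sketch directly. To keep it small, this version stores parent nodes rather than link objects, and the names Node, is_document, add_link, and document_of are all invented for the illustration.

```python
class Node:
    """Hypothetical node with an ordered list of incoming parents."""

    def __init__(self, name, is_document=False):
        self.name = name
        self.is_document = is_document
        self.parents = []  # parents[0] is reached through the "primary link"


def add_link(parent, child):
    child.parents.append(parent)


def document_of(node):
    """Follow primary links upward until a Document node is reached."""
    seen = set()
    while not node.is_document:
        if id(node) in seen or not node.parents:
            return None  # a cycle or an orphan: the restrictions were violated
        seen.add(id(node))
        node = node.parents[0]
    return node


a, d = Node("A", is_document=True), Node("D", is_document=True)
b, c, e, f = Node("B"), Node("C"), Node("E"), Node("F")
# Two documents, A->B->C and D->E->F, plus a second link F->C.
for parent, child in [(a, b), (b, c), (d, e), (e, f), (f, c)]:
    add_link(parent, child)

print(document_of(c).name)  # A
print(document_of(f).name)  # D
```

C is reachable from both documents, but its first incoming link comes from B, so it counts as part of A.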

Conceptually, we are projecting a tree on top of our graph. We end up with a tree, as defined by the first incoming link for a given node, that is extendable into a graph by adding more non-primary links.

This answers the questions posed above about what happens if a node is deleted from a document when there's another document that has a link to it; the "primary link" is removed, and the next link in line automatically becomes the new "primary link" with no fuss.
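
Running the earlier three-document hypothetical through this rule looks roughly like the following. As before, this is a sketch with invented names (Node, is_document, document_of), storing parents directly instead of link objects.

```python
class Node:
    def __init__(self, name, is_document=False):
        self.name = name
        self.is_document = is_document
        self.parents = []  # ordered; parents[0] is the primary parent


def document_of(node):
    # Follow primary parents until we arrive at a Document node.
    while not node.is_document:
        node = node.parents[0]
    return node


a, d, g = (Node(n, is_document=True) for n in "ADG")
b, c, e, f, h, i = (Node(n) for n in "BCEFHI")
# A->B->C, D->E->F, G->H->I, then C pasted as a link under F and under I.
for parent, child in [(a, b), (b, c), (d, e), (e, f),
                      (g, h), (h, i), (f, c), (i, c)]:
    child.parents.append(parent)

print(document_of(c).name)  # A
c.parents.remove(b)         # the remote user deletes C from A->B->C
print(document_of(c).name)  # D
```

The promotion falls out of the list order: once B is gone, F is simply first in line.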

This is very simple, and you can actually get quite far with it. It is conceptually well-defined and you don't seem to hit any problems with ambiguities; the moving of a node from one document into another can be handled with a bit of work, mostly just notifying the new parent that it has been changed.

Unfortunately, as appealingly simple (in execution, if not in concept if you're not used to thinking in graphs) as this is, it does not work.

Conceptual Impurity

There are two equally valid ways of expressing why this doesn't work, one couched in theoretical terms and one in purely practical terms.

In theoretical terms, this violates the Conceptual Integrity of the design. We now have two distinct "outline" overlays on top of our graph structure, the "Link Handles" and the "Primary Link". Each of these overlays behaves very differently and has a different purpose in life. This is a bad thing.

Practically, this results in a lot of problems that stem from the fact that we are highly dependent on both the order and the existence of the incoming links. This also reveals a theoretical flaw: our dependence on knowing all of the incoming links, even though we can't necessarily know them.

When we remove a primary link, we need to know what the next link is in order to know what will be promoted. In order to maintain behavior between saves, we need to save all of the incoming links, in order; if you miss even one link, that might have been the one that should have been promoted. Since we can't even assume we know what all of them are, this is impractical.

A lot of practical problems arise as a result. We can't save documents. We can't load documents. We need the entirety of the Outline Web accessible to determine what will happen if we delete a primary link. We need access to the foreign documents to move copies of the node into them if they are the next "primary link".

Basically, the "primary link" abstraction only works as long as the outline nodes are all in memory, and they can all be saved and loaded like a memory snapshot, not like files. This is an unacceptable limitation.

The solution is for nodes to actually keep track of what document they belong to, and for the nodes to keep track of what incoming links are coming in on a document-by-document basis. (Actually, it suffices to track "from my document" vs. "not from my document", but it's convenient and cheap to partition them based on document; that way, if the node changes documents (cut & paste), we don't need to recompute anything.)
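
A minimal sketch of that bookkeeping, under a few assumptions of mine: a Document node is its own document, parents are stored directly rather than as link objects, and the method names are hypothetical. Each per-document group is a set, which also reflects that incoming links carry no order.

```python
from collections import defaultdict


class Node:
    """Hypothetical node that records its document explicitly."""

    def __init__(self, name, document=None):
        self.name = name
        # Assumption: a Document node is its own document.
        self.document = document if document is not None else self
        # Incoming parents grouped by the parent's document, unordered.
        self.incoming = defaultdict(set)

    def add_parent(self, parent):
        self.incoming[parent.document].add(parent)

    def remove_parent(self, parent):
        group = self.incoming[parent.document]
        group.discard(parent)
        if not group:
            del self.incoming[parent.document]

    def local_parents(self):
        return set(self.incoming.get(self.document, ()))

    def foreign_parents(self):
        return {p for doc, group in self.incoming.items()
                if doc is not self.document for p in group}


a, d = Node("A"), Node("D")         # two Document nodes
b, c = Node("B", a), Node("C", a)   # B and C live in document A
f = Node("F", d)                    # F lives in document D
c.add_parent(b)
c.add_parent(f)

print(c.document.name)                              # A
print(sorted(p.name for p in c.local_parents()))    # ['B']
print(sorted(p.name for p in c.foreign_parents()))  # ['F']

c.document = d  # cut & paste C into document D; nothing is recomputed
print(sorted(p.name for p in c.local_parents()))    # ['F']
```

Because the groups are keyed by document, moving C only changes which group counts as "mine". The sketch does not handle a parent that itself changes documents; that would need its entry moved to another group.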

Interestingly, just as much as we need to keep track of the order of the outgoing links, we do not want to give order to the incoming links.

The End - For Now

We've now built an outline data structure that, with one small penalty (remote links may be broken, which is something we can't do much about anyhow), allows the flexibility of graphs with the ease-of-programming of outlines. On top of this structure, we can build an outliner that will be robust and flexible.

Next, I will explore some misc. practical issues brought up by this discussion.