Please Stop with the HTML Replacements

Markdown.

BBCode.

Uncountably many Wiki markups, all different.

reStructuredText.

Textile.

Pottymouth.

Who knows what I'm missing.

Please stop.

I've limited that list to things with HTML as a strong primary output; I could keep on going if I dropped that restriction, but my point is really about HTML.

Let's just take it as given that each and every one of those technologies works and is easier than HTML. Here's my question: Is the total sum of them easier than HTML?

Hell no!

I think it's time to rethink the whole "avoiding HTML" thing. Decent HTML normalization is clearly possible, even under the harshest of circumstances. It may not be easy, but with even half of the programming time that went into the aforementioned list, it would be a rock-solid, air-tight, cross-platform C library by now.

In terms of formatting, those languages still have most of the problems of HTML. You allow people to write arbitrary links? You still need to validate them. You still have the problem of dangling delimiters. It looks like you've accomplished something by starting in a language without actual HTML tokens, but all you've done is re-name them. You still have <i>, even if you call it __ or /. You still have </b>, even if you call it *. Sure, you've got no HTML tag attributes to deal with, but that's just one part of the problem, and not really that hard to deal with.
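To make the link-validation point concrete, here is a minimal sketch in Python of the check that any of these markups needs just as much as HTML does. The name is_safe_link and the contents of ALLOWED_SCHEMES are assumptions made for illustration, not part of any particular library, and the rule is deliberately conservative: anything that looks like a scheme and is not on the list gets rejected.

```python
import re

# Hypothetical allowlist; a real site would pick its own.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

def is_safe_link(url: str) -> bool:
    """Return True if url is relative or uses an allowlisted scheme."""
    # Browsers tend to ignore whitespace and control characters around
    # and inside a scheme, so drop them before looking at it.
    cleaned = "".join(ch for ch in url if ch > " " and ch != "\x7f")
    # Only text before the first '/', '?' or '#' can contain a scheme.
    head = re.split(r"[/?#]", cleaned, maxsplit=1)[0]
    if ":" not in head:
        return True  # no scheme at all: a relative link
    scheme = head.split(":", 1)[0].lower()
    return scheme in ALLOWED_SCHEMES

print(is_safe_link("http://example.com/"))    # True
print(is_safe_link("javascript:alert(1)"))    # False
print(is_safe_link(" JaVa\tScRiPt:alert(1)")) # False
```

Note that this check is the same amount of work whether the link arrived as an href attribute or inside some bracket syntax of your own.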

You can still screw up the security... I know, I've "exploited" some earlier versions of y'all. (Snuck a "javascript:" link past a couple of them. Was once able to reconstruct a <script> tag due to poor string management, although that one was really buggy at that time and I'm not sure it was ever really popular.)

Further, quite a lot of you have your own stupid quirks, like this or this. My point is not that these bugs are unfixable, as they are currently "fixed" (though I wouldn't care to bet that such fixes don't introduce other problems), but that for every quirk of your own that you introduce, your utility over HTML goes down. This was especially bad on the Programming Reddit for a long time, where very smart geeks often want to throw code snippets at each other, and for a long while there was no way to do so. Even the ultimate solution ("prepend every line with four spaces") is completely unintuitive and generally has to be individually explained to each person who tries (and initially fails) to use it.

HTML may be quirky, but it long since handled adding < and > characters into the page. Sure, &lt; is weird, but at least it's only weird once. Funky formattings are perhaps complicated, but it's worth pointing out I typed that as I intended on the first try, something I could not do with many of those languages. And a normalizer can easily handle dangling italics and other such things.

(I left that <i> dangling; my normalizer handled closing it at the end of the paragraph. You don't have to let bad HTML screw up the rest of the page.)
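Closing a dangling tag takes little more than a stack. The following is a rough sketch, not the normalizer used on this site: it leans on Python's standard html.parser, and the names ParagraphBalancer, balance_paragraph and the INLINE_TAGS set are made up for the example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical set of inline tags a comment field might allow.
INLINE_TAGS = {"i", "b", "em", "strong"}

class ParagraphBalancer(HTMLParser):
    """Re-emit a paragraph with every allowed tag properly closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # stack of currently open tags

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:  # attributes are dropped in this sketch
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray close tag: ignore it
        # Close everything opened after `tag`, then `tag` itself.
        while True:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def finish(self):
        self.close()
        while self.open_tags:  # close whatever is still dangling
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)

def balance_paragraph(fragment: str) -> str:
    parser = ParagraphBalancer()
    parser.feed(fragment)
    return parser.finish()

print(balance_paragraph("I left that <i>dangling"))
# I left that <i>dangling</i>
```

Run once per paragraph, this keeps one commenter's unclosed italics from leaking into the rest of the page.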

Problems that HTML has fixed that your special HTML processors often make much harder or even impossible:

  • The biggest, baddest one of them all is escaping, frequently resulting in incomplete or underspecified ways of adding the new "special" characters. HTML's &lt; may be a bit graceless (and what I actually had to type, &amp;lt; even more so), but graceless > impossible. This also manifests itself as certain challenging character sequences, because they get interpreted as markup. Sometimes it merely requires jumping through new and complicated hoops, sometimes there are actually character sequences that are forbidden/impossible.
  • Especially when you're just backing to HTML, HTML has a rather wide variety of interesting tags, plus images. This matters less when you're just allowing simple comments, but for more complex uses it's easy to miss some very useful cases. Did you allow all of those? How much work was it? All I had to do was list the tag and allowed attributes.
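The escaping point from the first item can be shown in a few lines. This sketch uses only html.escape and html.unescape from Python's standard library; the helper name to_html_text is invented for the example.

```python
from html import escape, unescape

def to_html_text(literal: str) -> str:
    """Return HTML source that displays `literal` exactly as written."""
    return escape(literal, quote=True)

once = to_html_text("<")    # what you type to show a "<"
twice = to_html_text(once)  # what you type to show the text "&lt;"
print(once)   # &lt;
print(twice)  # &amp;lt;

# One rule covers every character sequence, so nothing is
# forbidden: escaping always round-trips.
tricky = 'a * b __ c <i> & "quoted" </b>'
assert unescape(to_html_text(tricky)) == tricky
```

Graceless, yes, but there is exactly one rule, and applying it again is all it takes to talk about the escapes themselves.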

The key to normalizing HTML is to realize that you have to treat HTML just like you are already treating your own languages. (Or as you should be treating your own languages.) Instead of compiling your language into HTML, compile HTML into HTML. I take the broader and more useful definition of a compiler, which is anything that translates from one data format to another; "source code" to "executable" is merely one special case, but compilers in general tend to share the same patterns and there is no great value in trying to come up with a new name for a program that generates HTML from LaTeX or something. The security of the total system comes not from the inability to express "bad" things in the source code, but from the inability of the intermediate representation of the compiler to express bad things. Parse the HTML into an intermediate representation that refuses to contain unescaped <s, and refuses to contain any attributes or attribute values you deem illegal. Then, the HTML generated back out of this intermediate representation will be safe.
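Here is a small sketch of that HTML-to-HTML compiler in Python, under some loud assumptions: the ALLOWED policy, the SAFE_PREFIXES rule for href, and the names Text, Element, TreeBuilder and sanitize are all invented for illustration, void elements such as img are left out to keep it short, and the standard library's html.parser stands in for whatever off-the-shelf parser you prefer. The point to notice is that Element refuses to be constructed with anything outside the policy, so the serializer cannot emit what the tree cannot hold.

```python
from dataclasses import dataclass, field
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: each allowed tag and its allowed attributes.
ALLOWED = {"p": set(), "i": set(), "b": set(), "a": {"href"}}
SAFE_PREFIXES = ("http://", "https://", "mailto:")

def attr_ok(tag: str, name: str, value: str) -> bool:
    if name not in ALLOWED.get(tag, set()):
        return False
    if name == "href":
        return value.lower().startswith(SAFE_PREFIXES)
    return True

@dataclass
class Text:
    value: str
    def render(self) -> str:
        return escape(self.value)  # text never carries a raw "<"

@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def __post_init__(self):
        # The intermediate representation itself rejects bad input.
        if self.tag not in ALLOWED:
            raise ValueError(f"tag not allowed: {self.tag}")
        for name, value in self.attrs.items():
            if not attr_ok(self.tag, name, value):
                raise ValueError(f"attribute not allowed: {name}")

    def render(self) -> str:
        attrs = "".join(f' {k}="{escape(v, quote=True)}"'
                        for k, v in self.attrs.items())
        inner = "".join(child.render() for child in self.children)
        return f"<{self.tag}{attrs}>{inner}</{self.tag}>"

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = []   # top-level nodes
        self.stack = []  # currently open elements

    def _append(self, node):
        (self.stack[-1].children if self.stack else self.root).append(node)

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag, keep its text as plain text
        kept = {k: v or "" for k, v in attrs if attr_ok(tag, k, v or "")}
        node = Element(tag, kept)
        self._append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self._append(Text(data))

def sanitize(source: str) -> str:
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return "".join(node.render() for node in builder.root)

print(sanitize('<a href="javascript:alert(1)" onclick="x()">hi</a>'))
# <a>hi</a>
```

Extending the policy is a matter of adding one entry to ALLOWED; dangling tags close themselves because the tree always serializes balanced.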

You shouldn't try to transform bad HTML into good HTML with regular expressions, for the same reason it's a bad idea to try to transform your special text markup into HTML with regular expressions, although I have no doubt several of you work exactly that way.
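For the skeptical, here is the classic way a regex "sanitizer" goes wrong, sketched in Python. The function name is invented for the example and it is written to be naive on purpose; it is not taken from any particular product.

```python
import re

def strip_script_tags_naively(text: str) -> str:
    """Delete anything that looks like a <script> or </script> tag."""
    return re.sub(r"</?script[^>]*>", "", text, flags=re.IGNORECASE)

# Each tag is split around a decoy that the regex will remove.
payload = "<scr<script>ipt>alert(1)</scr</script>ipt>"
result = strip_script_tags_naively(payload)
print(result)  # <script>alert(1)</script>
```

Deleting the decoys glues the halves back together, and the "cleaned" output is precisely the tag the regex was meant to remove. A parser that builds a tree and escapes text on the way out has no such seam to pry open.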

HTML has the advantage that it's the natural language of the web. Even if it's a pain to learn, if you learn it once, it ought to be broadly applicable across all Wikis, weblog comment fields, forum posts, etc. Except it isn't, because everywhere I go now there's another one of you randomly re-writing what my asterisk means.

Please stop. Please take some time and just take in HTML and normalize that. You'll find it's not really that much harder than correctly implementing your HTML transform anyhow.

(By the way, if you don't know how to write a compiler and you feel tempted to try to write one of these things, you really ought to learn, because that's what this task calls for, not a string of regexes. And one of the advantages of sticking to HTML is that you get to use off-the-shelf parsers, and parsing is the hardest part of writing a text-to-text compiler. The process of converting some sort of stream into a token tree and dumping that token tree back out to a stream is one of the fundamental tools of programming.)

Please note that a critical part of my argument is the sheer mind-boggling number of these solutions. If we could all standardize on one syntax, I'd have no complaint. HTML definitely has its problems, I am well aware of that. But those problems are less than my having to learn a new markup for every second site I visit. For goodness' sake, y'all can't even agree on how to make a bit of text bold! <b></b> may be relatively complicated, but it is at least standard.

This is also just about HTML replacements; using a single representation that goes out to many things, of which HTML is just one, is a whole different story altogether.