What Every Programmer Needs To Know About Encoding

In many modern languages, encoding errors are the number one cause of security flaws in software.

This is going to be long, because if you don't have a deep understanding of what is going on, you too will write encoding-based security flaws. Given the widespread state of ignorance about this situation, including a large number of people who don't even believe there is a problem, I do not believe I can make this much shorter.

But before I can discuss any sort of solution, what exactly is the problem? Let us start with a parable.

The First Encoding Error

In the beginning was the Flat Text File, and it was good.

* Eat breakfast.
* Eat lunch.
* Flunk all my students.
* Sleep.
* Repeat tomorrow.

And there was reading and there was writing, and it was the First File Format.

And lo, the Accountant did receive word of this First File Format, and he did come to the Programmers and declare, "Behold, I require the writing of many columns of knowledge, bearing witness of the movement of gold and jewels as they flow hither and yon." And the First File Format did beget the Second File Format, the Comma Separated File.

Consolidated Consolidations, 11/22/1823, $-45.33
Limited Limits, 11/23/1823, $33.48
Microscopic Microscopes, 11/23/1823, $19.73

And there was reading and there was writing, and it was the Second File Format.

And the Accountant did record the flow of gold, and frankincense, and Michael Jackson albums, and figs. But the Accountant was wroth, for he did enter the number of three thousands and eight-score and five and forty-three centimes, and suddenly of columns there were four:

Figgy Fig Figs, 11/24/1824, $3,165.43

The Accountant's mighty software saw that transaction as $3, and the error was numbered three thousands and eight-score and two and forty-three centimes, which was many in the eyes of the Accountant.

And thus was born the First Encoding Error, and the land of the Programmers did fall into darkness and disrepute, where they remain until this day. And there is much wailing and gnashing of teeth.

Understanding The Problem

OK, so flat text and CSV weren't the first and second file formats. Dramatic license and all that.

Hopefully it's obvious what went wrong in that story. What's less obvious is that there are several ways of thinking of the problem, and that some of those ways are much better than others. Unfortunately, the simplest way of understanding it, something like "the computer misinterpreted the comma as a delimiter" is also the least enlightening.


We will start at the beginning, because as simple as the following will sound, the evidence strongly suggests that most people harbor fundamental unexamined misconceptions in this area.

In Computer Science theory, a "language" is a set of "character" sequences that are "valid" strings in that language. "Characters" are abstract entities, which can be anything.

In computer programming practice, the term "character" is overloaded to mean too many things... which turns out to be a major contributing factor to the confusion about encodings! In particular, the word can refer both to the English letter "c" and to the single C-language "char" that contains the computer representation in memory. So let's split the concept "character" into two words for the purposes of this essay: A byte is a concrete number in computer memory. A code point is an abstract thing, like the letter "c", or whatever other "character" you might come up with. The term "code point" is borrowed from Unicode so you'll have a better chance of understanding Unicode after you read this, though I don't intend to talk about Unicode otherwise. (Otherwise, it would be a dumb choice of words.)

Let's start simple and use "case-insensitive English words" as our example language. We would understand the code points in this language as the 26 letters "a" through "z". Computer science also talks about "validity" (which strings are legal in the language), but today I'm just talking about encoding, so we can ignore that.

Now, let us suppose we want to represent the word "cat" in a computer. What good is a computer that can't even store "cat"? Well, in fact, a computer can't store "cat", because a computer's memory can not store the code point "c". A computer's memory can not store such an abstract entity.

A computer can only store a very particular set of things. At the most fundamental level it can store only one of two values, which we typically call "0" and "1"; this pair of values constitutes a bit. Let us step up one level to the aforementioned "byte", a collection of 8 such bits, which we typically name with the numbers 0-255.

This special set of code points, the numbers 0-255, I will refer to as the privileged code points. It is privileged because it is the only set that can exist in real memory; quite a privilege indeed! Everything else that we will discuss is a consensual hallucination shared by programmers, as embodied in their programs.

So, we want to be able to store an English-language "c" in our computer's memory, but the computer only understands its privileged code points, which do not include "c". We need an encoding. An encoding is an invertible function that takes some list of code points from one language and maps it to a list of code points in another language.

One encoding that maps between English characters and the privileged code points is the standard ASCII encoding. The ASCII encoding is best viewed as a function mapping letters (and other things) down to numbers the computer can understand, which can later be inverted to get back to the English character.

Using my extemporaneous notation:

  • Apostrophe on a function name means "inverted".
  • Double-quotes indicate an English character code point, as distinct from the code points a byte can carry.
  • Square braces indicate a list, delimited by commas. A string "abc" is also a list of the relevant code points, i.e., ["a", "b", "c"].

we can say:

ASCII(["c"]) = [99]


ASCII'([99]) = ["c"]

This is a simple and common case, where the encoding of a list of code points is readily conceptualized in terms of a function that converts one code point at a time, as in

ASCII_single("c") = 99
ASCII_single'(99) = "c"

Not all encodings are this straightforward, but for the purposes of this post, we'll stick to such encodings, and encodings where a single code point may expand to multiple code points in the target language, but there are no inter-code-point dependencies.
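The single-code-point case is easy to check for yourself: Python's built-in ord and chr are exactly this pair of functions, restricted here to the ASCII range (the names ascii_single and ascii_single_inverted are mine, mirroring the notation above):

```python
# Python's built-in ord and chr implement exactly this pair of
# functions; restricting them to 0-127 gives ASCII_single and its
# inverse.
def ascii_single(ch):
    n = ord(ch)             # "c" -> 99
    assert n < 128, "not an ASCII code point"
    return n

def ascii_single_inverted(n):
    assert n < 128, "not an ASCII-encoded value"
    return chr(n)           # 99 -> "c"

print(ascii_single("c"))            # 99
print(ascii_single_inverted(99))    # c
```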

Here's one of the most common unexamined misconceptions: 99 is not the English character lowercase c. There are three entities in play here as shown in the equations above, and they are all distinct: 99, "c", and ASCII itself.

There are an infinite number of ways to encode English characters, in theory. There are a rather smaller number of ways to do it in practice, but still very many. ASCII is not the only one. For instance, using EBCDIC:

EBCDIC(["c"]) = [131]
EBCDIC'([131]) = ["c"]

[99, 97, 116] is the ASCII encoding of "cat". [131, 129, 163] is the EBCDIC encoding of "cat". Which is really "cat"? Neither of them. "cat" can not be represented in computer memory, only members of the privileged code point set.
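You can watch both encodings at work from Python, which happens to ship codecs for several EBCDIC code pages; cp500 is one of them:

```python
# The same abstract string, encoded two different ways.
ascii_bytes = list("cat".encode("ascii"))
ebcdic_bytes = list("cat".encode("cp500"))   # cp500 is an EBCDIC variant

print(ascii_bytes)    # [99, 97, 116]
print(ebcdic_bytes)   # [131, 129, 163]

# Both byte lists decode back to the same string; neither *is* "cat".
assert bytes(ascii_bytes).decode("ascii") == bytes(ebcdic_bytes).decode("cp500") == "cat"
```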

To drive home the idea that code points can be anything, consider the ASCII control characters. There are things you might see as "characters" in the traditional English language sense with some squinting, like HT (horizontal tab), but there are things that can only be thought of as actions like BEL (sound a bell/beep) or LF (line feed), and there's a NUL character which is really weird. Code points can truly be anything.

Aaaaaaand... that's it, actually. That's all there is to encoding. Oh, some particular encodings are a little more complicated, and may be based on something more complicated than a lookup table, but this is still the only idea. But as is so often the case in programming, we tend to take simple things and layer and combine them until they are no longer simple, so to truly understand what encoding means in practice, we must move on to non-trivial examples.

Applying Encoding in the Real World

Having carefully defined what an encoding is, we can return to our parable and now explain precisely what went wrong, without reference to vague phrases or unexamined misconceptions.

The "flat text file" is a sort of minimal file format, almost the anti-format. It has only one characteristic: what encoding the contents are in... which is unfortunately usually only implied, not stated. Guessing it is tricky and unreliable if you don't have some other way of knowing which encoding the file is using. But given the historical period the parable is set in, we can simply assume that the dominant encoding of the operating system is used, and that in this mythical Time of Yore, they didn't have to worry about having a choice.

The CSV file format has slightly more structure; I'm going to define a CSV that is almost the minimal definition of a file format that can live above plain text. My CSV defines two additional special code points that all CSV-formatted files can use.

One we can call the VALUE_SEPARATOR. The VALUE_SEPARATOR is not a "comma". It is a code point. It indicates the end of one value and the beginning of the next one, and so it has more in common with an abstract code point like HORIZONTAL_TAB than with an English character. It must be encoded somehow in order for the file to be written to disk, since disk, like memory, can't store abstract code points; it can only store the privileged code points 0-255.

The other new code point is the ROW_TERMINATOR, indicating the end of a row of values. One last time and I'll stop driving this point home: ROW_TERMINATOR is not an ASCII NEWLINE. ASCII NEWLINE is the conventional encoding, but it is not the same thing.

Let's say that we want our CSV file format to be able to include any ASCII values in the column values, which is a very reasonable thing to want. (In fact, anything else is just asking for trouble later on; those wacky users will stick pretty much anything into any field when you least expect it, to say nothing of deliberate attackers.) Given this, how do we encode our CSV files into the privileged code point set for storage or transmission?

Wrong Solution #1

The CSV file format has two code points it needs to encode, VALUE_SEPARATOR and ROW_TERMINATOR. The traditional encodings are ASCII COMMA and ASCII NEWLINE. For concreteness, I will use the Unix standard "line feed" character, encoded into the privileged code points as 10.

The obvious solution is the following (in the working pseudocode known as Python):

def writeLine(fields):
    print(",".join(fields))  # print itself supplies the "\n" row terminator

Which you can enter into your friendly local Python interpreter. If you do, and you feed it:

writeLine(["Figgy Fig Figs", "11/24/1824", "$3,165.43"])

You will see printed:

Figgy Fig Figs,11/24/1824,$3,165.43

It can't get much simpler than that, can it?

Here's the problem, expressed clearly in the terminology I've now built: both the ASCII comma as a part of the value, such as the Accountant tried to use in his number, and the VALUE_SEPARATOR in the CSV file format were mapped down to the same privileged code point, 44. Uh oh. That means this supposed "encoding" is not invertible; two distinct inputs lead to the same final output, so when the CSV parser encounters a 44, it is impossible for it to know which was meant.

Please take note of the word "impossible". It is not being used as a rhetorical device. I mean it. It is truly impossible. The information about which code point it initially was is gone. It has been destroyed. It is not correct to say that the CSV parser is "misinterpreting" the "$3,165.43" as two values. That implies that the information is present, but the CSV parser is too dumb to figure it out. The information is in fact not present; even a human can not look at "$3,165.43" and be sure that what is intended is three thousand, and not three dollars followed by some other information. A human can make a good guess, but it is still a guess, and accounting is just one domain of many where making this kind of guess is inappropriate.
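You can watch the information being destroyed in a couple of lines of Python:

```python
# Wrong Solution #1, inlined: encode three values with a bare comma.
encoded = ",".join(["Figgy Fig Figs", "11/24/1824", "$3,165.43"])

# The obvious reader splits on 44 (comma) -- and finds four values,
# because the comma in the Accountant's number and the VALUE_SEPARATOR
# were both encoded to the same privileged code point.
decoded = encoded.split(",")
print(decoded)   # ['Figgy Fig Figs', '11/24/1824', '$3', '165.43']
```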

This is one of the things I find most frustrating about using any library created by others that involves any sort of encoding. When some layer of libraries screws up the encoding process, information is commonly destroyed. It is not possible for a higher layer to correctly recover from that situation, no matter how clever you are; you might be able to avoid crashing, but you've still just hidden data destruction, which is often something that ought to be brought to someone's attention, not silently hidden.

Wrong Solution #2

The root problem in Wrong Solution #1 was trying to jam 258 code points into 256 privileged code point slots; by the pigeonhole principle we know that can't work. So, the next most easy thing to do is to carve out 2 of the privileged code points and declare that they will encode our two special CSV code points, and they are forbidden from being contained in the values themselves.

If you happen to have a copy of Excel or the Open Office spreadsheet handy, it can be instructive at this point to pull open the dialog that imports a CSV file, and poke around in the bewildering array of options you have for delimiting fields. Many people have chosen many different ASCII values to carve out, and tried different (quite silly) solutions for re-allowing those code points in values (like double-quotes around values).

This solution is less wrong in that it at least does not build ambiguity right into the specification; it's completely specified. The problem is that even with CSV, which is pretty minimal as file encodings go (two extra code points over plain text), there are no two code points you can reserve by telling your users in general, "You may not use these two code points in your values."

First, obviously, commas are pretty useful, so you're not going to want to ban those. The import dialog will show you just how many other delimiters have been tried. Tab is the most common. Go much more exotic than that, and you lose one of the major benefits of a CSV file, which is that you can profitably edit it in a plain text editor that knows nothing about CSV. Any character you can type to use as the value delimiter is also a character somebody will want to type in a value. The same concern holds for the ROW_TERMINATOR, too; banning the newline to reserve it for the ROW_TERMINATOR is pretty harsh, even in a spreadsheet.

Second, you just never know what those wacky users are going to want to do, and if they need it, you may have to deal with it. While I've never seen it, I'd lay money that somebody, somewhere has embedded binary files like images into a CSV file. Maybe it wasn't the best solution. Maybe it was just a hack. Maybe, just maybe, if I knew all the details, I'd agree that it was the best solution to the problem. But regardless of the reasons for the embedding, once you've got a binary file, all bets are off; the value can contain anything, even nulls. No matter what two code points you try to reserve from ASCII, the binaries will contain them.

Even if you think this works in your particular case, it doesn't. And even if you still think it works after that oh-so-compelling "no it doesn't", you're still better off using a correct encoding anyhow so that you won't find out the hard way that no, it didn't work in your case after all. Because doing it right isn't that much harder anyhow!

Layered Encodings

We want the full ASCII set available to us. We want the full CSV set available to us. That's 258 values, and only 256 different values for bytes. We can't ignore the fact that there are too many values, and arbitrarily cutting down the code points to fit the bytes is not practical. (And remember, CSV is just about the simplest case possible; it's easy to imagine that you might want more than 256 code points encoding something without even considering text; imagine encoding colors or polygon vertices or any number of other things.) The only option left is to virtually increase the number of code points we have to play with.

There are a number of ways to deal with this. When dealing with text, the most popular is escaping. Many variants of CSV, along with HTML, XML, and most other text-based formats, use some form of escaping.

The simplest escaping technique is to choose an escape code point (where I'm sticking to my "code point" terminology; it would normally be called an "escape character"). This is used to virtually extend our code point set by saying "When you see this escape character, switch into another set of code points", or some variant of a statement like that. (The exact meaning varies from encoding to encoding.) The traditional escape code point in ASCII is the ASCII BACKSLASH, and I will stick with that.

In this case, we're going to use the escape code point to move some of the standard ASCII code points out of the way, so our new layered encoding can unambiguously use them. We will encode the ASCII values in our file into a new encoding, ASCII_in_CSV, that we define as "the things that can be output by the following procedure Value_to_ASCII_codepoints":

  • For each ASCII character in the input, consult the following table:
    • If the character is the ASCII comma, add "\," (BACKSLASH COMMA) to the new encoded output.
    • If the character is the ASCII newline, add "\n" to the new encoded output.
    • If the character is the ASCII backslash, add "\\" to the new encoded output.
    • For any other character x, add x to the new encoded output.

It is perfectly permissible for an encoding to be defined this way, as the legal output of some procedure. It is also perfectly permissible for an encoding to borrow another encoding's code points.

Now we can layer our encodings, so that what we have in a CSV file is:

  • A series of ASCII code points, encoded down into the next layer with Value_to_ASCII_codepoints (a function from a list of ASCII code points to a list of ASCII code points),
  • embedded into CSV (with the CSV delimiters still as their code points)
  • and encoded into ASCII with CSV_to_ASCII, which is finally encoded down into the privileged code points,
  • which defines the complete encoding from the top-level, most-symbolic CSV file into pure bytes.

Or, in terms of functions (on a simpler input), we have defined a CSV_to_ASCII function, which converts a CSV file as so (using simpler data):

Value_to_ASCII_codepoints(["b", ","]) = ["b", "\", ","]
Value_to_ASCII_codepoints(["1"]) = ["1"]

which is encoded into the ASCII code points with CSV_to_ASCII:

CSV_to_ASCII([["b", "\", ","], VALUE_SEPARATOR, ["1"], ROW_TERMINATOR]) =
  ["b", "\", ",", ",", "1", NEWLINE]

which we then feed to the standard ASCII function to obtain the final encoding:

ASCII(["b", "\", ",", ",", "1", NEWLINE]) = [98, 92, 44, 44, 49, 10]

This may seem complicated, but it suffers none of the disadvantages of the previous two wrong answers. It represents all characters unambiguously and completely (at least for writing; solving the problem for reading is easy).

If this sounds complicated, bear in mind I'm really belaboring this explanation for didactic purposes. The real code isn't that much more complicated, which is why I feel I can label the other two solutions actually "wrong", not just "misguided". The code for fixing the problem is far smaller than the discussion above:

def CSV_field_to_ASCII_single(char):
    # escapes 1 ASCII code point as described above
    if char in [",", "\\"]: # note need for encoding in Python, too!
        return "\\" + char
    if char == "\n":
        return "\\n"
    return char

def CSV_field_to_ASCII(value):
    # escape one entire value
    return ''.join(map(CSV_field_to_ASCII_single, value))

def writeLine(fields):
    # the same writeLine as above, only correct
    print(','.join([CSV_field_to_ASCII(field) for field in fields]))

And if you enter that into a Python interpreter,

writeLine(["Figgy Fig Figs", "11/24/1824", "$3,165.43"])

will print out

Figgy Fig Figs,11/24/1824,$3\,165.43

Which is correct, complete, and unambiguous. And there is no longer any need for wailing and gnashing of teeth.

(In the real code, I'm using standard Python strings and lists rather than explicit lists of code points, so it doesn't exactly match my theoretical functions.)
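For completeness, here is a minimal sketch of the reading side (my own code; the name readLine is an assumption, chosen to mirror writeLine). It walks the encoded ASCII, undoing the escape sequences and splitting only on unescaped commas:

```python
def readLine(line):
    # Inverse of writeLine: un-escape values while splitting on the
    # commas that really are VALUE_SEPARATORs.
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            escaped = next(chars)            # the escaped code point
            current.append("\n" if escaped == "n" else escaped)
        elif ch == ",":                      # a real VALUE_SEPARATOR
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

print(readLine("Figgy Fig Figs,11/24/1824,$3\\,165.43"))
# ['Figgy Fig Figs', '11/24/1824', '$3,165.43']
```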

Only in the rarest of circumstances would you be justified in taking the risks entailed by using one of the wrong solutions.

Delimited Encoding

The other basic way to nest encodings is to declare in advance how long the encoded field is. For an example, see the BitTorrent distribution file format, where a string appears in a .torrent file as an ASCII number indicating the length, followed by a colon, followed by the string itself. (Note the spec fails to specify an encoding for the string!) For instance:

22:Any, chars: 3: go here

represents the string "Any, chars: 3: go here", and any correctly-working parser won't be confused by the embedded "3: go" and think that represents a new three-character string.

This encoding technique is generally a little easier for computers to read and write when you can process the entire file at once, but it's virtually impossible for humans to read or write by hand, and it does not stream as well, because it's too easy to end up with a large entity that you can't process with confidence until it has completely arrived.
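A minimal sketch of the technique, in the spirit of the .torrent string format (my own toy reader and writer, not a full bencode implementation):

```python
def write_string(s):
    # Length prefix, then a colon, then the string itself.
    return "%d:%s" % (len(s), s)

def read_string(data, pos=0):
    # Read the decimal length up to the colon, then take exactly that
    # many characters. Nothing inside the string is ever scanned for
    # delimiters, so the string may contain anything at all.
    colon = data.index(":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length

encoded = write_string("Any, chars: 3: go here")
print(encoded)   # 22:Any, chars: 3: go here
value, _ = read_string(encoded)
assert value == "Any, chars: 3: go here"
```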

If you fix the length of the field in the specification of the file format itself, then you have record-based encoding, which is a very popular thing to do when storing binary data. The traditional "dump of C data structures" saved file uses this.

Wheels Within Wheels Within Wheels...

The wonderful thing about programming is that once we do something once, we can do it again and again and again very easily.

Data "in the wild" routinely embeds multiple layers of encoding. For our final example, count the distinct encoding layers in the following snippet of HTML:

Script: <script language="javascript">
  document.getElementById("sample").innerHTML = 
    "<pre id=\"one&amp;two\">one\n&amp;\ntwo</pre>"
</script>

There are no fewer than nine distinct encoding layers in this little snippet:

  1. At the bottom, we have the particular character encoding the HTML itself is encoded in. My personal favorite is UTF-8. Since I stuck with 7-bit-ASCII-compatible characters in my example, we can say that's what it is. Below this layer lies the privileged code points only; this is what was sent over the TCP connection, which is as far down the encoding rabbit-hole as I'd like to go today.
  2. Next up, we have PCDATA, which carries the text "Script: " itself. This defines some escape sequences based on & and ;, like &lt; for "less than", so that an ASCII/UTF-8/whatever "less than" doesn't collide with the next layer, and, for the convenience of people writing HTML by hand, sequences like &eacute; (é). This is the layer that, as humans, we think of as the text, once we learn to read HTML.
  3. Tag names and attribute names are in a separate encoding, related to PCDATA but more constrained. For example, PCDATA can carry an encoded ampersand, but tag and attribute names may not carry an ampersand, encoded or otherwise. In practice, you may be better off just thinking of this as a constrained PCDATA rather than a separate encoding, but I believe it is more technically correct to view this as a separate encoding, and it's better to start with technical correctness and work your way to a practical understanding than to try to go the other way.

    I'm also going to include the whitespace separating attributes here, as there's no gain to considering it separately. This layer also includes the equals sign, and the quote and apostrophe characters for delimiting the attribute's PCDATA layer, but not the attribute value data itself.
  4. The attribute values, if properly quoted, contain another PCDATA layer. HTML has traditionally been somewhat looser about literal < and > values being encoded directly into ASCII(/UTF-8) in the attribute values than it should have been; you really should always encode them as you would any other PCDATA.
  5. Now, we get to the Javascript encoding layer, which encodes the Javascript code. This is a CDATA layer, although the claim the HTML specification makes about CDATA being "unparsed" is not 100% true, just nearly so. In reality, the HTML processor does need to parse the CDATA just enough to look for the closing script tag. Since this is CDATA and the HTML parser does not understand any encoding layers contained inside it, the HTML parser is forced to look for the literal string "</script" (case insensitive), which will close the script no matter what.

    That is, even if it looks like <script>var mystring = "</script>"</script>, the first instance of the closing tag is what the HTML parser will see, resulting in what humans may consider junk content and what the computer will consider syntactically incorrect Javascript (a statement ending with an open quote). (To have some fun with websites, especially Web 2.0 sites, if you see that they are putting your value into HTML Javascript (not XHTML Javascript), try slipping a </script> tag into your input to see what happens.)

    The Javascript encoding layer carries us up to the first quote mark.
  6. Next is the Javascript string encoding layer, which is how you describe strings that may contain arbitrary values in Javascript. The first instance contains the uninteresting string "sample". The second one is more interesting, because it contains stuff destined to be set as the .innerHTML of some HTML element, causing it to be parsed as HTML itself.
  7. PCDATA again for the internal content. (I don't think we have to re-count the base character encoding because I don't think there's any way to use a different one at this layer.)
  8. Tag names and attribute name encoding again.
  9. The PCDATA layer for the attribute values.

How Many Layers?!?

Quite a few of you are probably shaking your heads and wondering how the hell I found nine encoding layers in a simple snippet of HTML. The reason it may seem so surprising is that, by design, most of those layers are lightweight, so a human can use them. For instance, the words "one" and "two" manage to pass all the way from the innermost layer out to UTF-8 totally unscathed; that is to say, the byte string in straight UTF-8 that represents those words is the same byte string that represents those words through all nine layers of encoding.

If I wanted to be even more pedantic, I could probably find another encoding layer or two in the Javascript grammar definition - remember, if you have a different set of allowed code points, by definition you have another encoding function. It may be defined very similarly to other encoding functions, but it's still not the exact same function.

Note the newline didn't fare so well; the Javascript string layer had to add a JAVASCRIPT_STRING_NEWLINE, a.k.a. \n, because NEWLINE is already a semi-statement-separator in Javascript and can't be directly used in strings.
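You can observe that layer in isolation from Python, since JSON string literals use the same escape sequences as Javascript strings for these characters:

```python
import json

# The real newline code point must be re-encoded as the
# two-character sequence backslash-n to survive in a string literal.
inner = "one\n&\ntwo"
print(json.dumps(inner))   # "one\n&\ntwo" -- backslash-n, not a real newline
```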

I hope that by now you understand why managing encodings is so difficult. It's very easy to mismatch the PCDATA layers, or, while programmatically generating deeply-nested text, to forget one of the encoding layers. (After all, it took me a couple of tries to get the example right myself. I have a disadvantage in that I'm actually working one PCDATA layer deeper than you are viewing it at, since I'm writing the HTML by hand.) If you're lucky, you'll get a syntax error in the Javascript. If you're unlucky, when the user gives you malformed input, your page gets misparsed and you end up with a mess. Or worse.

So What?

At least five times out of ten, someone who advocates correct escaping will hear some variant of "so what?" So what if we get it wrong? So the user might not be able to enter some values, or some things will be misparsed. So what?

Well, if you're a serious programmer, the word "misparsed" ought to already be sending chills down your spine. And to that I'll add my "data destruction" argument from above.

But even that may not be enough to rattle your cage. So let me take you on a whirlwind tour of what can happen to you when you don't manage your encoding correctly.

XSS (Cross-site scripting) Vulnerabilities

XSS vulnerabilities are encoding failures. They are the same basic problem as the CSV problem discussed at length earlier: code that looks like the following:

print("<textarea>" + Querystring['Username'] + "</textarea>")

which crosses layers 1 and 2 (base UTF-8 and PCDATA) as numbered above, causing the contents of the Username parameter of the querystring to appear as raw HTML in the resulting HTML. This allows an attacker to insert arbitrary HTML into a page, including arbitrary Javascript, allowing an attacker to re-program forms to send confidential data to their own servers or a wide variety of other mischief. Consult Google about Cross-site scripting to find out more about what it can do; it can often be parlayed into full control over a website with a bit of persistence and luck, depending on the environment.
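The cure is to encode the value into the PCDATA layer before concatenating; a sketch using Python's standard html module (the textarea helper is my own, for illustration):

```python
import html

def textarea(username):
    # Encode the value into the PCDATA layer *before* embedding it.
    return "<textarea>" + html.escape(username) + "</textarea>"

# An attacker's value can no longer close the tag and start a script:
print(textarea("</textarea><script>steal()</script>"))
# <textarea>&lt;/textarea&gt;&lt;script&gt;steal()&lt;/script&gt;</textarea>
```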

SQL Injection

SQL Injections are encoding failures, the moral equivalent of:

sql.execute("SELECT * FROM table WHERE " + sql_clause)

The problem here is that you're mixing the SQL command encoding layer with the SQL string encoding layer. It's (usually) OK to allow user input in a correctly-encoded SQL string, but letting it directly into the SQL command encoding layer can, in the worst case, result in total data destruction when sql_clause is "1; DROP DATABASE your_database".
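The standard cure is to keep user data out of the command layer entirely by using placeholders, so the driver performs the string-layer encoding for you; a sketch with Python's sqlite3 module, though every DB-API driver has the same shape:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))

# The ? placeholder keeps the value down in the string layer; the
# driver encodes it, so it can never be parsed as SQL syntax.
evil = "alice' OR '1'='1"
rows = conn.execute("SELECT * FROM users WHERE name = ?", (evil,)).fetchall()
print(rows)   # [] -- the quote marks arrived as data, not as syntax
```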

Other Command Injections

SQL injection and XSS are just special cases of data making its way up to a command encoding layer. There are plenty of others. For example, the Sun Telnet vulnerability is a shell injection vulnerability, where user input is passed in such a way that a program sees it as a command line argument and not data. The shell is particularly tricky to deal with, because it has relatively complicated escaping procedures (created by accretion, by people who I believe didn't really understand what I've discussed in this essay), and the penalty for failing to encode things correctly is often arbitrary command execution through any of multiple techniques (stray semicolons, stray ampersands, backticks, and that's not a complete list) or thoroughly unintended behavior like logging in as root without a password check.
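In Python the two standard defenses are the same as everywhere else: encode the value into the shell's string layer (shlex.quote), or, better, skip the shell encoding layer entirely by handing the program an argument list. A sketch:

```python
import shlex

# A value that would be a command injection if pasted into a shell:
user_input = "file.txt; rm -rf /"

# Defense 1: encode the value into the shell's string layer.
print("ls -l " + shlex.quote(user_input))   # ls -l 'file.txt; rm -rf /'

# Defense 2 (better): never enter the shell encoding layer at all;
# an argv list goes straight to the program as data, e.g.:
#   subprocess.run(["ls", "-l", user_input])
```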

Inexpressible Character Strings Or Other Data Destruction

There is a commercial PDF library I've used at work, which is otherwise a quality piece of work, so I'd rather not name it. It uses HTML-like tags to define inline text styling. Unlike HTML, it did not seem to include a way to insert a < directly into the text, as HTML does with &lt;.

Since we needed to put arbitrary text in our PDFs, sometimes including <, we had a problem. Eventually we were able to find a workaround: there was a command for changing which character was used to start a tag. We created an escaping routine that changed every < into the command to change the start-of-tag character to something else, then a <, then the command to change it back. This worked, but it would have been cleaner if the library had shipped with an escape for <, as HTML does.

I've used a number of libraries that destroyed data like this. I've been to a number of websites where if you post a comment with <, then click "edit", then click "submit" with no changes, the content of the comment would change because somewhere there was one too many encoding/decoding calls.

Data destruction isn't necessarily a security problem, although it can be if your data destruction happens to cause subsequent errors, but it's certainly something to be avoided.

</script in HTML is an example of this; you can't directly express </script in a script in HTML. (You can in XHTML, because the script is correctly pulled up one escaping level, which results in a complete encoding.)

Other Encoding-like Things

Interestingly, other things can be viewed as encoding-type issues as well.

I think localization is best viewed as encoding computer-type symbolic messages into human languages. I really approve of the approach taken by Perl's Locale::Maketext, which I think makes this approach easy. (See also the language-independent discussion of localization attached to that module.) I think it's a mistake to view it as a translation of English (or whatever the initial language was) into other languages. It's OK to use phrases in your native language as the keys for these symbolic phrases, but doing so encourages dangerous misapprehensions, similar to the problem of thinking that ASCII is real.

A code point in this scenario would be some symbolic message that carries data about the message in itself. For instance, (FILE_COULD_NOT_BE_OPENED,'C:/dir/yourfile.txt',FILE_NOT_FOUND) could be translated as "File not found: C:/dir/yourfile.txt".

Dealing with time zones can be seen as an encoding issue. Just as a byte string is not well defined in meaning without an encoding applied to it, a specific time is not well defined without a time zone attached to it. There is no such thing as "11:00 p.m.", only "11 p.m. UTC" or "11 p.m. EST". However, there is no equivalent to nesting time zones, and no equivalent to "decoding"; you just "encode" directly into the time zone you want. Despite this being much easier than text encoding, a lot of programmers get it wrong too, and will pass times or dates back out of their libraries or frameworks with no (correct) time zone attached, or mishandle time zones entirely, because they don't seem to realize that a time without a time zone isn't a time at all, just as a text string without an encoding isn't a text string at all. (It is somewhat true that at the moment a time enters the system you can assume the local time zone, but no code after that point can.)
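Python's standard library makes the analogy concrete: a "naive" datetime is the moral equivalent of a byte string with no declared encoding.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A naive datetime names a wall-clock reading, not an instant in time;
# like an undeclared byte string, its meaning is undefined.
naive = datetime(2023, 6, 1, 23, 0)

# Attaching a zone "encodes" the same digits into actual instants.
utc = naive.replace(tzinfo=timezone.utc)
eastern = naive.replace(tzinfo=ZoneInfo("America/New_York"))

# Same digits, different instants: four hours apart in June (EDT).
assert (eastern - utc).total_seconds() == 4 * 3600
```

Note also that there is no "decode" step here; you convert directly from one zone to another, which is why the author calls this the easy case.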

Data with units can also be seen as an encoding; converting from inches to feet can be seen as an encoding conversion. The equivalent of an injection attack is "converting" a meter to an inch without applying the proper conversion factor; NASA of course famously did this, but they are far from the only ones, and you can get yourself into trouble even in pure metric if you "convert" meters to kilometers incorrectly.
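One defense is the same as for text: never let the number travel without its "encoding". A minimal sketch (the Length class and conversion table are my own illustration, not from any particular library):

```python
from dataclasses import dataclass

# Conversion factors to a common base unit (meters).
_TO_METERS = {"m": 1.0, "km": 1000.0, "in": 0.0254, "ft": 0.3048}

@dataclass(frozen=True)
class Length:
    value: float
    unit: str

    def to(self, unit: str) -> "Length":
        # Conversion must pass through an explicit factor; a bare
        # reinterpretation of the number is impossible by construction.
        meters = self.value * _TO_METERS[self.unit]
        return Length(meters / _TO_METERS[unit], unit)

assert Length(1, "m").to("km").value == 0.001
```

The NASA-style bug, treating a quantity in one unit as though it were in another, now requires deliberately discarding the unit rather than merely forgetting it.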

We Do Have A Problem

This all represents a big problem.

On the one hand, failing to correctly manage encoding is either the number one source of security vulnerabilities, or number two trending towards number one. As we add more and more encoding layers into our systems, it becomes more vital that each layer work totally correctly, ideally totally transparently, and that every program handle each encoding layer totally correctly.

On the other hand, correctly managing encoding, all the time, at all layers, is extremely difficult. (Initially I had written "extraordinarily" difficult, but regrettably, this difficulty is all-too-ordinary.) My experience suggests it requires a well-above-average developer to even recognize that this is a systematic problem. It takes a skilled developer to merely get it right most of the time. I think that in current environments, it requires a superhuman developer to get it right all of the time.

However, we do not have superhuman developers.

I don't have a simple solution to this. My best solution is at least to not sweep the problem under the rug, to acknowledge that this is a real problem and is not trivial, and, when applicable, to default to a safe encoding level (i.e., if concatenating HTML together, encode for PCDATA by default and require the programmer to explicitly ask for the unsafe raw ASCII/UTF-8), but this is not always possible or easy.
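A minimal sketch of what "safe by default" means for HTML concatenation; the Raw wrapper and pcdata helper here are my own illustration of the pattern, not any particular templating library's API:

```python
import html

class Raw(str):
    """Explicit opt-out: the programmer must ask for unencoded text."""

def pcdata(*parts) -> str:
    # Encode by default: plain strings are escaped for PCDATA; only
    # values explicitly wrapped in Raw() pass through unencoded.
    return "".join(p if isinstance(p, Raw) else html.escape(p)
                   for p in parts)

assert pcdata("<b>") == "&lt;b&gt;"                       # safe by default
assert pcdata(Raw("<b>"), "a < b", Raw("</b>")) == "<b>a &lt; b</b>"
```

The important property is the asymmetry: forgetting something yields over-escaped output, an ugly but harmless bug, while producing unescaped output requires a visible, greppable Raw().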

But I can say this with confidence: Improper encoding and improper understanding of encoding is one of the largest problems facing the programming world today.


PS: How can I claim encoding errors are the number one cause of problems "in many modern languages"? Buffer overflows are clearly the number one problem in the real world, historically, but we are conquering them by writing code in languages that effectively can't have buffer overflows. They should be trending down as we transition away from buffer-unsafe languages. It will take a while, though, and there will likely always be a "bottom layer" that can still be screwed up.

PPS: There's a ton of stuff to nitpick in this essay, like the fact that ASCII is 7-bit, not 8-bit (but note I stayed in 7-bit land), that I'm not quite using "code point" in exactly the way that Unicode does, that "bytes" aren't the only way to look at what a computer can store, that what we call "a byte storing 38" is itself an encoding of electrical signals and not "truly" 38, that in Unicode it's more proper to talk about mapping language strings to other language strings because of a variety of complexities having to do with how code points can interact (see the normalization standards), that the CSV I define doesn't match the RFC (though IMHO my CSV is better, or it would be with one additional specification about what to do with bad escape sequences), that encoding character-by-character has awful performance implications in Python (and isn't how I'd actually do it), and any number of other fiddly corrections. (Also, for the purposes of discussion, I'm defining my own concept of encoding, which may or may not match any other concept drawn from any of the discussed domains.) But before you go complaining to me about it, ask yourself if your criticism would make the point clearer, rather than muddying it up with caveats, exceptions, and clarifications that only make sense after you already understand the idea. This post is like the Newton's Laws of encoding to the Relativity of reality. (I know this is a problem, because I started with those caveats, and it just made the article worse.)