What Every Programmer Needs To Know About Encoding

a work in progress - Jul 4, 2006 - on the off chance you find this please do not link yet.

In many modern, mainstream languages, failure to understand encoding issues are the number one cause of security flaws in software.

The previous reigning champion of security flaws, the buffer overflow, was solved by creating languages where it is effectively impossible to write buffer overflows. Between these languages, and improving defenses at the hardware level, the number of buffer overflows either now is or soon will be trending ever downward.

With encoding issues, we are still in the 1980s... widespread denial that there is a problem, and a strong belief that the solution is Just Try Harder.

Note that this is long because if you don't have a deep understanding about what is going on, you too will write encoding-based security vulnerabilities. Given the state of ignorance about this situation, I do not believe I can make this much shorter.

But before I can discuss any sort of solution, what exactly is the problem? Let us start with a story from the ancient days of computing.

The First Encoding Error

In the beginning was the Flat Text File, and it was good.

Todo:
 
* Eat breakfast.
 
* Eat lunch.
 
* Flunk all my students.
 
* Sleep.
 
* Repeat.

And there was reading and there was writing, and it was the First File Format.

And lo, the Accountant did receive word of this First File Format, and he did come to the Programmers and declare, "Behold, I require the storage of columns of data, recording the flow of gold and jewels as they flow hither and yon." And the First File Format did beget the Second File Format, the Comma Separated File.

Consolidated Consolidations, 11/22/1823, $-45.33
Limited Limits, 11/23/1823, $33.48
Microscopic Microscopes, 11/23/1823, $19.73

And there was reading and there was writing, and it was the Second File Format.

And the Accountant did record the flow of gold, and frankincense, and Michael Jackson albums, and figs. But the Accountant was wroth, for he did enter the number of three thousands and eight-score and five and forty-three centimes, and suddenly of columns there were four:

Figgy Fig Figs, 11/24/1824, $3,165.43

The Account's mighty software saw that transaction as $3, and the error was numbered three thousands and eight-score and two and forty-three centimes, which was many.

And thus was born the First Encoding Error, and the land of the Programmers did fall into darkness and disrepute, where they remain until this day. And there is much wailing and gnashing of teeth.

Understanding The Problem

OK, so flat text and CSV weren't the first and second file formats. Dramatic license and all that.

Hopefully it's obvious what went wrong in that story. What's less obvious is that there are several ways of thinking of the problem, and that some of those ways are much better than others. Unfortunately, the simplest way of understanding it, something like "the computer misinterpreted the comma as a delimiter" is also the least enlightening.

Encoding, defined

In Computer Science, there is a definition of the term "language", which is a set of ordered character sequences that are "valid" strings in that language. "Characters" are abstract entities, which can technically be anything.

For example, let's use "case-insensitive English words" as our example language. We would understand the characters as the letters "a" through "z", and the "language" is the set of words that are valid English words, like "cat", "dog", and "houston". The string "blecrec" does not belong to the English language; it is "invalid".

Now, let us suppose we want to represent the word "cat" in a computer. A computer's memory can not store the character "c", because a computer's memory can not store abstract entities. A computer can only store and manipulate numbers, specifically, the numbers 0 through 255. (For concreteness, let us consider only bytes for now.) These numbers are the only characters the computer understands, and that makes it a uniquely priviledged character set; everything we do must ultimately come back down to these characters or a computer incapable of dealing with it.

(Picture: Letters on one side, Numbers on the other. Computer numbers are black and white. Letters are desaturated blue on light yellow.)

It is important to understand these as two distinct things: On the one side, we have the characters in the English alphabet, on the other, the numbers a computer can store.

What the computer needs is a mapping between these two sets. Encoding is the process of applying the mapping from one character set to another.

One popular encoding of the English character set is ASCII. Using this encoding, for the string "cat", we get [99, 97, 116]. Note that [99, 97, 116] is not itself "cat". [99, 97, 116] is to the computer exactly what it looks like, a series of numbers. It is merely the ASCII encoding of "cat".

(EBCDIC: phosphor green, old-school font)

There are many possible encodings of "cat". Another standardized encoding that includes English characters is EBCDIC, although it is now obsolete. With the EBCDIC encoding, "cat" is encoded as [131, 129, 163]. This is every bit as much "cat" as [99, 97, 116] is, viz. not at all.

(Note: When I say ASCII in this document, please read it as any of the extended ASCII variants that use all 8 bits. It doesn't matter which since my examples will only use standard ASCII characters.)

Characters in an encoding

Most of the characters we are familiar with in ASCII are the concrete characters of the real world that we can fool ourselves into thinking are "real", like the letter A. You can believe it is possible to hold a letter "A" in your hand.

You actually can't, but you can believe you can.

We often confuse objects in the physical world with the information they represent, confusing the signifier with the signified. (Hopefully by the time you are done reading this you will see that even a carved letter A, such as you might see in a nice sign, is itself just another encoding.)

But there are other characters in ASCII, one of the "most concrete" encondings there is, that are completely abstract. How do you hold a "line feed" in your hand? How do you hold a "beep"? And perhaps my personal favorite, how do you hold a NUL in your hand? A Zen riddle for the modern age.

Even this simplest of encodings contains abstract characters that either have physical meaning only as an action (LF, "line feed"), or can't be said to have any physical meaning at all (NUL).

(Picture: The CSV set of characters: some other color)

Getting back to our example, the CSV file format has a character in it that we can call the VALUE_SEPARATOR. This VALUE_SEPARATOR, despite the name, is not a "comma". It is an abstract entity that must be encoded somehow in order for the file to be written to disk. There is also the ROW_TERMINATOR character, usually the local newline character or charecters.

Wrong Solution #1

The CSV file format has two characters it needs to encode, VALUE_SEPARATOR and ROW_TERMINATOR. The traditional encodings are "the ASCII comma" and "the local ASCII newline". For concreteness, we will use the Unix standard "line feed" character 10.

The natural solution for a beginning programmer, which is what was used by our hapless programmers in our tale of woe, is the following (in the "working psuedocode" known as Python):

def writeLine(fields): print ",".join(fields) + "\n"

Which you can enter into your friendly local Python interpreter. If you do, and you feed it:

writeLine(["Figgy Fig Figs", "11/24/1824", "$3,165.43"])

You will see printed:

Figgy Fig Figs,11/24/1824,$3,165.43

Now, we finally get to the root of the conflict. Both the ASCII comma, such as the Accountant tried to use in his number, and the VALUE_SEPARATOR in the CSV file format mapped down to the same final value, 44. But when the CSV parser reads the file back in, it must assume that every comma is in fact a VALUE_SEPARATOR. The information about the ASCII comma is gone.

(Wrong picture: CSV interleaved with ASCII, going to numbers.)

When the Accountant's program later went back to read that line, it is not correct to say that the CSV parser is "misinterpreting" the "$3,165.43" as two values. That implies that the information is present in some sense, but the CSV parser is too dumb to figure it out. The information is no longer there; even a human can not look at "$3,165.43" and be sure that what is intended is three thousand, and not three dollars. A human can make a good guess, but it is still a guess, and we don't really need people guessing when it comes to accounting information.

By the time the file hit the disk, it was already corrupted. Two distinct character sequences, "($3) VALUE_SEPARATOR (165.43)" and "($3,165.43)" have been mapped down to the same encoded value sequence, and there is no longer any correct way to tell them apart. (I use the parentheses to indicate the "value" level of the CSV file.)

"There is always an easy solution to every programming problem — neat, plausible, and wrong." (quotation mutated from the collective conciousness)

I am going to say this encoding is ambiguous: There is at least one instance of two distinct character strings in the original data that, when encoded, result in the same character string in the new encoding. In this case, that is the value-level comma and the VALUE_SEPARATOR.

Wrong Solution #2

(Picture: Forbidden commas)

The root problem above was trying to jam 258 characters into 256 slots; by the pigeonhole principle we know that can't work. So, the next most easy thing to do is to carve out 2 characters from our set and declare that they will encode our two special CSV characters, and they are forbidden from being contained in the values themselves.

If you happen to have a copy of Excel or the Open Office spreadsheet lying around, it can be instructive at this point to pull open the dialog that imports a CSV file, and poke around with the bewildering array of options you have for delimiting fields.

This solution is less wrong in that it at least does not build ambiguity right into the specification; it's completely specified. The problem is, even with CSV, which is pretty minimal as file encodings go (two extra characters over flat text), there are no two characters that you can say to your users in general "You may not use these two characters in your files."

First, obviously, commas are pretty useful, so you're not going to want to use those. The import dialog will show you just how many other delimiters have been tried. Tab is the most common. You can't get much more exotic than that, though, or you lose one of the major benefits of a CSV file, which is that you can profitably edit it as text. Any character you can type to use as the value delimiter, you can also want to type to use in the value. The same concern holds with the newline for ROW_DELIMITER, too.

Second, you just never know what those wacky users are going to want to do and what you'll have to do to enable it. While I'll admit I've never seen it, I'd lay money that somebody, somewhere has embedded binary files like images into a CSV file. Maybe it wasn't the best solution. Maybe it was just a hack. Maybe, just maybe, if I knew all the details, I'd agree that it was the best solution to the problem. But regardless of the reasons for the embedding, once you've got a binary file, all bets are off; the "value" can contain anything, even nulls. No matter what two characters you try to reserve from ASCII, the binaries will contain them.

Even if you think this works in your particular case, it doesn't. And even if you still think it works, you're better off creating a file format with a better encoding system anyhow so that you won't find out the hard way that no, it didn't work in your case after all.

This encoding solution is incomplete: There exist character strings that you may want to represent that this encoding does not allow at all. This is only acceptable when you can guarantee that those character strings will never be desired.

Layered Encodings

We want the full ASCII set available to us. We want the full CSV set available to us. 258 values, and only 256 different values for bytes. We can't ignore the fact there are too many values, and arbitrarily cutting down the values to fit the characters is not practical. The only option left is to increase the number of characters we have to play with.

There are a number of ways to deal with this. When dealing with arbitrary text, the most popular is escaping. Many varients of CSV, along with HTML, XML, and most other text-based formats will use some variant of this.

The simplest escaping technique is to choose an escape character. I shall stick with tradition and select the ASCII backslash, which looks like: \.

If we want to use the ASCII backslash, we need to introduce another layer of ASCII into the system.

ASCII

CSV

ASCII

Bytes

Theoretically, we could go straight from CSV to bytes, but the problem with that is that in the following discussion, I wouldn't be allowed to refer to things like "ASCII BACKSLASH". I could only refer to "92", and there's two problems with that. First, as humans we can't really read raw numbers and have any understanding of them, so it's not very educational for me to refer to them.

Secondly, even in our programming languages we tend not to refer to raw numbers; we're more likely to actually have the \ character appearing on our screen as a component of some string, and allow the compiler, program, and enviroment to work together to convert that character to a number behind the scenes. So, along with allowing me to say things like "backslash" without being inaccurate, it also more accurately describes how any CSV writer you may write is really working.

Incidentally, with this diagram we see yet another way of looking at our encoding problem: It is easy to get mixed up about exactly which ASCII encoding level you are on.

So, let's do the CSV -> ASCII encoding:

The ASCII character COMMA gets replaced by ASCII's BACKSLASH COMMA, two bytes.
The ASCII character NEWLINE gets replaced by ASCII's BACKSLASH NEWLINE.
The ASCII character BACKSLASH gets replaced by ASCII's BACKSLASH BACKSLASH.
The CSV character VALUE_SEPARATOR gets replaced by ASCII's COMMA.
The CSV character ROW_TERMINATOR gets replaced by ASCII's NEWLINE.
The remainder of the ASCII characters get replaced by the ASCII character they represent.

This process converts the CSV file into a standard ASCII file. I have represented this as a series of string replacements, but it is trivial to convert this to a stream processor.

To get down to the base byte level, now we can just lean on the standard ASCII -> Byte encoding.

This suffers none of the disadvantages of the previous two wrong answers. It represents all characters unambiguously and completely. There is the minor issue of what to do with "illegal escape sequences" in the bottom-level ASCII representation, that is, backslashes followed by something other than a backslash, new line, or comma, but that is a minor issue.

The reason I feel free to label the two wrong solutions above as actually "wrong" and not just "a bad idea" is because the correct solution is not that much harder. Only in the rarest of circumstances would you be justified in taking the risks entailed by using one of the wrong solutions.

Delimited Encoding

The other basic way to delimit nested encodings is to declare in advance how long the encoded field is in some non-ambiguous measurement. For a great example, see the BitTorrent distribution file format, where a string appears in a .torrent file as an ASCII number indicating the length, followed by a colon, followed by the string itself. For instance:

22:Any, chars: 3: go here

represents the string "Any, chars: 3: go here", and any correctly-working parser won't be confused by the embedded "3: go" and think that represents a new three-character string.

This encoding technique is generally a little easier to read and write for computers when you can read or write the entire file at a time, but it's virtually impossible for humans to read or write by hand, and does not stream as well because it's too easy to end up with a large entity that you can't process with confidence until it completely comes in.

If you fix the length of the field in the specification of the file format itself, then you have record-based encoding, which is a very popular thing to do when storing binary data. The traditional "dump of C data structures" saved file uses this.

It also tends to provide some interesting opportunities for attackers to fiddle with the length sizes. If you can't 100% trust the user input, you can't trust the length specifications either. However, length specifications that aren't 100% trustworthy means you end up having to verify the length specifications somehow anyhow, so you often lose out in the end, especially in the era of readily-available XML. Unless you know you need length-delimited encoding, I'd recommend not trying to use it.

Wheels Within Wheels Within Wheels...

The wonderful thing about programming is that once we do something once, we can do it again and again and again.

Data "in the wild" routinely will embed multiple layers deep. Count the distinct encoding layers in the following snippet of HTML, assuming we have some element with the ID of "sample":

Script: <script language="javascript">document.getElementByID("sample").innerHTML = "<pre id="one&two">one\n&\ntwo</pre>"</script>

There are no fewer than nine distinct encoding layers in this little snippet:

At the bottom, we have the particular character encoding the HTML itself is encoded in. My personal favorite is UTF-8. Since I stuck with 7-bit-ASCII-compatible characters in my example, we can say that's what it is.

This layer, much like the CSV examples above, actually contains very few characters, mostly <, >, &, and a few other sundry escape characters, which you can identify by the need to escape them in the next layer.
Next up, we have PCDATA, which carries the text "Script: " itself.
Tag names and attribute names are in a separate encoding, related to PCDATA but more constrained. For example, PCDATA can carry an ampersand, but tag and attribute names may not. In normal practice, you may be better off just thinking of this as a constrained PCDATA rather than a separate encoding, but I believe it is more technically correct to view this as a separate encoding, and it's better to start with technical correctness and work your way to a practical understanding than to try to go the other way.

I'm also going to include the whitespace separating attributes here, as there's no gain to considering it separately. This layer also includes the equals sign, and the quote and apostrophe characters for delimiting the attribute's PCDATA layer.
The attribute values, if properly quoted, contain another PCDATA layer. HTML has traditionally been somewhat looser about literal < characters in the attribute values than it should have been; you really should always encode them as you would any other PCDATA.
Now, we get to the Javascript encoding layer, which encodes the Javascript code. This is a CDATA layer, although the claim that document makes about the CDATA being "unparsed" is slightly exaggerated. In actuality, the processor does need to parse the CDATA just enough to look for the closing script tag. Since this is CDATA and the HTML parser does not understand any encoding layers contained inside, the HTML parser is forced to look for the literal string "</script" (case insensitive), which will close the script no matter what.

That is, even if it looks like <script>var mystring = "</script>"</script>, the first instance of the closing tag is what the HTML parser will see, resulting in what humans may consider junk content and what the computer will consider syntactically incorrect Javascript (a statement ending with an open-quote).

The Javascript encoding layer carries us up to the first quote mark.
Next is the Javascript string encoding layer, which is how you describe strings that may contain arbitrary values in Javascript. The first instance contains the uninteresting string "sample". The second one is more interesting, because it contains stuff destined to be set as the .innerHTML of some HTML element, causing it to be parsed as HTML itself.

From the point of view of the HTML parser, this is the end of the encoding stack. The string inside the <script> tag is passed off to the Javascript interpreter. This is also the end for the Javascript intepreter itself, until it finally executes and parses the Javascript string as HTML.

However, from the human point of view, the encoding stack continues on into the Javascript string itself. And as it is the human point of view we care about, on we go. Fortunately, it's just repeats of the HTML encodings above.
PCDATA again for the internal content. (I don't think we have to re-count the base character encoding because I don't think there's any way to use a different one at this layer.)
Tag names and attribute name encoding again.
The PCDATA layer for the attribute names.

Diversion: XHTML

As alluded to in the CDATA discussion, HTML has a weakness in that it can't really tell when a script tag "should" end; all it can do is look for the first thing that looks like a close tag. In other words, HTML has some built-in encoding-layer confusion. There are some other things that have traditionally proven problematic or underspecified, such as the inclusion of a </textarea> tag in a textarea. (If you want to test your favorite "text box to enter HTML" on the web, try adding a </textarea> and see if it works.)

XHTML fixes that by making the encoding layers much less ambiguous by building on top of XML. I don't want to redo the entire discussion about layers, but I would like to show what the previous example looks like in XHTML. First, the HTML snippet again:

Script: <script language="javascript">document.getElementByID("sample").innerHTML = "<pre id="one&two">one\n&\ntwo</pre>"

And now as an XHTML snippet, escaping the XML escape characters:

Script: <script language="javascript">document.getElementByID("sample").innerHTML = "<pre id="one&two">one\n&amp;\ntwo</pre>"</script>

Raw < characters would represent a new start tag to the XML parser and not get passed to the Javascript interpreter correctly, so they have to be escaped. Alternatively, you can use XML CDATA:

Script: <script language="javascript"><![CDATA[document.getElementByID("sample").innerHTML = "<pre id="one&two">one\n&\ntwo</pre>"]]></script>

The XML CDATA directive tells the XML parser to pass the text through directly, without parsing it as XML. The price is that, again, the first instance of ]]> terminates the CDATA section, regardless of the syntax of any contained data.

How Many Layers?

Quite a few of your are probably shaking your head and wondering how the hell I found 9 encoding layers in a simple snippet of HTML. The reason that it may seem so surprising is that by design, most of those layers are designed to be lightweight, so a human can use them. For instance, the words "one" and "two" manage to pass all the way from the innermost layer out to UTF-8 basically unscathed. (Although note the newline didn't fare so well; the Javascript string layer had to add a JAVASCRIPT_STRING_NEWLINE, a.k.a. \n, because NEWLINE is already a semi-statement-separator in Javascript and can't be directly used in strings.)

For each layer I mention, it either adds or further restricts the characters that can be represented.

I hope that by know you understand why managing encodings is so difficult. While it is true that you'd have to go out of your way to get some of the encoding layers screwed up, it's very easy to mis-match all of the PCDATA layers, or while deeply nested, to forget one of the encoding layers. (After all, it took me a couple of tries to get the example right myself. I have a disadvantage in that I'm actually working one PCDATA layer deeper than you are viewing it at, since I'm basically writing the HTML by hand.) If you're lucky, you'll get a syntax error in the Javascript. If you're unlucky, when the user gives you malformed input, your page gets misparsed and you end up with a mess. Or worse.

So What?

At least five times out of ten, someone who advocates correct escaping will hear some varient of "so what?" So what if we get it wrong? So the user might not be able to enter some values, or some things will be misparsed. So what?

Well, if you're a serious programmer, the word "misparsed" ought to already be sending chills down your spine. And it's also worth pointing out that when dealing with multiple layers, if every layer isn't unambiguous and complete, bugs in a lower layer can result in extremely difficult-to-track-down bugs that may also be completely unfixable, if the lower layer is a library you can't fix.

But even that may not be enough to rattle your cage. So let me take you on a whirlwind tour of what can happen to you when you don't manage your encoding correctly.

XSS (Cross-site scripting) Vulnerabilities

XSS vulnerabilities are encoding failures. They are the same basic problem as the CSV problem discussed at length early, code that looks like the following:

print "<textarea>" + Querystring['Username'] + "</textarea>"

which crosses layers 1 and 2 (base UTF-8 and PCDATA) as numbered above, causing the contents of the Username parameter of the querystring to appear as raw HTML in the resulting HTML. This allows an attacker to insert arbitrary HTML into a page, including arbitrary Javascript, allowing an attacker to re-program forms to send confidential data to their own servers or a wide variety of other mischief. Consult Google about Cross-site scripting to find out more about what it can do; it can often be parlayed into full control over a website with a bit of persistence and luck, depending on the environment.

SQL Injection

SQL injections are encoding failures.

We Have A Problem

This all represents a big problem.

On the one hand, failing to correctly manage encoding is either the number one or number two source of security vulnerabilities, trending towards number 1.

On the other hand, correctly managing encoding, all the time, at all layers, is extraordinarily difficult. My experience suggests it requires a well-above-average developer to even recognize that this is a systematic problem. It takes a skilled developer to merely get it right most of the time. I think that in current environments, it requires a superhuman developer to get it right all of the time.

However, we do not have superhuman developers.

(Did you discuss how structured formats also work, and can reduce the escaping problems? Fixed-length fields. Mention that ASCII has only 128 characters and we really needed to specify one of the extended encodings.)