Validity of Values In Programming Languages

A point touched upon in my essay Interfaces and Nil in Go, or, Don’t Lie to Computers that deserves full expansion is: what exactly does it mean for a value to be valid?

It is something that I think many would consider obvious at first, but if you dig into it… and programming generally does make us dig into things like this… it becomes less obvious.

But if you are one who thinks it’s obvious, riddle me this. I have a string containing the six characters “here's”.

Is it valid?

Well That’s A Stupid Question

So it is. But we often speak, or worse, program as if the answer is trivial.

Consider the (very bad) security advice that you should Sanitize Your Inputs, ideally spoken while nodding sagely and stroking your chin. “Sanitization” is a specific form of validity.

So sure. Let’s sanitize our inputs. Is here's “sanitized” or not?

If not, exactly how shall I “sanitize” it?

Hopefully you are quite irritated at this point and want to scream “How am I supposed to know that?” How are you supposed to know whether here's is valid? How are you supposed to know if it’s “sanitized”?

Without further information, you can’t tell.

Context Sensitive

And therein lies the point. Determining validity is intrinsically a context-sensitive operation.

here's is a valid English word.

here's is not a valid email address.

here's is a valid thing for a text editor to find in a file.

here's is not a valid street address.

Validity is not a property of data. It is the property of data in some context. Without context, validity is undefined.

I said we often program as if this is not true, because how many IsValid() bool methods on some type are there out in the world? In general, it is easy to interpret such a function as making a promise it cannot, strictly speaking, fulfill.

Consider that street address case. What will an IsValid() bool method do? Well, maybe it will consult some database of known-existing addresses to see if it exists. I’m not even talking about mismatches between that database and the real world. Let’s forget the Umptydozen Falsehoods Programmers Believe About Addresses for the sake of argument and stipulate a perfect such function. Even so, does the method do what it says?

Mismatched Contexts

Well… yes and no. The “yes” is obvious, but the sense in which it is “no” is very important for thoughtful programmers to understand. Validity in the context of “does this address exist” is an indispensable concept for a program to be able to manipulate; surely it lies at the foundation of any other specific concept of “validity” one may wish to use. But the problem is that the idea of “validity” implied by that function signature is only and exactly that concept of validity, only and exactly “does the address exist in my database?” This is because it takes no further arguments. (It probably shouldn’t take any arguments; I’m not saying it should. I’m just inferring what the function does from its type signature.)

It is common, to the point it is the normal case, for what the program cares about to not match the foundational concept of validity that a data structure can define in its own isolated context. For almost any real application, the existence of some address is the beginning of validity. You probably care about the country it is in in order to determine if it is a “valid address I can ship my product to”. Or you might care about whether it exists in a high rise if you want “a valid address I can provide lawn care service to”. Or any of an effectively-infinite number of other context-sensitive definitions of “valid” that your application cares about.
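A sketch of how these layers separate, with entirely hypothetical names (Address, Exists, shippable, and the stubbed knownAddresses map are my inventions for illustration, not a real address library):

```go
package main

import "fmt"

// Address is a hypothetical type for illustration.
type Address struct {
	Street  string
	Country string
}

// knownAddresses stands in for the perfect address database
// stipulated in the text.
var knownAddresses = map[string]bool{"1 Main St": true}

// Exists is the data structure's own, context-free foundation of
// validity: "does this address exist in my database?"
func (a Address) Exists() bool {
	return knownAddresses[a.Street]
}

// shippable layers the application's context-sensitive validity on
// top of that foundation: "a valid address I can ship my product to".
func shippable(a Address, shipCountries map[string]bool) bool {
	return a.Exists() && shipCountries[a.Country]
}

func main() {
	a := Address{Street: "1 Main St", Country: "CA"}
	fmt.Println("exists:", a.Exists())
	// Exists, but we only ship to the US, so not valid in this context.
	fmt.Println("shippable:", shippable(a, map[string]bool{"US": true}))
}
```

The library can plausibly provide Exists; only the application can write shippable, because only the application has that context.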

For another example, a tree structure can define a “legal tree” on its own terms, maybe balanced, maybe ordered, maybe other properties. But it can’t know anything about the validity, in some other context, of the data it contains.

By no means is this a bad method. I might suggest instead Exists() rather than IsValid(), but that’s only in the context of this essay; if I found a method with this name in a code review I wouldn’t blink at it in practice.

I’m suggesting instead a mindset change for you, the reader, to think more deeply about what validity means, and to attach the concept of a context to all “validity” definitions, because that’s where the understanding is important. If you understand, the code will naturally follow; if you don’t, it will always have chaos in it. It’s all about the mindset.

Pointers and Validity

C Pointers

The intrinsically contextual nature of validity is a useful tool for all programming work, but the proximal cause of this essay is a profound difficulty people have in understanding validity for pointer types across various languages.

In the modern era, C is the baseline language. C is essentially what our CPUs are designed to execute; a little less true every year, but still substantially true today.

It is well known that in C, all pointers can potentially be NULL, and this value is invalid.

Once you have internalized the lessons of this essay, a little flag should pop up in your mental landscape after a sentence like that… “invalid… in what context?”

Because NULL in C is not merely “invalid”1 in some generalized sense. It is invalid in the specific context of being dereferenced as a pointer. Possibly others as well, I’m not claiming this is an exhaustive list. But in that context it is certainly invalid; dereferencing a NULL pointer is not just ambiently a bad idea, it is something that will crash the program as hard as it can be crashed from the inside. The compiler/runtime/OS have very firm opinions about how illegal an operation this is.

But it is not invalid in some maximally general sense; if it were it would not make sense for it to even be representable in C code. Why have a value that is simply never “valid” available at all? So there are contexts where it is perfectly “valid”. Some functions deliberately return NULL to indicate various things; strchr returns a NULL char* if you search a string for a character that does not exist in the string. In this case, NULL may be invalid in the pointer dereferencing context, but it is a perfectly valid return value for this function2.

There are plenty of other uses as well: A pointer indicating the end of a linked list, a value that was NULL in your SQL database, a value that was missing from the input, etc. In the context of C it has many contexts where it is perfectly valid.

A particularly tricky detail for understanding pointer validity in other languages is that when C pointers are used to implement objects3, usually the C object system needs a pointer to an object to point to the vtable for the object, so a NULL pointer is invalid in the context of calling a method. The relevance of this will be clear shortly.

Go Pointers

I don’t really like that Go used the word “pointer”, because it confuses people with C experience, which is, you know, a lot of programmers. Go pointers lack what I consider the distinctive characteristic of “pointers”, the ability to do pointer arithmetic. Go’s pointers are really more like references, in that they can’t be moved around.

And it is the context of “validity” in which we pay the biggest price for Go’s nomenclature, as programmers import their previous concepts of what constitutes a “valid pointer” in C, be it either

  1. All NULL pointers are unconditionally invalid, which as previously mentioned is wrong but easy to hold in an unexamined manner, or
  2. A richer view of the validity of pointers in C, that is still regrettably incorrect in the context of Go, or many other languages.

Go’s language rules for the validity of a pointer differ substantially from C’s. While on the surface level C and Go’s type systems may seem similar, one of their differences that doesn’t immediately surface from a simple understanding of the syntax is that in Go, a pointer always has its type associated with it. This is magic done by the compiler and runtime behind the scenes.

As a consequence, when Go resolves a method, unlike a C object system, the compiler/runtime does not need the pointer to point at anything in order to resolve the method. The compiler or runtime consults the type of the pointer and can resolve the method with that alone.

Consequently,

package main

import "fmt"

type SomeStruct struct{}

func (ss *SomeStruct) Print() {
	fmt.Println("hello!")
}

func main() {
	ss := (*SomeStruct)(nil)
	fmt.Println("ss is nil:", ss == nil)
	ss.Print()
}

is perfectly legal code which will compile, execute, and print

ss is nil: true
hello!

Moreover, this execution is completely safe. There’s no undefined or even dangerous behavior invoked. You might say it is not only legal, but moral in Go.

A nil pointer in Go is still invalid in the context of being “dereferenced”, but what constitutes a “dereferencing” is more nuanced in Go. A method call does not automatically constitute dereferencing it.

And thus, in Go, nil pointers are not invalid in the context of a method call.
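This is not just a curiosity; it is used deliberately. A classic example, sketched here with a hand-rolled list type of my own: a nil *List is not merely tolerated but meaningful, representing the empty list.

```go
package main

import "fmt"

// List is a minimal singly linked list for illustration.
type List struct {
	Value int
	Next  *List
}

// Len is deliberately written so that a nil *List is a valid receiver:
// in this context, nil means "the empty list".
func (l *List) Len() int {
	if l == nil {
		return 0
	}
	return 1 + l.Next.Len()
}

func main() {
	var empty *List
	fmt.Println(empty.Len()) // method call on nil is legal and useful: 0

	l := &List{Value: 1, Next: &List{Value: 2}}
	fmt.Println(l.Len()) // 2
}
```

The recursion bottoms out precisely because calling Len on the nil Next pointer is valid.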

This can be further confusing because there are plenty of contexts in which nil pointers are invalid….

What Is An Invalid Pointer Exactly?

… although it’s worth taking a moment to ask, very clearly, what exactly is an invalid pointer operation anyhow?

For the purposes of the rest of this post, I’m going to use “invalid” as a synonym for “the runtime will panic when this operation is performed”. It seems a reasonable definition, probably the most reasonable we can give without more context, but it isn’t necessarily the only one. Local areas of code or specific objects may have richer definitions.

It is also worth pointing out that this is not a generalized definition of validity, because determining if something crashes and is thus “invalid” is arbitrarily complicated. It is easy to create a non-pointer value that has an “invalid method” by this definition:

type HasInvalidMethod struct{}

func (him HasInvalidMethod) Invalid() {
	panic("dude, I'm invalid")
}

but this opens a difficult door to the world of decidability, which I’m not going to get into here.

We can settle for now for calling nil pointers invalid for certain operations and leave more complicated definitions of validity for end-code.

Defensive Programming Against nil

Most Go objects with pointer-based methods are written in a way that is not defensive of nil pointers; whether this is by convention or because a large number of Go programmers don’t realize that methods can be written on nil pointers, I do not know. But you see a lot of the equivalent of:

package main

import "fmt"

type SomeStruct struct {
	Number int
}

func main() {
	ss := (*SomeStruct)(nil)
	fmt.Println("ss is nil:", ss == nil)
	// This is not the line that crashes.
	ss.Print()
}

func (ss *SomeStruct) Print() {
	fmt.Println("My number is:")
	// THIS is the line that crashes.
	myNumber := ss.Number
	fmt.Println(myNumber)
}

This does indeed crash with a panic at runtime. But as the comments in the code say, it is not the method call that crashes. It is the attempt to dereference4 the nil pointer that Print received in order to get the Number member. The method call resolved just fine; it is the body that crashed.

I consider this dangerous, though the theory is worse than the practice. If a method might get called with a nil pointer, crashing at some arbitrary location in the method body, wherever it first happens to dereference the pointer, is not guaranteed to be safe. In practice, it tends to work out; it’s usually hard to do anything dangerous without any member access. As long as you are good about not having global variables, it’s often the case that the entire rest of the function up to the first dereference was a pure function anyhow. But it’s not guaranteed, and some thought should be spared in those cases where it may not be safe.
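Where a nil receiver is a plausible input, one option is to decide explicitly, up front, what nil means for the method. A sketch of a guarded version (the “no number” behavior is my invention, not a convention):

```go
package main

import "fmt"

type SomeStruct struct {
	Number int
}

func (ss *SomeStruct) Print() {
	// What a nil receiver means is itself a context-sensitive decision;
	// here we choose to treat it as "nothing to print".
	if ss == nil {
		fmt.Println("no number")
		return
	}
	fmt.Println("My number is:", ss.Number)
}

func main() {
	ss := (*SomeStruct)(nil)
	ss.Print() // prints "no number" instead of panicking
	(&SomeStruct{Number: 42}).Print()
}
```

The guard moves the nil decision to the top of the method, where it is visible, rather than leaving it implicit in wherever the first dereference happens to land.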

Determining Validity Requires The Context

On the one hand, that header is obviously just the thesis statement of the post up to this point.

But now I mean it in a slightly different sense; in order to determine the validity of something in some context, the code trying to determine the validity must have that context. Or “be in that context”, if you prefer.

Consider my address case. It is likely that the database lookup is provided as some library. As the programmers are writing that library, they do not have your context. They can’t have your context even in theory; from their point of view, your context doesn’t even exist yet. So it is temporally impossible for them to write a function for you that precisely represents “the locations we’re willing to ship our products”. They can’t, in theory or in practice.

Similarly, the fundamental issue with “sanitizing your inputs” is that at input time, you do not have the correct context to make any decisions about validity. Today you may know that you’re going to inject the value into an SQL query without proper parameterization, and even if we stipulate that you correctly “sanitize” the input to stick it into your bad database code, you do not know future contexts. Your DB sanitization code left commas behind, because in your SQL strings they aren’t special, but then someone writes equally bad CSV export code that just slams the values out into the CSV file, and now commas are bad values too.

But you couldn’t have “sanitized” that “input” at the time. CSV writing was in the future.

This is the deep reason why the correct answer is to encode your output correctly. Only at output time do you know the context. You know whether you’re writing SQL, or a CSV, or HTML, or a Javascript value inside HTML, etc.

It isn’t that sanitizing the input is OK, but encoding the output is better. It is that sanitizing the input is actually fundamentally impossible. This doesn’t make encoding the output correctly easy; that is still a rich and nuanced discussion of its own. But it is possible, and possible is better than impossible.

Go nil Interface vs. Interface Containing a nil Pointer

And herein lies the final Go twist on this. The whole “nil interface versus an interface containing a nil pointer” is often framed as if having a function that would crack into the interface and tell you whether the interface contained a nil pointer value would solve the problem.

But it wouldn’t, because in general, code consuming an interface can not tell whether a nil pointer is invalid. It can’t, not even in theory, because it lacks the ability to validate “all possible types coming in” on its own terms. By consuming an interface, you are consuming values from types that may not even exist yet, from programmers who haven’t even started writing the code that is calling your code consuming the interface. Your code can not tell whether a nil pointer of some type it has not ever heard of and has no concrete knowledge about is a valid implementation of the interface.
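The distinction itself is easy to demonstrate; what is impossible in general is for the consuming code to decide whether the contained nil is valid:

```go
package main

import "fmt"

type Printer interface{ Print() }

type T struct{}

// A nil *T is a perfectly valid receiver for this method;
// only T's author could know that.
func (t *T) Print() { fmt.Println("printed") }

func main() {
	var i Printer
	fmt.Println("nil interface:", i == nil) // true: no type, no value

	var p *T
	i = p
	fmt.Println("nil interface:", i == nil) // false: carries the type *T
	i.Print()                               // and the call is perfectly valid
}
```

Code that cracked open i and found the nil *T inside still could not conclude anything: for this type, the nil pointer is a valid implementation of the interface, and only the producing side has the context to know that.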

One of the entire points of interfaces is to strip away the need for a bit of code consuming some interface to need to deeply understand the context of incoming values. You can also view this as the deep reason why encapsulation is a good thing in code. We need fences to limit the spread of “contexts” through our code, because needing to understand everything about all the contexts of every bit of code is a crippling disadvantage in a code base.

This is like the sanitization problem, only perhaps flipped around. The code using an interface lacks the context to know the details of the inner contents of the interface.

It is incumbent on the thing producing the interface value to ensure that the interface value produced is going to be “valid in the context of calling one of the methods of the interface”. A consumer of the interface should accept that isolation of the context and simply assume that calling a passed-in interface’s methods will have good results, and if they don’t, the responsibility is on the thing that created and passed in the interface value, for failing to correctly manage validity within its context.

Validity Is Vital

I think understanding validity in the way I’ve outlined here is one of the vital aspects of being a mature programmer. I’ve written this in the context of Go because that is the particular windmill I’ve been jousting at with the programming community lately, but as you can see the idea that you can’t blindly carry definitions of validity from one programming language to another is a very general idea.

I find it kind of interesting that you’d have a hard time pointing to the code I write and saying “Ah, here’s a place where this understanding informed jerf in the code, I can see that plainly.” It’s never that obvious. Yet it is pervasive in my code, along with a general design to write my code to strip away as much of this specialized “context” from as much of my code as possible.

This idea of context underlies a lot of other coding practices as well. Why are “pure” functions a good idea? Because they isolate context; you do not need to seek out the mutable values they depend on to do their job, and you can be confident they won’t mutate anything else. Which variables can change a procedure’s behavior, and which it can mutate, is an essential aspect of a given bit of code’s context. Such context can even change what data is and is not valid; I have plenty of places where database entries determine the valid enumeration values for some other table in the DB, for instance.

Learning to see not just the lines of code on the screen, but the context they form is a vital skill for high-level programming. Learning how to manage and minimize that context will level up your programming skill, and learning to see it in the code you read will level up your code reading skill.


  1. I don’t know if the C standard ever uses the word “invalid” around the NULL pointer, as the standard is rather pricey. However, even if it does call the NULL pointer invalid in the general sense, that simply means they are using the word for a different concept than I am in this essay. ↩︎

  2. In the context of C, anyhow. Yes, yes, Option types and all that are great and all, but we’re in C here. This is a non-negotiable part of the bargain. ↩︎

  3. “Any sufficiently complicated C program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of C++.” - Not Greenspun’s Tenth Rule ↩︎

  4. In Go, the . operator on a pointer value automatically dereferences the pointer, being the rough equivalent of (*ss).Number in C. ↩︎