In his recent article on Color Basic String Theory, Allen Huffman raises some interesting points. It turns out that Color Basic (on the TRS-80 Color Computer) has a very straight forward way of handling strings but there are some details that arise from it that may prove a bit surprising. Additionally, there are requirements and limitations that drive some of the design of the string handling system. I’m going to go into detail on just how Color Basic handles strings behind the scenes.
I was going to start with a detailed background of everything but it very rapidly got off the rails, so I’ll just dive into the details as they stand.
Strings are handled in two parts by the interpreter. First there is the descriptor which contains the length of the string and a pointer to the string data. The length is an 8 bit unsigned value (negative string lengths don’t make sense, do they?) which gives a maximum string length of 255 bytes. The pointer is a 16 bit value simply because the address size on the Color Computer is 16 bits. That gives us a minimum of 3 bytes for a string descriptor. The string data can exist anywhere in memory though the interpreter has a few bits that break if the string is at a memory location usually considered to be ROM (notably the VAL() function).
As it happens, string descriptors are actually 5 bytes. The reason for this is due to the variable storage mechanism which stores all variables of any type in one big table. Because floating point values are 5 bytes and string descriptors are 3 bytes, every variable value (and array value) must be 5 bytes in size. This is to avoid having to have different variable table search code for strings and numbers. By having all values store in the same size space, quite a few things become simpler and it does save in code complexity and size in the ROM.
Since string descriptors are 5 bytes, you might wonder why the length is only 8 bits. That may be a holdover from other Microsoft Basic implementations for machines where 16 bit arithmetic is unpleasant. The storage space to hold a 16 bit string length is clearly there. Still, for most applications, strings longer than 255 bytes tend not to be immensely useful simply because any operation modifying a string will tend to need temporary space that is often the same size as the original string, or larger.
String Data Storage
We have seen that the string descriptor simpy contains a pointer to the actual string data. If the string is NULL (zero length), that pointer can be ignored. Otherwise, it can point anywhere in memory. Often, this will be a pointer to the actual string definition in the program text. This will be the case for a static string assignment. However, as soon as you do anything dynamic to it, or for a string that is dynamically generated, a copy of the string has to be allocated somewhere. That somewhere is called “string space”.
If you’re familiar with a language like C, you might expect this to simply come from the same memory pile as the variable table, a pile known as the “heap”. That, however, is a very inefficient way of allocating small memory chunks because the allocator has to keep track of all the allocated memory chunks. It also leads to fragmentation of the heap when there is a lot of allocation and deallocation going on. While this is not a significant problem on modern systems, it is definitely a problem when you have a maximum of 32K of RAM to play with to store everything from the program text to the variable table to video memory. Instead, Color Basic uses a simple list of entries, each of which is 7 bytes, for the variable table which is immediately followed by the array table. Then you have free memory and the stack. Above that is the allocated string space. This eliminates all of the overhead of a typical general purpose memory allocator. It does have the drawback that you need to pre-allocate the string space you need for the entire runtime of your program, though.
Now, it wouldn’t be helpful if the string space (which is really another heap) still had all that overhead of a general purpose allocator. Instead, the interpreter simply allocates space it needs sequentially and doesn’t bother accounting for space that is no longer needed. That means there is essentially zero overhead for allocation (a couple of pointers is all). It also means that if there is enough space that hasn’t yet been allocated, a string space allocation request is an O(1) operation. However, this method means that when you no longer need a chunk of string space, it does not get returned to the system. You might think it is a bad way of doing things, but if you do track a free list, you now have the unenviably complicated task of finding a best fit “hole” for any allocation request. That takes a nontrivial amount of code and penalizes every allocation. It also doesn’t prevent memory fragmentation which means you will lose out on available space after a while or you will have to waste space for minimum allocation sizes. This is not ideal given the number of 1 or 2 character strings that tend to be tossed around.
Instead, when string space allocation fails, Color Basic runs a process called garbage collection. What this does is compact all strings currently allocated in string space into a contiguous blob, thus returning all free space to the “leading” edge of string space. Unfortunately, this process requires scanning through all extant strings (all variable, array, and temporary string descriptors) for the lowest pointer address. Then it moves that string data to the start of the uncompacted string space. It repeats that process until it doesn’t find another string descriptor. It ignores strings pointing outside of string space. In a worst case, this can be an O(n²) process where n is the number of extant strings.
As a result, sizing your string space to the absolute minimum that will allow your program to run will actually potentially cause a lot of overhead since garbage collection will trigger continually. However, by sizing the string space larger, you can get a potentially huge improvement in program execution speed. Exactly how much to overallocate will depend on factors such as how much your program manipulates strings and how big those strings are on average. If your program is heavy on string manipulation but doesn’t need a lot of variables or arrays, you may be well employed allocating string space upwards of 16K. Remember that garbage collection speed depends on the number of extant strings (and to a lesser extent their combined length), not the size of string space. So even if you burn through 16K of string space, if you only have a dozen strings anywhere, garbage collection could be quite fast and only trigger a long intervals.
Okay, that explains how strings are stored. However, you still have to do stuff with them. In comes the expression evaluator. This is the same expression evaluator that handles multiplication, subtraction, comparison, number parsing, function calls, and the like. You’re probably aware that calculations can get quite complex when you start putting in parentheses or mixing multiplication and addition. It gets even worse when you introduce function calls (things like SIN() or SQR() or what have you). Often the evaluator will have to pause and evaluate a higher priority expression that it has just encountered. In doing so, it has to save its current result then evaluate the higher priority expression and then come back to where it was and complete the original expression. The stack is used for keeping track of this stuff. If you’re tight on memory due to a lot of variables, a huge program, a lot of string space, or a combination, you might actually encounter an ?OM ERROR on a line that just does a calculation. This is why.
That’s all nice for numeric expressions, of course, but surely string expressions don’t run into that? Don’t they just go from left to right since about the only thing to do is concatenation? Well, that’s right as far as it goes. But what happens if you write:
Let’s walk through what the interpreter does. It first determines that it is doing assignment (“LET”). Then it evaluates the first term which it sees is a function call. It happily starts running LEFT$ which then proceeds to evaluate an expression to get the string. In this case, it’s a variable reference so it just grabs that. Then it has to evaluate the second parameter (which may involve strings and the like but by the time that is done, we don’t see any of that). Then it does its work on B$, taking up to the first 4 characters from it. Now it has to return that as a temporary string. It can’t just point to the data in B$. Instead, it allocates space in string space, creates a new descriptor, and has to put that somewhere. I’ll get to that later.
Now the interpreter has a temporary string. It now gets to the operator, sees that it can just process it, so it now processes the call to RIGHT$. RIGHT$ does just what LEFT$ did. This time, though, the evaluator has to do something more complicated: it first evaluates C$. Then sees concatenation. If then evaluates D$. Finally, it allocates string space for the combined length, does the concatenation, creates a new descriptor, and has to put that somewhere. Then RIGHT$ goes on and evaluates its second parameter. Then it does just like LEFT$ and allocates string space, creates a descriptor, and returns that. It also throws away the temporary created evaluating C$+D$.
Finally, the top level concatenation can happen. It has the two temporaries (from LEFT$ and RIGHT$). It allocates string space, creates a descriptor, copies the string contents, and has to put that somewhere. It also tosses the two temporaries that it used. Then the temporary created by the concatenation can be stored in A$ and the temporary can be lost.
If you look carefully, we had to have two temporary strings allocated during that process. It wouldn’t take much to make it even more complicated.
The String Stack
Now let’s talk about those temporaries. We could just throw those temporaries on the stack like we do with numbers. That would work find during evaluation. Except it doesn’t. Remember how garbage collection works? It has to be able to check ALL extant strings. But it would have no way of knowing what bits on the stack are string descriptors and what bits are just regular noise. Thus, any string descriptor stored on the system stack would be invisible to the garbage collector. As a result, if it ran at any time while a string expression is being evaluated, it might clobber the data of one of those temporary strings. Instead, we need a way to save those temporaries that the garbage collector can see. This is where the string stack comes in.
The string stack is exactly what it sounds like. Whenever a temporary string is created, its descriptor is pushed onto the string stack. It will hang out there until the expression evaluation process that created it completes and the result is fetched. This stack operates outside the expression evaluator. That means subsequent invocations of the expression evaluator will not clobber the temporary value and it will continue to be visible to the garbage collector.
Color Basic’s string stack has room for 8 entries. That means that if your string “calculation” requires creating more than 8 temporary (or perhaps “anonymous” would be a better term?) strings, the string stack will overflow and you will get one of the rarest errors in Color Basic: ?ST ERROR (String Formula Too Complex).
Now, there is one additional bit about string stack usage. The string stack is not used if the result of an evaluation is a simple variable or array element reference. There is no need since those descriptors are visible to the garbage collector. Thus, this will only use one string stack entry to evaluate:
But this will use two:
The former is simple variable references. The other must create anonymous string descriptors. In both cases, the result of the concatentation will be placed on the string stack. But the second case must put its two anonymous string descriptors there first, though when they are retrieved to do the concatenation, they are removed from the string stack.
Strings and Expression Evaluation
You may be wondering how the expression evaluator and related infrastructure determines whether a string is on the string stack. Or even how it keeps track of strings that have been saved prior to evaluating a higher priority operation. It’s actually quite simple.
During evaluation, a type flag is maintained for each value. If that type is numeric, then the “floating point accumulator” will hold a floating point value. Otherwise, the “floating point accumulator” will hold a pointer to the relevant string descriptor. That pointer may point to a variable table entry, an array element, or a string stack entry. When the appropriate call to retrieve such a value is made, the retrieved pointer is checked against the string stack and if it is at the top of the string stack, it is “popped” off.
If you use a different string data allocation strategy, the string stack might not be required. It’s only required if your allocation strategy requires being able to identify all extant strings. So, for instance, if string space is allocated using malloc(), then any time a string descriptor is disposed of, the corresponding free() call could be made. There would be no need to bother with garbage collection or tracking extant strings in that case.
You might also be wondering why RIGHT$ or LEFT$ need to bother with allocating new strings. After all, string data is immutable until it’s no longer needed, right. Well, actually that isn’t quite true. Assigning to MID$() actually modifies the original string if it is in string space:
Even if assigning to MID$ didn’t change the actual allocated string (which would mess up any other strings that happen to point to the same data), you still couldn’t treat string data as immutable due to the way garbage collection works. If LEFT$, say, returned the original string data pointer but with a new descriptor with a different length, you would now have two strings pointing to the same string space. The way the garbage collector works, this would lead to used string space getting larger after garbage collection since it would end up copying that string twice (or at least the overlapping part). That would also end up clobbering string data since it does the string data relocation “in place” rather than using a duplicate string space area. (And, on a 32K memory space, you really don’t want to have that.) You would have to to some additional logic to prevent this issue which would make garbage collection even slower since the interpreter would have to check for overlapping string allocations. That would be a brilliant sorce of bugs among other things. The garbage collection algorithm is already confusing enough without adding in even more confusion and complexity.
Anyway, the practical upshot is that it is much more robust to simply allocate a new anonymous string for things like LEFT$ instead of trying to be overly clever.