In his recent article on Color Basic String Theory, Allen Huffman raises some interesting points. It turns out that Color Basic (on the TRS-80 Color Computer) has a very straightforward way of handling strings, but some details arising from that design may prove a bit surprising. Additionally, there are requirements and limitations that drive some of the design of the string handling system. I’m going to go into detail on just how Color Basic handles strings behind the scenes.
Allen Huffman has been posting a series of articles on optimizing Color Basic as found on the TRS80 Color Computer. This is a response to his most recent (as of this writing) entry. Note that I’m not criticizing his article or the conclusions he reaches in it. Instead, this is intended to provide some more detailed background information. It also provides a good starting point to discuss some points about optimization, so even if you aren’t into programming on the Coco, you might find it interesting.
In his article, Allen talks about using a trick involving INSTR() and INKEY$ to optimize a key press handling loop. I won’t go into details on that aspect of it. Suffice to say that the trick works nicely and is something that should have occurred to many people back in the day but apparently did not. For a detailed analysis of the trick, head on over to Allen’s article.
After explaining the trick, he goes on to discuss two possible variants which take advantage of either ON…GOTO or ON…GOSUB. He goes through a number of benchmarks and comes up with some results. However, he was a bit surprised by some of his benchmark results. I am going to run through some implementation details for GOTO and GOSUB in Color Basic. Then I am going to examine how that affects his benchmarks.
Under the Hood
The first implementation detail that needs explaining is finding the destination line in the program. This affects GOTO and GOSUB equally. To understand it, we need to know how programs are stored in memory. Each program line has a 4 byte header which consists of a 16 bit line number and a 16 bit pointer to the start of the next line. Each line also ends with a single NUL byte. Lines are stored sequentially in memory in line number order. The next line pointer makes scanning for a particular line number fairly efficient since finding the start of the next line is a constant time operation. However, it also means that in the worst case we must examine every single line in the program to find the destination line.
Because line number lookups are so common, there are a couple of optimizations to the process. If the requested line number is higher than the number of the currently executing line, the search starts from the next line in the program rather than from the start of the program. Also, the search stops as soon as a line number higher than (or, obviously, equal to) the requested line number is encountered. For the sake of optimization, the not found case doesn’t matter. The practical upshot is that a GOTO to the next line of the program has the same real cost as a GOTO to the first line of the program. In fact, a GOTO N lines ahead in the program costs the same as a GOTO to the Nth line of the overall program (unless the Nth line is ahead of the GOTO statement).
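The search just described can be sketched in Python. This is an illustrative model, not ROM code: it treats the program as a list of line numbers in ascending order and counts how many next-line pointers get followed, which is the dominant cost of the lookup.

```python
# Illustrative model of Color Basic's line-number search. Each step of
# the loop corresponds to following one next-line pointer in a line
# header; names here are invented for the sketch.

def find_line(lines, target, current_index=None):
    """Return (index, steps) for the first line whose number is >= target.

    When the target is ahead of the currently executing line, the scan
    starts from the next line (the forward-search optimization); a
    backward jump always scans from the top of the program.
    """
    start = 0
    if current_index is not None and target > lines[current_index]:
        start = current_index + 1  # forward search optimization
    steps = 0
    for i in range(start, len(lines)):
        steps += 1  # one next-line pointer followed
        if lines[i] >= target:
            return i, steps
    return None, steps

prog = [100, 110, 200, 210, 300, 310, 400, 410]
# Backward GOTO 100 from line 410: scan restarts at the top.
print(find_line(prog, 100, current_index=7))  # (0, 1)
# Forward GOTO 400 from line 100: scan starts at the next line.
print(find_line(prog, 400, current_index=0))  # (6, 6)
```

Note how the forward jump from line 100 to line 400 still walks six line headers: proximity in line numbers does not help, only the count of intervening lines matters.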
Another point is that both GOTO and GOSUB will skip to the end of the current line so that the line search can happen using the line headers described above.
The final important bit is how GOSUB actually works. GOSUB saves the current line number and execution pointer on the stack and then processes the remainder of the command as a GOTO. Then, when RETURN happens, the line number and execution pointer are restored from the stack and execution resumes as though a DATA command is executing. That means any extra stuff at the end of the GOSUB command is skipped. This is necessary for ON…GOSUB to work. As a matter of fact, the actual transfer of control is handled by exactly the same code as a plain GOTO or GOSUB.
Finally, we should look at the stack situation. Both FOR/NEXT and GOSUB/RETURN store information on the stack. Each record on the stack is prefixed with a byte that indicates the entry type. For FOR/NEXT, this yields 18 bytes per entry. For GOSUB, this yields 5 bytes per entry. When RETURN executes, the stack search looks for the first entry that isn’t a FOR/NEXT record. If that happens to be something saved by GOSUB, the stack is reset to that point and control resumes at the calling point. Note that this means any open FOR/NEXT loops started since the GOSUB will also be purged, which means that RETURN in the middle of a FOR/NEXT loop behaves as expected.
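The RETURN-time stack scan can be modelled in Python. The record shapes and the type-byte values below are placeholders for the sketch; only the scanning behaviour (skip open FOR/NEXT frames, then pop the GOSUB frame) follows the description above.

```python
# Rough model of the stack scan performed by RETURN. The actual stack
# records are 18 bytes (FOR/NEXT) or 5 bytes (GOSUB); here each record
# is a (type, payload) tuple and the type bytes are invented values.

FOR_MARK, GOSUB_MARK = 0x80, 0x9A  # placeholder type bytes

def do_return(stack):
    """Pop records until a non-FOR/NEXT entry is found.

    If that entry is a GOSUB frame, pop it too and return the saved
    (line number, execution pointer); otherwise it is an error
    (RETURN without GOSUB). Open FOR/NEXT loops are purged.
    """
    while stack and stack[-1][0] == FOR_MARK:
        stack.pop()  # discard open FOR/NEXT loops
    if not stack or stack[-1][0] != GOSUB_MARK:
        raise RuntimeError("?RG ERROR")  # RETURN without GOSUB
    _, saved = stack.pop()
    return saved

# A GOSUB frame with two FOR loops opened inside the subroutine:
stack = [(GOSUB_MARK, (100, 0x1234)), (FOR_MARK, "I"), (FOR_MARK, "J")]
print(do_return(stack))  # (100, 4660) - both FOR frames were purged
```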
The implications of all of those details are interesting. There are four possible cases arising from Allen’s trick. First, we have the GOTO/GOTO option which simply restarts the INKEY loop:
100 ON INSTR(" ABC", INKEY$) GOTO 100,200,300,400:GOTO 100
200 REM STUFF
210 GOTO 100
300 REM STUFF
310 GOTO 100
400 REM STUFF
410 GOTO 100
Then we have the GOTO/GOTO option that resumes AFTER the INKEY loop (note the dummy handler branches to the line following the INKEY check):
100 ON INSTR(" ABC", INKEY$) GOTO 110, 200, 300, 400
110 REM STUFF
120 GOTO 100
200 REM STUFF
210 GOTO 110
300 REM STUFF
310 GOTO 110
400 REM STUFF
410 GOTO 110
Then we have the GOSUB/RETURN that simply restarts the INKEY loop (note the dummy handler branches to a RETURN statement):
100 ON INSTR(" ABC", INKEY$) GOSUB 210, 200, 300, 400:GOTO 100
200 REM STUFF
210 RETURN
300 REM STUFF
310 RETURN
400 REM STUFF
410 RETURN
And the GOSUB/RETURN variant that continues execution after the INKEY check (note the dummy handler branches to a RETURN statement):
100 ON INSTR(" ABC", INKEY$) GOSUB 210, 200, 300, 400
110 REM STUFF
120 GOTO 100
200 REM STUFF
210 RETURN
300 REM STUFF
310 RETURN
400 REM STUFF
410 RETURN
Allen’s benchmarking examples all have the INKEY check wrapped in a FOR loop in order to run enough times to get a useful benchmark using the 60Hz TIMER feature of Extended Basic. That means he is not testing the case where the INKEY loop is simply restarted after processing a key press.
He did benchmark a simple GOTO/GOTO setup and a simple GOSUB/RETURN setup to determine which was, in fact, faster. His results showed that GOTO/GOTO was about 7% slower than GOSUB/RETURN. This is expected because there is only one line number search in the GOSUB/RETURN case while the GOTO/GOTO case has two, one of which must start at the beginning of the program because at least one of the jumps has to go backwards. He did some additional tinkering to determine that the size of the program affects the line number lookup time, which is also expected based on the internals described above.
He also discovered that ON…GOTO and ON…GOSUB were markedly slower than plain GOTO/GOSUB by testing variants using “ON 1” with a single destination line number. This is exactly equivalent to a plain GOTO or GOSUB as far as the execution result goes. That these variants were markedly slower makes sense when you think about it, because the ON variant has to evaluate an expression to determine which of the destination line number options to use. He was surprised to find no measurable difference between ON…GOTO and ON…GOSUB in this test, though. His tests used 1000 iterations timed with the 60Hz timer, so a measurable difference might appear with a larger iteration count. However, that doesn’t explain why there is no measurable difference between GOSUB and GOTO in this case when, in the base case, the difference was measurable at 7%.
Let’s consider what is going on. In the ON case, the following steps occur. First, an expression is evaluated. Then a check for GOTO or GOSUB is performed. Then the line number options are skipped until there are no more or the correct one is found. Then control is transferred to the base GOTO/GOSUB code. What is likely happening is that the cost of expression evaluation is swamping all other variables in this particular situation.
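The dispatch steps above can be sketched in Python. This is only a behavioural model of ON n GOTO/GOSUB (the function name and shape are invented): evaluate the expression, then skip destination line numbers until the nth one, with 0 or an out-of-range value falling through to the next statement.

```python
# Behavioural sketch of ON n GOTO/GOSUB dispatch. In the ROM, the
# expensive part is evaluating the expression that produces n; the
# skip over unused destinations and the final transfer reuse the
# ordinary GOTO/GOSUB code.

def on_dispatch(n, destinations):
    """Return the selected destination line number, or None when
    execution falls through to the statement after the ON."""
    if n < 1 or n > len(destinations):
        return None  # ON 0 (or too large a value) falls through
    return destinations[n - 1]

dests = [110, 200, 300, 400]
print(on_dispatch(0, dests))  # None: INSTR found no match (no key)
print(on_dispatch(2, dests))  # 200: "A" was pressed
```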
Because that seemed odd, I tried the same benchmarks. I saw some variation between runs of the same benchmark. Taking rough averages across multiple runs, I found that bare GOTO/GOTO was about 6% slower than bare GOSUB/RETURN. I also found no difference between the “ON 1” variants. I was using Mess to run the benchmarks, so some of the variation between runs probably came from there. However, it does seem to confirm Allen’s results.
So what does this tell us about the overhead of GOSUB/RETURN vs GOTO/GOTO? I think it tells us that in a trivial program, the overhead of a RETURN followed by a “skip to the end of the statement” is not significantly different from that of a “look up a line number and transfer control directly” operation. The actual cost in instructions executed is relatively close for a fairly trivial program. For a much larger program, I would expect the GOSUB/RETURN scenario to come out ahead.
Allen also tested with a more representative program structure (like the examples above) and with some extra lines. His result was that the GOTO/GOTO timing depends on factors like program size while GOSUB/RETURN is less affected by them. Based on the internals described above, that is the expected result.
So far, this only examines the case where additional processing occurs after the INKEY check. The case where the INKEY check simply runs until a key is pressed, the key handler is dispatched, and the INKEY loop is then restarted will always be somewhat faster using the GOTO/GOTO method, even though optimizing that case is pretty pointless. This must be so because you always jump back to the start of the INKEY loop regardless of what happens as a result. If you are using GOSUB/RETURN, then a RETURN must execute before that necessary GOTO. If, instead, you use GOTO directly, you skip all the stack manipulation. Either way, the line number search for that backward GOTO has to start from the beginning of the program and ends at the same destination, so the GOTO itself takes the same length of time regardless of its proximity to its destination.
Essentially, based on these results and Allen’s results, it seems that there is no real benefit to using the GOTO/GOTO option. However, there is a good chance that the benchmark structure we have used is introducing enough noise in the results that we are missing some critical detail. It would probably be worth running equivalent benchmarks with 10,000 or 100,000 iterations for each test and running the tests a couple of dozen times each in order to see if there is any significant difference between the ON…GOTO and ON…GOSUB cases. However, that starts getting into statistical analysis and is probably not worth it unless your program really depends on squeezing the absolute maximum performance out of what you are benchmarking.
There are a couple of additional comments I want to make before ending this.
First, you may recall that during a line number search, the actual cost depends only on the number of lines to search; the length of those lines has no impact. That would suggest that compacting the program into as few lines as possible would be a total win, and it often is. However, executing a GOTO or GOSUB (or an implied GOTO as part of THEN or ELSE) first requires scanning byte by byte to the end of the current line before the search can proceed using the line headers. That means if the GOTO or GOSUB is near the start of a long line, any speed benefit from compacting the program into as few lines as possible may be negated.
Next, you should always benchmark any change you make to improve performance. As seen above, sometimes something that should make things faster has no measurable impact. From the tests with GOSUB/RETURN vs GOTO/GOTO, you would expect ON…GOTO/GOTO to be slower than ON…GOSUB/RETURN, and yet the particular benchmark used by both Allen and me shows no measurable difference. That is not to say it wouldn’t show a difference in a different coding environment. However, you shouldn’t assume that just because one construct has more overhead than the other, that obvious overhead is the important factor. In fact, the overhead of executing GOSUB vs GOTO is pretty small when you look at the actual ROM code, so it’s relatively easy to understand why it would disappear in the noise when combined with a much more expensive operation like evaluating an expression.
Basically, the overall point is that just because it seems logical that something should be faster, that doesn’t mean it is, or that the speed increase will be enough to be worth it.
Finally, you need to consider whether your benchmarking process is itself affecting the results of the benchmark. If you can think of multiple ways to benchmark something, try them all; that way, you will have a better idea of the real impact of a change. Relatedly, you should consider whether a measured difference is real, and likewise whether a lack of difference is real. This may require statistical analysis and, so, may not be worth it. In this case, for instance, it would likely be overkill.
Recently, I’ve been playing around with constructing replacement ROMs for the Coco3 internal ROM. Since I have neither EPROM programming hardware nor the hardware knowledge to change out the ROM in one of my actual Coco3 machines, I have been using emulators to test things. One of the most useful emulators for testing is Mess (related to the Mame project). It turns out, however, that Mess gets one important aspect of how the ROM is handled wrong.
At startup, the 6809 fetches the first address to start executing at from FFFEh. Because this has to be present at power on, this must come from ROM somewhere. On the Coco1 and Coco2, this came from BFFEh in the Color Basic ROM. In fact, all the interrupt vectors from FFF0h through FFFFh come from BFF0h through BFFFh. This redirection is accomplished by the hardware in the Coco.
On the Coco3, this vector table actually comes from the top bytes of the internal ROM which would be at FFFxh if the top 256 bytes of the ROM were not overshadowed by the I/O page. Again, the hardware in the Coco3 automatically forces accesses in the range of FFF0h through FFFFh to the internal ROM but this time at the top of it.
One can demonstrate that the active vector area on the Coco3 does not come from BFFxh by powering on a Coco3 and doing the following:
POKE&HFFDE,0
?PEEK(&HFFF0)
?PEEK(&HBFF0)
The first command forces the machine into ROM mode to eliminate any interference from the ROM/RAM transfer. The second shows the value retrieved from FFF0h, the bottom of the CPU vector area and the one vector slot that is not used by the 6809. The third shows the contents of BFF0h, which would be the value at FFF0h if the redirection pointed to the BFFxh range. On the Coco3, one gets a value of 0 from the FFF0h address but 166 from BFF0h. That shows that the BFFxh range is not used for the FFFxh range. The reason the unused slot gets remapped along with the vectors is that remapping only 14 bytes is not practical from a binary logic point of view; it is much easier and cheaper to remap 16 bytes. (In fact, it looks like 32 bytes are remapped in this case, including the FFE0h through FFEFh range.)
One can determine that the vectors must then come from the top of the internal ROM by examining any of a number of dumps of that ROM available from multiple sources. It is possible to examine it on the Coco3 itself but that is a complicated process due to the way ROM mapping interacts with the MMU. Anyway, one can see that the only other place in the internal ROM that the interrupt vector values appear is right at the top of it (what would be FFFxh). And, the icing on the cake is that the byte that would be at FFF0h is, in fact, zero. Further checking will show that the values obtained by peeking at FFExh also match the top part of the internal ROM.
Thus, instead of mapping FFE0h through FFFFh to offset 3FE0h in the ROM, Mess should be mapping those addresses to offset 7FE0h in the ROM. Making the trivial change to src/mess/drivers/coco3.c in the Mess source code to adjust that mapping yields an emulator that still works with the stock ROM but now returns the correct value at FFF0h.
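The corrected mapping can be sketched in Python. This is an illustrative model of the address translation only (the function and constant names are invented, and this is not the Mess source): reads in the FFE0h–FFFFh range are redirected to the top 32 bytes of the 32K internal ROM.

```python
# Sketch of the Coco3 vector redirection: CPU addresses FFE0h-FFFFh
# read from the top of the 32K internal ROM (offsets 7FE0h-7FFFh),
# not from offset 3FE0h as Mess originally mapped them.

ROM_SIZE = 0x8000  # 32K internal ROM

def vector_read(addr, rom):
    """Translate a CPU address in FFE0h-FFFFh to the corresponding
    byte at the top of the internal ROM."""
    assert 0xFFE0 <= addr <= 0xFFFF, "not in the remapped range"
    return rom[ROM_SIZE - 0x20 + (addr & 0x1F)]

rom = bytearray(ROM_SIZE)
rom[0x7FF0] = 0  # the unused 6809 vector slot is zero in the ROM
print(vector_read(0xFFF0, rom))  # 0, matching the PEEK result above
```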
It is interesting to note that the above information is at odds with other documentation available, notably Tepolt’s Coco3 addendum. However, the above information is experimentally verified.
It turns out that generating maps for a game like lwciv is not an easy problem to crack. The map is a critical component of game play and, therefore, must be generated in a way that is actually playable. For a Civilization type game, that means there must be land masses that are large enough to be playable before seafaring technologies are discovered. They must also be far enough apart to be difficult to reach with early seafaring technology, but the placement requirements must not be so strict that they limit the map variations. What follows is part one of my map generation odyssey.
Last fall, I had this notion that it ought to be possible to implement a Civilization style game on the Coco3. With 512K RAM and 320×225 16 colour graphics, such a game should be at least as playable as the original Civilization. Based on that notion, I set about tinkering to see if it was, in fact, plausible to do so. Herein are chronicled my adventures so far. Continue reading “lwciv…”