Writing an Assembler

My first real experience with computers at any level was with the TRS80 Color Computer back in the 1980s. That machine has a Motorola 6809 CPU in it (which is an 8 bit cpu with a 16 bit address space). It was on that machine I first learned about assembly language programming. (NOTE: it is incorrect to call it “assembler programming” or “programming in assembler”.) As a result, I have something of a soft spot for that old machine and it has become something of a hobby.

Over the years, a great deal of work has been put into new hardware and software for that system. This includes a port of the Donkey Kong arcade game as recently as 2007. Yes, that’s right. This year. There has also been a replacement CPU from Hitachi (the 6309) which is a drop in pin-compatible and instruction compatible chip. It has a few extensions, however, which make it somewhat interesting to play with.

Recently, I thought it would be interesting to write an operating system for that system. I wanted a somewhat unix-like system. So I geared up with a cross-assembler (that’s an assembler that runs on a different platform than the one it assembles code for) and other useful tools and got to work. I soon found that the assembler I was using had a few things that I didn’t like, including a lack of macros and throwing phasing errors (more on those later) in cases where I thought it shouldn’t. Oh, I could make do with that assembler and make everything work, but I thought, why? So I undertook to write my own assembler and I decided I was going to make it work really hard to avoid phasing errors.

Now there are a number of types of assembler. Most assemblers are two pass assemblers. That is, the first pass resolves the addresses of all the symbols in the program and the second pass actually generates the code. Now this works really well until you have a forward reference (referencing a symbol that is defined later). On the first pass, the assembler will have to assume the largest possible size for the addressing mode used to refer to the symbol. On the second pass, however, it may generate a smaller addressing mode which will then cause the symbol to have a different address on the second pass than it did on the first pass. This is known as a phasing error. Phasing errors can be completely avoided by always using the largest possible addressing mode when there is any doubt. However, this is not ideal. (Note that the EDTASM+ native assembler for the TRS80 Color Computer (CoCo for short) does just that.)

Now I figured it must be possible for the assembler to resolve many forward references to the smallest possible addressing mode without causing phasing errors if an intermediate representation of the entire program is maintained and then multiple passes through it are performed to resolve the symbol location and instruction mode problems. In reality, this means injecting an optimization pass between the first and second passes of the assembler. Let’s call this pass 1.5. Note that in the case of my assembler, it will read the source code exactly once. Once it has done so, the entire program is stored in memory in some form and all subsequent processing occurs in memory. This is not ideal and can cause a very large working set. However, it is unlikely that a program for the 6809 which has only a 64K address space will generate a working set that is too large for main memory on a modern Linux system.

Pass 1 is relatively straight forward. It simply involves reading all the source lines, figuring out the opcodes instruction sizes and assigning addresses to symbols. However, because we don’t know the correct size of an instruction at this stage in the face of a forward reference, the assembler must maintain a range of addresses for each instruction and a range of values for each symbol. Interestingly enough, it is often possible to determine the precise size of an instruction even in the face of an uncertain symbol value and and uncertain address for the instruction and in any case where this is feasible, the assembler will do so.

Pass 2 still generates the code. However, since pass 2 runs after pass 1.5 has resolved the phasing problems, there cannot be any phasing errors at this point and the code can be generated without trouble (unless, of course, there are other problems with the source code.)

Pass 1.5 is where the real complexity and trouble appears. Many phasing problems can be resolved by a second pass through the code now that symbol locations are approximately known. However, this will often not be enough. As a result, pass 1.5 actually makes multiple sweeps through the code until it finds no phasing problems that have resolved themselves. Once this happens, it makes another sweep through the code and looks for the first ambiguous instruction. If it finds none, it knows that the phase correction process is done. If it does find one, it forces it to the maximum adressing mode size and then starts pass 1.5 over again to see if this single change allows correcting other remaining problems.

Now this will not solve all addressing mode problems in the most optimal way and it is still useful to receive a warning about addressing modes that are larger than they need to be. However, most of the simpler situations should be resolved by this method and more complex situations really do need a human touch to get right anyway. After all, if the code is that complex, maybe there’s something else wrong with it anyway?

In any event, I plan to release this assembler as open source once I have it functional. Hopefully I will be able to finish it in the next week or so but it is a non-trivial undertaking. It is further complicated by the fact that the 6309 extensions to the 6809 instruction set are not particularly regular and add additional types of operand and additional confusion to the whole situation. In other words, there is a great deal of work involved in supporting all features of the processor.

My advice to anyone who wants to write an assembler for their favourite cpu is to think carefully about whether you really need to. The 6309 is a relatively simple cpu compared to, say, the 80386 or the AMD Opteron. If things are this complex for a relatively simple 8 bit cpu, imagine the modern 64 bit processors that support legacy 16 and 32 bit instruction sets! So if you don’t have a good reason to write an assembler, you probably shouldn’t.

Leave a Reply

Your email address will not be published. Required fields are marked *