Writing an Assembler

My first real experience with computers at any level was with the TRS-80 Color Computer back in the 1980s. That machine has a Motorola 6809 CPU in it (an 8-bit CPU with a 16-bit address space), and it was on that machine that I first learned about assembly language programming. (NOTE: it is incorrect to call it “assembler programming” or “programming in assembler”.) As a result, I have a soft spot for that old machine and it has become something of a hobby.

Over the years, a great deal of work has been put into new hardware and software for that system. This includes a port of the Donkey Kong arcade game as recently as 2007. Yes, that’s right. This year. There has also been a replacement CPU from Hitachi (the 6309), a drop-in chip that is both pin-compatible and instruction-compatible. It has a few extensions, however, which make it somewhat interesting to play with.

Recently, I thought it would be interesting to write an operating system for that system. I wanted a somewhat Unix-like system. So I geared up with a cross-assembler (that’s an assembler that runs on a different platform than the one it assembles code for) and other useful tools and got to work. I soon found that the assembler I was using had a few things I didn’t like, including a lack of macros and a tendency to throw phasing errors (more on those later) in cases where I thought it shouldn’t. Oh, I could make do with that assembler and make everything work, but I thought, why? So I undertook to write my own assembler, and I decided I was going to make it work really hard to avoid phasing errors.

Now there are a number of types of assembler. Most are two-pass assemblers: the first pass resolves the addresses of all the symbols in the program, and the second pass actually generates the code. This works really well until you have a forward reference (a reference to a symbol that is defined later). On the first pass, the assembler has to assume the largest possible size for the addressing mode used to refer to the symbol. On the second pass, however, it may generate a smaller addressing mode, which causes every symbol after that instruction to have a different address on the second pass than it did on the first. This is known as a phasing error. Phasing errors can be completely avoided by always using the largest possible addressing mode when there is any doubt. However, this is not ideal. (Note that the EDTASM+ native assembler for the TRS-80 Color Computer (CoCo for short) does just that.)
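To make the phasing problem concrete, here is a toy model of a naive two-pass assembler in Python. The instruction names and sizes are simplified stand-ins of my own invention, not real 6809 encodings: a “short” branch is 2 bytes, a “long” branch is 3.

```python
def size_pass(program, symbols):
    """One sizing pass: compute addresses using the symbol table from the
    previous pass (empty on pass 1, which forces the worst-case 3-byte branch)."""
    new_symbols, addr = {}, 0
    for item in program:
        if item.startswith("label "):
            new_symbols[item.split()[1]] = addr
        elif item.startswith("branch "):
            target = item.split()[1]
            if target in symbols:
                offset = symbols[target] - (addr + 2)
                addr += 2 if -128 <= offset <= 127 else 3
            else:
                addr += 3  # unknown forward reference: assume the long form
        else:
            addr += 1  # every other instruction is 1 byte in this toy model
    return new_symbols

program = ["branch done"] + ["nop"] * 10 + ["label done"]

pass1 = size_pass(program, {})     # forward reference assumed long: done -> 13
pass2 = size_pass(program, pass1)  # branch now fits the short form: done -> 12
print(pass1["done"], pass2["done"])  # 13 12 -- the symbol moved: phasing error
```

The symbol `done` lands at a different address on each pass, which is exactly the mismatch a two-pass assembler reports as a phasing error.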

Now I figured it must be possible for the assembler to resolve many forward references to the smallest possible addressing mode without causing phasing errors, provided an intermediate representation of the entire program is maintained and multiple passes are made through it to resolve the symbol locations and instruction modes. In effect, this means injecting an optimization pass between the first and second passes of the assembler. Let’s call this pass 1.5. Note that my assembler reads the source code exactly once. Once it has done so, the entire program is stored in memory in some form and all subsequent processing occurs in memory. This is not ideal and can produce a very large working set. However, it is unlikely that a program for the 6809, which has only a 64K address space, will generate a working set too large for main memory on a modern Linux system.

Pass 1 is relatively straightforward. It simply involves reading all the source lines, figuring out the opcodes and instruction sizes, and assigning addresses to symbols. However, because we don’t know the correct size of an instruction at this stage in the face of a forward reference, the assembler must maintain a range of addresses for each instruction and a range of values for each symbol. Interestingly enough, it is often possible to determine the precise size of an instruction even with an uncertain symbol value and an uncertain address for the instruction, and in any case where this is feasible, the assembler will do so.
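The range bookkeeping in pass 1 might look something like the sketch below (the field names and structure are my own, not the actual assembler’s): each instruction carries a minimum and maximum size, and each symbol ends up with an address range rather than a single address.

```python
# Sketch of pass 1's range bookkeeping (my own naming, not the assembler's).
# Each instruction tracks (min_size, max_size); each symbol gets an address range.

instructions = [
    {"op": "branch", "target": "loop", "min_size": 2, "max_size": 3},
    {"op": "nop", "min_size": 1, "max_size": 1},
    {"op": "label", "name": "loop"},
]

def assign_ranges(instructions):
    symbols = {}
    lo = hi = 0  # running lower and upper bounds on the current address
    for ins in instructions:
        if ins["op"] == "label":
            symbols[ins["name"]] = (lo, hi)  # address only known as a range
        else:
            lo += ins["min_size"]
            hi += ins["max_size"]
    return symbols

print(assign_ranges(instructions))  # {'loop': (3, 4)}
```

Even with only a range for a symbol, a size can sometimes be pinned immediately: if the worst-case offset still fits the short form, there is no ambiguity to carry forward.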

Pass 2 generates the code, just as in a conventional two-pass assembler. However, since it runs after pass 1.5 has resolved the phasing problems, there cannot be any phasing errors at this point and the code can be generated without trouble (unless, of course, there are other problems with the source code).

Pass 1.5 is where the real complexity and trouble appears. Many phasing problems can be resolved by a second pass through the code now that symbol locations are approximately known. However, this will often not be enough. As a result, pass 1.5 actually makes multiple sweeps through the code until a sweep resolves no further phasing problems. Once that happens, it makes another sweep through the code looking for the first ambiguous instruction. If it finds none, it knows the phase correction process is done. If it does find one, it forces that instruction to the maximum addressing mode size and then starts pass 1.5 over again to see if this single change allows the remaining problems to be resolved.

Now this will not solve all addressing mode problems optimally, and it is still useful to receive a warning about addressing modes that are larger than they need to be. However, most of the simpler situations should be resolved by this method, and the more complex situations really do need a human touch to get right. After all, if the code is that convoluted, maybe there’s something else wrong with it anyway?

In any event, I plan to release this assembler as open source once I have it functional. Hopefully I will be able to finish it in the next week or so but it is a non-trivial undertaking. It is further complicated by the fact that the 6309 extensions to the 6809 instruction set are not particularly regular and add additional types of operand and additional confusion to the whole situation. In other words, there is a great deal of work involved in supporting all features of the processor.

My advice to anyone who wants to write an assembler for their favourite CPU is to think carefully about whether you really need to. The 6309 is a relatively simple CPU compared to, say, the 80386 or the AMD Opteron. If things are this complex for a relatively simple 8-bit CPU, imagine the modern 64-bit processors that support legacy 16- and 32-bit instruction sets! So if you don’t have a good reason to write an assembler, you probably shouldn’t.

Compiler Mismatch

I managed to figure out why some software on my system simply refused to work (or even crashed) when using the SCIM input method scheme. It turns out I had upgraded my compiler for some reason and that was causing mismatches in the compiled code. Not terribly helpful, really. I downgraded the compiler back to what came with the system (and, presumably, what the rest of the software on the system would have been compiled with) and rebuilt the SCIM stuff. Oddly enough, entering ひらがな (hiragana) characters is now a lot easier.

Theoretically, everything should now accept the new input scheme, but you never know. Time will tell. Even so, it looks like everything I use on a regular basis is happy with the now compiler-matched version of the SCIM software.

Multilingual Typing…

I have managed to get my system configured to allow me to type in multiple languages with my bog-standard English keyboard. While that might not sound particularly exciting, it is somewhat cool. And also somewhat useful if you want to put, say, ひらがな (hiragana) characters into a document. It still has some problems, like some software not being happy with the idea of fancy input methods. Even so, it’s several steps forward from where it was before.

Of course, the scheme allows more than just Japanese characters. It allows for pretty much any characters I feel like inputting, once I figure out how to do it for a particular language. It’s a new toy to play around with. It should prove to be amusing for a while at least.

Overall, it’s amazing what a simple search on Google revealed. I found a document that up and said to use a software package called SCIM and to install something called anthy, and lo, things started to work. And, oddly enough, it worked as advertised. How lucky can a person get?

Well, enough rambling about that. Back to playing with my new toy.

Memory Failure

You wouldn’t think that memory that had been working perfectly for two years would up and fail, causing a computer to stop booting. And you would be wrong. I had just such an event happen to me.

It manifested itself as randomly crashing software initially. This meant the problem could be anything from the processor to the memory to the hard drive to the software on the system being messed up. Eventually it got to the point that I could do nothing with the computer so I decided to troubleshoot it. I made sure everything was assembled correctly and found nothing wrong there. I checked for overheating and found no heat problems. Then I swapped the memory with that from another computer. And lo, the other computer started crashing and the first one became stable.

So now I’m $100 poorer but have shiny new memory in my computer.

I hate when computer hardware fails.

Stupid Verisign Tricks Redux

Reactions to the actions taken by Verisign as described in my blog entry from September 16 have been heated and varied. It currently appears that Verisign has no intention of ceasing this nonsense. However, certain internet authorities have finally been heard from today.

The Internet Architecture Board has released an analysis of the use of wildcard DNS records at high levels in the DNS hierarchy. Anyone interested in this situation and its implications is encouraged to read this analysis. Perhaps the best part of the article from my perspective is this: "Proposed guideline: If you want to use wildcards in your zone and understand the risks, go ahead, but only do so with the informed consent of the entities that are delegated within your zone."

The Internet Corporation for Assigned Names and Numbers (ICANN) has also finally broken its silence with the following advisory about the situation. While I usually disagree with ICANN’s tactics, this particular one of actually studying the issue and asking for feedback from other organizations is good, in particular their call for Verisign to voluntarily suspend the operation of the wildcard until the investigation is completed.

As things go, this issue has been little more than a minor technical annoyance to many of us in the industry. However, it was the sheer gall it took on the part of Verisign to say that they were doing this for the good of the internet when they, by their own admission, were profiting from it that got up most of our noses. Not to mention the protocol breakage that is mentioned in the IAB article noted above.

In fairness to Verisign, it should be noted that they were not the first ones to introduce a wildcard into a TLD, simply the most prominent one.

Stupid Verisign Tricks

On Monday, Verisign, the company that manages the contents of the .com and .net zone files, hijacked all non-existent domains to point to an intermittently functional search service. This does not affect any top level domain other than .com and .net.

Apparently, Verisign has decided that any DNS query for an A (IP address) record for any non-existent second-level domain in the .com or .net top level domains will now resolve to an IP address controlled by Verisign, which then attempts to guess what the user is trying to do. While this sounds like a great idea on the surface, and is, in fact, markedly similar to what many web browsers and online providers do, it is a horribly bad idea. When my web browser offers to search for the domain I misspelled, it affects me and me alone. When an online provider does this, it affects only the customers of that provider. In both cases, there is the possibility of using a different browser or service. However, in the case of Verisign doing it in the DNS system itself, it is impossible for anyone trying to access a .com or .net domain to opt out, regardless of provider or web browser or any other consideration.

In addition, the DNS system is designed to respond with a negative answer when a request is made for a name that does not exist. This allows web browsers, email servers, and so on, to do something useful in this circumstance, like tell the user the domain does not exist. However, by adding an A record for non-existent domains, it is now impossible for a mail server to know that a domain really doesn’t exist. And while the user can likely figure out that the web site they requested does not exist based on the response from the server Verisign is pointing them to, automated systems that rely on this negative response behaviour have no way of deducing it. And relying on the negative response is by no means broken, since that is the only way the system can indicate that a domain does not exist.
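A toy illustration of why synthesized answers break automated checks (the zone data, IP addresses, and function names here are entirely made up for the sketch; this is not a real resolver):

```python
# Toy model: a mail server that decides whether a domain exists by
# checking for a negative (NXDOMAIN-style) DNS answer, modeled as None.

zone = {"example.com": "192.0.2.1"}  # the only domain that really exists

def lookup(name, wildcard=None):
    """Return an A record, or None as the negative answer."""
    if name in zone:
        return zone[name]
    return wildcard  # a TLD wildcard synthesizes an answer for everything

def domain_exists(name, wildcard=None):
    return lookup(name, wildcard) is not None

# Without the wildcard, a typo'd domain is correctly reported nonexistent.
print(domain_exists("examp1e.com"))                          # False
# With it, every name "exists", so mail to a typo'd domain can no longer
# be bounced early with "domain does not exist".
print(domain_exists("examp1e.com", wildcard="192.0.2.2"))    # True
```

The negative answer is the only signal the protocol provides for “this domain does not exist”, which is why software that depends on it cannot simply be called broken.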

To make matters worse, Verisign provided no notice to relevant internet community groups, such as NANOG, that such a change to standard operating procedure was going to be made. In fact, the first notice many network operators had was that nonexistent domains were suddenly resolving. Many others learned of this via discussion threads on NANOG, which can be read in the NANOG archive at the above link. Many more in the internet community would have learned of it from the Slashdot article and related discussion on the issue.

The uproar over this issue shows no signs of dying down any time soon either, as messages fly around the internet at an amazing rate.

I hereby call upon Verisign to do the right thing and cease and desist this reprehensible attempt to hijack the .com and .net domains as their own personal playground. It is high time that Verisign started acting in a manner befitting an organization on whom a public trust has been bestowed!

Update at 1645: It looks like the authors of the BIND name server software are creating a patch that will allow users of BIND to bypass the Verisign brain damage. See a news report here. BIND is available from the ISC.

Update at 1435, Sept 17: Debate continues to rage about this issue. Some folks have taken actions which may or may not help. The ISC has released a patch to BIND which allows people to work around the problem. In addition, one person has publicly sent a formal complaint to ICANN (the body supposedly in charge of .com and .net overall) which is worth a read for those interested.