Color Basic and String Handling

In his recent article on Color Basic String Theory, Allen Huffman raises some interesting points. It turns out that Color Basic (on the TRS-80 Color Computer) has a very straight forward way of handling strings but there are some details that arise from it that may prove a bit surprising. Additionally, there are requirements and limitations that drive some of the design of the string handling system. I’m going to go into detail on just how Color Basic handles strings behind the scenes.

Continue reading “Color Basic and String Handling”

Leap Second on Dec 31. Sigh.

Yet again, we have a leap second being added to UTC to further complicate everyone’s lives. Well, that might be overstating it, but it sure complicates the lives of server and network administrators, among others. The notion is that leap seconds are required to keep UTC in sync with Earth’s rotation and to prevent our clocks eventually being so far out of sync that solar noon will be at midnight. That notion is wrongheaded in the extreme, though.We would simply use some adjustment to get from UTC to local time once it started getting far enough out of sync. Local time would still continue to be approximately related to mean solar time. Continue reading “Leap Second on Dec 31. Sigh.”

Apache mod_proxy_fcgi and php-fpm

To anyone who’s been following developments in the web hosting world, the existence of FastCGI shouldn’t come as a surprise. Neither should the fact that the PHP folks provide a FastCGI process manager called php-fpm. Apache 2.4 comes with mod_proxy_fcgi which makes connecting all of that to Apache relatively easy (for sufficiently recent versions of Apache 2.4). What isn’t easy to find is a nice, simple, recipe for plumbing it all together without undesirable side effects. Continue reading “Apache mod_proxy_fcgi and php-fpm”

Libraries are always good, right?

There is a pervasive belief in the software world that you should never re-invent the wheel and that an existing library is always the best solution. While there is some merit to the sentiment that re-inventing the wheel is often pointless or dangerous, I have recently come to the conclusion that this is not always the case.

Continue reading “Libraries are always good, right?”

“Responsive” design

These days, one of the big buzzwords is “responsive design”. But what, exactly, is a “responsive” design? What is it responding to? Theoretically, it means building a site that provides an optimal experience regardless of the device it is being viewed on. Here’s why that notion is stupid.

First off, you have no idea what device a visitor is viewing your site on. If you think you do, you are seriously deluded at best. How do you find out what device they have? Maybe you sniff the user agent string on the server side and send a page designed for that type of device. But how do you know that the user agent strng isn’t lying to you? Even if yu can trust it, how do you handle the case where the viewer has his web browser at a tenth of the screen size? Or maximized on a ridiculously large screen? Or maybe he is viewing his ridiculously large screen from across the street? The user agent string tells you none of this.

Now some dim bulb out there is probably thinking that this sounds like a brilliant application for javascript. Or, worse, jquery. Are you insane? That is just asking for trouble. Sure. Javascript allows you to query the window size and charge around adjusting your content to fit on the screen in some useful manner. It even allows you to handle the case where the window is resized! But what if the viewer has javascript disabled or doesn’t support it? How is it going to play with a screen reader for a blind visitor? Or a brail readout?

Wait, what? Who said anything about blind people? Think about it. The device showing the site may not be a bitmapped display. After all, what good is a bitmapped graphics display to a blind person? If you’re going to be responsive, why not be responsive to that, too? Okay, I’ll grant you that your content may not be relevant to blind people. It may also be that blind people are such a small part of your audience that you don’t care. But to make a site truly responsive, I contend that you must at least consider them.

Of course, we can also make use of new fangled features of CSS and have the web site choose a layout based on the actual viewport size. We could, say, suppress useless columns on smaller displays or reduce the navigation sprawl. Perhaps you limit the width of a text column to something pleasant to read instead of having it sprawl from one side to the other. This would even work for people resizing browser windows. It doesn’t solve the massive screen from a city block away problem, but that is not your problem anyway. That user likely has a magnifier of some kind. This may be the best overall solution once the requisite new fangled features are supported across all the major browsers.

There’s another aspect of site design that is related to responsive design. How many sites have you encountered that used mouseovers to provide a critical access to the navigation with no other possible means of navigating? How does that navigation work on a touch screen interface where it is nearly impossible to actually do a hover? How many other user interface elements have you encountered that are difficult or less than ideal on something that is not substantially similar to a classic desktop computer?

It’s pretty clear that I agree with a “responsive” site that is implemented sensibly (which most are not), so why do I think the notion of “responsive design” is stupid? The whole notion is stupid because it should never have been required in the first place! Every site should be responsive. If we had not gone off the deep end with fancy gewgaws, widgets, gizmos, and buzzwords, every web site would already have been “responsive”. This will date me, but I remember the early days of the web when graphical web browsers were the new thing. I remember using Netscape 0.8. I remember when Lynx was the best option in many circumstances. Back then, we didn’t have these fancy-ass sites that we have now. But they worked! They were “responsive”! Nobody needed a buzzword to make something responsive back then. It just was. In short, the notion of “responsive design” should not be a novel thing. It should just be the thing. Design should just be responsive by default.

 

Why C++ Is Bad

First of all, this is an opinion piece. If you don’t agree, fine. Regardless of your opinion, comment if you must, but remember, comments that are completely unconstructive will not be approved.

Now, on to the point of this rant. C++ sucks. It is, perhaps, tied for the worst language design in the history of computing. Here is why.

First, C++ builds on C, but instead of outright replacing features of C that C++ provided alternatives for, it simply bolted additional aspects onto the existing C language. Thus, instead of having only one keyword to make an aggregate type as in C, you now have two with slightly different semantics. Instead of pointers, you have references which are almost but not quite pointers, and also still pointers. Instead of disallowing the C standard library and providing its own, you have the C++ standard library but you can also use the entire C standard library. Indeed, the end result is basically all the festering badness of C combined with a poorly conceived and implemented object model.

Let me go back to early C++ before the standards committees got their hooks into it. It was a simple enough language which had useful features compared to C. It added operator overloading, function overloading, and objects. In all, it seemed a fairly decent expansion on C. There was only one problem. There was no way to handle error conditions in constructors or destructors, making it impossible to properly handle resource acquisition and release. Why it wasn’t possible to return an error from constructors and destructors does make some sense given their actual usage in many circumstances. It is also not an insurmountable problem, after all, since careful programming can avoid most problems in the destructor and a “factory” method can solve the constructor problem.

Even early C++ showed the beginnings of a horribleness, though. The standard library, particularly the iostream stuff, demonstrates a complete lack of understanding of real world I/O problems. Rather than simply encapsulating something like the C stdio subsystem, instead it provided a ridiculously baroque pile of crap that could only be used effectively for very basic text I/O. While stdio sucks somewhat for structured input, iostream sucks more. Instead of a relatively compact scheme like the printf() family of functions for formatted text output, you have to implement a wordy chain of mutators and output values that does little more than obfuscate just what you are attempting to output in the first place. Instead of being able to scan a format string and compare it with a relatively short list of formatting codes and explanations, you have to somehow divine the initial state of the output stream and then track potentially dozens of state shifting objects just to work out how numbers are going to be formatted. And all of this is using overloaded bit shift operators (apparently because hardly anybody does bit shifting).

C++ apologists and fans will claim that the C++ way is more object oriented. I clearly have no understanding that the C++ way is the one I really want. To them I point out that I learned C++ first and the C. Yes, that’s right. I learned the C++ way first. Then I found printf() and friends and I was enlightened.

Fortunately, it is perfectly possible to completely ignore the iostream bullshit and go straight to using stdio. Consider that stdio gives pretty much everything that iostream attempts to provide. The FILE pointer is basically an object with methods on it (all the stdio functions that operate on FILE pointers) including constructors (fopen() among others) and a destructor (fclose()). Used according to the defined interface, this is a perfect implementation of object orientation. There is no reason that stdio could not have provided a means to attach arbitrary user defined schemes into the FILE object or attach additional format specifiers to the printf() family functions.

Had C++ stopped there, it would have been bad enough, but not horrible. Instead, additional features were slowly grafted onto the language. One such feature was templates, which were conceived in the absence of any inkling of how to properly implement them. Sure, the notion looks brilliant on paper. Let’s write a generic set of code for, say, a linked list, and have the compiler generate specific code for the contents of the list. And this works brilliantly in a single file compilation. But what happens if the template is instantiated in several different files in the same project? Oops! Suddenly you have linker errors from multiply defined symbols! This was such a problem that I remember needing to use “-fno-implicit-templates” and have a separate file that instantiated all of the templates needed for a project. That was bad. It actually needed a thing called “weak symbols” or something like that and special support from the linker to make it work properly.

Even templates were not horrible on the surface except for the terrible syntax selected for them. (< and > anyone? Duh! Obviously a bad idea!) They should have had a distinct syntax using distinct tokens. But that would have potentially broken previous code or something like that. That just wouldn’t do. Instead of inventing a new syntax with new tokens and calling it C++ Version 2 or something, they overloaded existing tokens.

At around the same time came namespaces which seem like a good idea, and, in fact, are probably one of the few features of modern C++ that is not a horrible festering pile of crap. At the very least, they allow one to avoid name conflicts between different packages. Even the choice to use the same scope resolution operator (::) is not a horrible one. I don’t actually have an issue with this one save that it changes the semantics of the language at a fairly late date, but that’s even minor.

The biggest thing that causes no end to trouble is exceptions which were ostensibly a solution to the fact that constructors and destructors cannot report errors. But it actually solves the wrong problem there. In fact, exceptions are not necessarily a bad idea when used judiciously (which they aren’t), but the execution in C++ is terrible as it interacts horribly with other language constructs, notably object construction and destruction when resource exhaustion is involved and used by the exception object. Why it was deemed a good idea to be able to throw an arbitrary object as an exception mystifies me – it makes no sense whatsoever. There should have been strict restrictions on what was throwable. Leaving aside the horrible coding style that exceptions encourage, which is hardly the language’s fault, the execution in C++ makes it nearly impossible to do them right.

As C++ developed over the years, it became clear that the ivory tower academics (read idiots who do no real work) had increasing influence, attempting to turn C++ into a poster child for “abstract data types” and all that rot that they teach in comp sci programs. The absolute worst aspect of modern C++ is the standard template library (STL) which epitomizes the academic ideal of data structure design. What that means is it is horribly inefficient, excessively verbose, and difficult to do any real work using it. Worst of all, it attempts to implement a series of data types that make absolutely no sense in C++ or which cannot be sensibly implemented, and these standard templates often have a convoluted incestuous relationship. This is in addition to the non-template standard types which suffer from all the same problems. The thing all this stuff forgets is that there is no one size fits all solution to any non-trivial data structure, and that often holds true for trivial ones too.

Of course, I can simply ignore the standard template library and the standard data types, for the most part, in my own code, but as soon as I attempt to interface with a library or code written by someone else, I have no alternative but to deal with that crap. The problem is further exacerbated by the existance of such “helpful” libraries as Boost.

In short, C++ is a shining example of design by academic committee. (An academic committee is orders of magnitude worse than an plain committee.) This is the absolute worse way to design a programming language. Period. And this is why C++ is bad, and also why I do everything I can to avoid using it at all. I find object orientation in plain C to be far more pleasant than modern C++. (Yes, you can do object orientation in C with relatively little overhead, and with much more flexibility than in C++. All it takes is a bit of discipline, and that is required to do things right in C++ too.)

Anyway, that’s it for my rant. You may or may not agree. That is your perogative. There is plenty on the internet lambasting C++ and plenty praising it. I invite you to read both sides with an open mind and make your own determination. Consider actual real world situations with real code to understand the issue. Examples presented in literature on both sides are contrived to support the point being made. Examine the examples and understand how they work and why. That can be as enlightening as the prose in any argument.

 

Exceptions Are Evil

Exceptions are evil.

Now that I’ve pissed everyone off, I can get on with explaining. Rather than talk in the abstract about this in the abstract, which is not helpful at all, I am going toss in an example or two. It is far easier to illustrate my point with examples in this case.

First, let me start by saying that the general concept of exceptions is not necessarily evil. Indeed, any program that is properly written must deal with exceptions in some manner, preferably without crashing randomly. Where things tend to go horribly wrong is when a programming language implements support for some flavour of exceptions. For instance, C++. It can obviously also go horribly wrong when a programmer messes up in a language that doesn’t implement any version of exceptions directly.

Leaving aside the syntactic problems with exceptions for the moment, let us consider what an exception actually is. The word would suggest an exception should be reserved for a truly unusual circumstance – something that should not normally happen. But what exactly does “unusual” mean?

Let’s examine the case of allocating memory on the heap. Ordinarily, this succeeds without a lot of fuss. But what happens if it fails for some reason, say there is no more memory available or the process’s address space is full? In C, the malloc() family functions return NULL to indicate this condition. The program then has to check the return value or simply crash when dereferencing the resulting NULL pointer. In most programs, running out of memory would be an exceptional condition so it would seem to be an excellent candidate for exceptions.

Let’s examine another case. Suppose you are opening a file to read some stuff in it. Now suppose access is denied or the file doesn’t exist. These are exceptions, too, right? Well, not necessarily. If you’re iterating through a list of possible file names to find a file that does exist, or which directory it exists in, experiencing a file not found situation is hardly an unusual circumstance, is it? Having a file with no access permission is likewise fairly normal on systems that support such things. If the file in question happens to be a critical part of the overall program that is running, that might constitute an exception since that is an unusual circumstance meaning that the program is incorrectly installed. But a missing input file to a compiler is likely not an exceptional condition and is one that the compiler will want to deal with directly.

On the surface the two situations above look the same, but they really aren’t. With the second example, the situations described are error conditions, but they are not unusual conditions. They are, in fact, ordinary conditions that are expected to occur in the ordinary operation of the system. They are, by definition, not exceptional.

One of the biggest issues I have with exceptions as they are implemented and used in modern systems is that the notion of an exception has been abused into a general purpose system for returning any error condition, routine or otherwise. I can possibly defend an exception on memory allocation failure, but I have a great deal of trouble accepting that an exception is the right solution when a file simply does not exist.

Modern systems have conflated error handling and exception handling. This may be partly due to the fact that without an exception mechanism, exceptions and ordinary errors have to be handled the same way – by an error indication from a particular call. The other source of the conflation is possibly due to the fact that many “software architects” do not understand why everything is not an exception.

All of that would be relatively minor, however, if the syntax for exceptions were not horribly verbose and insanely tedious to use, even when the language itself provides a solid bug-free implementation. The biggest problem is that this excess verbosity and extra boilerplate code encourages moving the error handling to a point distant from the actual source of the problem. The idea is that the ordinary flow of the code is easy to discover by extracting the error handling from within the main flow. However, this makes the flow non-obvious in the case of something like a missing file. If you do have a solution to a missing file, say by trying a different file, you can’t just continue where you left off if you have one big block and handle all the exceptions at once. Instead, you have to wrap everything that can fail with its own exception handling boilerplate. Now you have duplicated the exact thing you thought you could avoid by using exceptions in the first place! For instance:

result = somefunc();
if (result < 0)
{
    result = tryotherfunc();
    if (result < 0)
    {
        handle_error();
    }
}

if (do_something_with_result(result) < 0)
{
    handle_error();
}

if (do_something_else_with_result(result) < 0)
{
    handle_error();
}

The above is relatively clear. The precise mechanics of how handle_error() would have to work depends on the specific calls, of course. Let’s see what that might look like if we have a language using a “try { code… } catch ….” structure:

try
{
    try
    {
        result = somefunc();
    }
    catch (exception)
    {
        result = tryotherfunc();
    }
    
    do_something_with_result(result);
    do_something_else_with_result(result);
}
catch (exception1)
{
    handle_error();
}

That doesn’t look too bad, does it? But what if we need to do something different depending on which failure condition occurred with, say, do_something_with_result()? And suppose that do_something_else_with_result() might also throw the same exception for a different reason and we can handle that one too? Oops, now we need a try/catch block around each call. We can no longer get away with just the outer one. And for each exception we need to handle, we have to add another catch statement after the try block. It starts to get confusing rapidly.

Of course, you have to add some sort of if/else structure to do the same thing in the first example, too, but that is not going to be any more verbose than the try/catch structure.

There is another case here that should be examined, too. Let’s look at the case where you want to remove a file. Now suppose we are using the first style. If an error occurs, the call returns an indicator of such. If we are using the second style with exceptions, it throws an exception. Now suppose you don’t care if the file didn’t already exist. In the former, you can just ignore the condition. In the latter, you have to explicitly trap the exception and ignore it. The same situation often occurs when closing a file before exiting a program. Sure, errors can occur at that point, but what can you actually do about them? If exceptions are being thrown in that case, the do no good but you would still have to handle them or get mysterious unhandled exceptions at program shutdown.

There is another case where exceptions seem to be the best solution. Suppose the call is to some sort of implicit function, say an object constructor or an overloaded operator (overloading is evil too, by the way). In many languages, this is needed because objects can be constructed implicitly. That means there is no code flow that can trap an error return. The same thing applies to a destructor, which is often called implicitly. It is my contention, however, that if a constructor needs to do something that might fail, (resource acquisition, for instance), it really should not be doing anything implicitly. Instead, some sort of factory operation should be used to obtain a new object. In that case, explicit destruction is then obviously indicated. In short, if an object cannot be implicitly instantiated, there is no need for exceptions. Similarly, if you eliminate the notion of operator overloading, you eliminate the possibility of needing to fail randomly in expressions, too.

I should note that explicit construction is not actually horribly verbose in most cases, and certainly not terribly more verbose than what would be needed to express the static initializer anyway. Also, being explicit with an operation on an object does not preclude being able to chain operations together. It’s easy enough to put an object into an error state that can propagate through chained calculations and then be checked at the end. Sure, you might need multiple statements to chain the calculation together with explicit destruction being required and that might be a bit tedious, but it has the advantage of being absolutely clear what is going on.

And there is the final disadvantage of using exceptions instead of explicit instantiation and destruction, though this is not specific to exceptions per se. It is not at all clear whether a particular usage pattern is doing something dumb or not if you can just chain stuff together with implicit construction and destruction. Sure, you can create resource leaks with explicit destruction, but at least you can see if you are creating and destroying stacks of objects needlessly, and you can bail out in a controlled manner at any stage you like. It is also explicit that something which might fail is happening so you generally have a clear idea what failed if your program crashes at a particular location.

To this point, I haven’t even examined how the actual language exception implementation could be problematic, either. Suppose you have a language with explicit memory management like C++. Now suppose exceptions are objects. You throw an exception, but you have to dynamically allocate it somehow because it would go out of scope as the stack unwinds otherwise. How do you ensure the freshly minted exception gets freed correctly after the exception is handled? What if the object was a static object instead of a dynamic one (which also wouldn’t go out of scope)? What if memory allocation fails when creating the new exception object? Yowza! Things can go wrong in a hurry! Of course, in a language with automatic memory management, like javascript or java, this is much less of a problem. (Though what do you do if the exception object fails to initialize? Throw an exception?)

In short, in my never humble opinion exceptions are useless syntactic HFCS (high fructose corn syrup). They taste great but have at best dubious benefits and more likely are deleterious to maintainability or correctness in the long run. Even when well implemented using something less verbose than try/catch, they do not seem to be a substantial improvement over classical error handling. And, if they are only used for truly exceptional circumstances, the classical error handling would still need to be present for many circumstances (non-exceptional failures).

There is nothing wrong with explicit error handling inline in cleanly structured code. If you think you need exceptions to make your code clear, you are going down the wrong path.

Why Wikis Are Stupid

For a while, having a wiki was what all the cool kids were doing. It was the ultimate in collaboration, allowing anybody who cared to edit any page they wanted to to read anything they wanted. Certain high profile successes (Wikipedia for instance) only served to make the idea even more cool. For a time, wikis sprang up everywhere, even for sites where a traditional content management system made as much or more sense.

What people failed to realize at the time was that the very feature of wikis that makes them so useful for collaboration – their wide open editing policies – also makes them wide open for abuse by the less scrupulous types out there. It should not have come as any surprise that open wikis suddenly became the hottest venue for spammers and other nefarious types to peddle their wares. After all, it happened on UseNET, email, forums, and every other open forum.

These days, running an open wiki requires intense oversight from the administrator. Once the spammers find an open wiki, they will hammer away at it until the end of time, adding ever more garbage to the content. Even simply reverting the spam edits does not remove the contents from the wiki which, after all, stores previous revisions which are still accessible. No, running an open wiki requires being able to permanently remove the garbage and that must be done continually.

Of course, most wikis are really targeted at a fairly small community so restricting edits to authorized users is a reasonable solution. But that also requires some oversight from the administrators. If one simply allows anyone who wishes to create an account and start editing immediately, the abuse noted above will sill occur. After all, spammers can register accounts as easily as anyone else. That means there must be a manual vetting stage for new accounts and that requires administrator intervention. And even once an account is approved, the activity must still be monitored and abusive accounts stopped in their tracks.

In the light of all that, is a wiki a good idea? In the general case, no, it is not. Not even a moderated one. Before putting up a wiki, you should consider carefully whether you really need the functionality. Is there a real benefit to it? Are you really just creating a site where people can discuss things? If so, maybe a forum is better. Are you just trying to get your information out and only a handful of people in your organization will be editing content? If so, a standard content mangaement system is probably all you need.

The fact that wikis are fairly rare compared to the number of sites out there should tell you something. And among the wikis that do exist, almost all require some form of authorization before edit access is allowed. That should also tell you something.

In short, wikis (as originally imagined) are stupid. They simply ignore the nature of the population in general.

Web Site Development and Sessions

Sessions are used all the time by web site developers, often without the developer realizing it. It turns out, however, that sessions are immensely overused and they tend to cause all manner of random problems. My perspective on this is by no means unique but I do wear multiple hats. One hat is as server administrator with hundreds of sites hosted. Another hat is as a web developer. The final relevant hat is as a web site operator. All three hats lead to the following conclusions.

Sessions are over-used

The biggest thing I have noticed over the years is that sessions are overused. Sure, some sort of session makes sense when you need to track a login through portions of a site. But the portions of the site which are public should not need access to any session information, period. If there is no session already in use, there is no need to initiate one if some random member of the public arrives on a public page on your site. You may think you need the session to change the navigation or some other element for a logged in user, and you would be correct, to a point. But if you do not initiate a session for a user unless he logs in, you can still identify a logged in use by the presence of a session combined with whatever session validation you use.

Of course, login tracking is not the only thing sessions get used for. It is simply the most common. However, if you are using a session to track users through your site or something more nefarious, you should consider whether you really need to do that. Are you actually deriving any concrete benefit from doing so? Do you really need a session to collect the information you desire? Do you really need to personalize every page with the visitor’s name or whatever cutesy thing you’re doing?

Sessions are poorly implemented

Completely orthogonal to whether sessions are used needlessly or not is that fact that sessions are often implemented poorly or a session mechanism not well suited for the task at hand is used for whatever reason.

I will pick on a particularly common example of session handling which illustrates several problematic features quite nicely. This particular session handling scheme is the one implemented by default in PHP.

By default, a PHP session exists as a file stored on the web server paired with a cookie that holds the session identifier. When a PHP script activates a session, PHP looks for the cookie and if it finds one, it reads the session data file. But not only does it read the data file, it also locks it, preventing another PHP script from activating the same session at the same time. Then, when the session is released, often implicitly by the end of the script, PHP writes the session data back to the file and finally unlocks it. Note that it rewrites the session data even if nothing has changed.

There are two major things wrong with this approach, as commonly used.

Request serialization

First, because almost nobody writing PHP code knows about the locking or even understands how locking works, this leads to scripts that start with “session_start();” and never release the session. As a result, any scripts that run as part of the same session will run serially. If one script is already running and another tries to start the same session, it will block at session_start() until the previous script finishes.

This is not terribly problematic for cases where only a single script is likely to be running at the same time within a session. However, with the advent of such things as ajax, a single ajax request will block all other ajax requests on the page until it completes. Indeed, even the initial page load might block any ajax requests. Thus, instead of having the page load and asynchronously fill in any ajax type content, instead, elements load up one by one, harkening back to the days of really slow dial-up networking. This can manifest particularly frustratingly to the user who clicks on something while the page is loading or something like that and nothing happens for long seconds while other scripts finish churning away on the server.

But even if the programmer is aware of this problem and defends against it by releasing the session immediately after it is no longer needed, the session still must be maintained for the entire duration where it is possible that session data will need to be modified.

Rewriting unchanged data

The PHP session system also rewrites the session data when nothing has changed. This is generally unseen by users of the site or even by the programmer. The people who notice this are server operators who witness higher write volumes on their disks. This, in turn, leads to slower performance on the server and generally annoys server operators and users alike. However, since this is a problem in aggregate rather than from one single source, there is little a server operator can do to mitigate it.

One interesting thing, though, is that rewriting data needlessly can lead to data corruption in the case a process crashes. Of course, that can happen any time, but if you are not writing data when you crash, there is no chance you corrupt that data. Thus, rewriting unchanged data is generally a dumb idea anyway.

Storing too much

Another common issue with sessions is that too much data is stored in the session manager’s data store. That means in the $_SESSION superglobal in PHP but it could as easily be some custom rolled scheme.

Because the session has to be read by everything that needs information about the current, well, session, the more data that is thrown around, the longer every script takes to execute. If every time something needs the logged in user ID, it also needs to read half a megabyte of other stored state, then you create a needlessly high memory, I/O, processing, and storage overhead for every script. A few milliseconds may not be noticeable on a single script, but consider if your site suddenly gets busy. Those few milliseconds here or there suddenly start adding up to your web server falling over dead.

Instead, use a separate data store for the mutable data like, say, shopping cart contents or cached credentials. Further, don’t cache anything that you can’t easy and quickly recalculate cheaper than the cost of caching it. If you have a rather barroque permissions system, it might make sense to cache that information somewhere, for instance. However, anything you cache, make sure you can recreate it if the cache is unavailable. For other things, like shopping cart data, you might consider using a more permanent storage system and garbage collecting it periodically. If your site already requires access to a database server, for instance, you might consider using that to store the authoritative copy of the cart. Caching might make sense for that sort of thing, but the previously mentioned caveats still apply.

The data stored in the actual session should be small, and largely immutable. There is rarely any more need for anything other than a single identifier, possibly identifying the particular user or particular session. Multiple identifiers might make some sense depending on your circumstances. However, storing the entire result of sorting a list of 500 items in the session store is ridiculous. (That is recalculable and should be cached separately if caching is indicated.)

For those of you well familiar with web technologies, you may have realized that with this sort of minimalist session scheme, the entire “managed” session data can be stored in a single cookie. Indeed, this scheme eliminates both of the PHP specific problems identified above. Also, using a single cookie to store the managed session data largely eliminates any storage bottlenecks on the server as it avoids any uneeded disk writes there.

Session data is unprotected

The final problem I see all the time is that session data is not authenticated or protected at all. The sesson manager is trusted to get this right, and, realistically, it should. That means the PHP session system needs to authenticate its sessions.

Exactly what this means depends a great deal on the session system. PHP can be relatively sure that the session data itself is not visible to a remote user because it is stored in a file on the server. However, other users on the server can potentially read that data. PHP makes no attempt to obfuscate or otherwise encrypt the data it writes on the server. This is likely due to performance concerns and code complexity. Similarly, it makes no attempt to verify that what it is reading from the session file is what it previously wrote out to it. That means any random third party can modify the file and possibly corrupt the session.

Storing everything in a cookie (or even just the session identifier) has a similar problem but now anything in the communication path can potentially see the contents, including proxy servers, possible network sniffers, and software at either end. Thus, some steps need to be taken to be certain that the cookie contents you get back is contents you created in the first place. If you set anything remotely sensitive in the cookie (which you shouldn’t), you also need to make certain the contents cannot be easily read by third parties. Fortunately, relatively common cryptography techniques can be used to provide adequate protection for most situations. (The same techniques can be applied to local cache files, too.) Look up HMAC for more information on such schemes.

Conclusions

The above leads to the following specific advice.

  • Store as little as possible in whatever session manager you use
  • Release the session as soon as possible. If you are only reading data, release it as soon as you’ve read what you need. If you are writing data, do so immediately after writing the data. If what you write doesn’t depend on what you read, you might release the session and re-aquire it later to update it if your script is going to run for a while.
  • Store only data that is likely to remain unchanged once you put it in the session and use other data stores for data that is likely to change.
  • Do not use the session store as a connection-specific cache. If your cached data depends on the session information, use another data store for the cache and store an index in the session.
  • This is more general, but only acquire the resources you need for the script rather than everything all the time. That means do not acquire the shopping cart information when your script is processing a login. A bottleneck on one resource should not affect scripts that do not operate on that resource.
  • If something needs to be locked to maintain consistency, only lock it when you are going to operate on it and unlock it immediately after you are finished. Do not rely on a “big session lock” to do this for you.