Web Site Development and Sessions

Sessions are used all the time by web site developers, often without the developer realizing it. It turns out, however, that sessions are immensely overused and they tend to cause all manner of random problems. My perspective on this is by no means unique but I do wear multiple hats. One hat is as server administrator with hundreds of sites hosted. Another hat is as a web developer. The final relevant hat is as a web site operator. All three hats lead to the following conclusions.

Sessions are over-used

The biggest thing I have noticed over the years is that sessions are overused. Sure, some sort of session makes sense when you need to track a login through portions of a site. But the portions of the site which are public should not need access to any session information, period. If there is no session already in use, there is no need to initiate one when some random member of the public arrives on a public page of your site. You may think you need the session to change the navigation or some other element for a logged in user, and you would be correct, to a point. But if you do not initiate a session for a user unless he logs in, you can still identify a logged in user by the presence of a session combined with whatever session validation you use.

Of course, login tracking is not the only thing sessions get used for. It is simply the most common. However, if you are using a session to track users through your site or something more nefarious, you should consider whether you really need to do that. Are you actually deriving any concrete benefit from doing so? Do you really need a session to collect the information you desire? Do you really need to personalize every page with the visitor’s name or whatever cutesy thing you’re doing?

Sessions are poorly implemented

Completely orthogonal to whether sessions are used needlessly is the fact that sessions are often implemented poorly, or that a session mechanism ill suited to the task at hand is used for whatever reason.

I will pick on a particularly common example of session handling which illustrates several problematic features quite nicely. This particular session handling scheme is the one implemented by default in PHP.

By default, a PHP session exists as a file stored on the web server paired with a cookie that holds the session identifier. When a PHP script activates a session, PHP looks for the cookie and if it finds one, it reads the session data file. But not only does it read the data file, it also locks it, preventing another PHP script from activating the same session at the same time. Then, when the session is released, often implicitly by the end of the script, PHP writes the session data back to the file and finally unlocks it. Note that it rewrites the session data even if nothing has changed.

There are two major things wrong with this approach, as commonly used.

Request serialization

First, because almost nobody writing PHP code knows about the locking or even understands how locking works, this leads to scripts that start with “session_start();” and never release the session. As a result, any scripts that run as part of the same session will run serially. If one script is already running and another tries to start the same session, it will block at session_start() until the previous script finishes.

This is not terribly problematic for cases where only a single script is likely to be running at the same time within a session. However, with the advent of such things as Ajax, a single Ajax request will block all other Ajax requests on the page until it completes. Indeed, even the initial page load might block any Ajax requests. Thus, instead of the page loading and asynchronously filling in any Ajax type content, elements load up one by one, harkening back to the days of really slow dial-up networking. This is particularly frustrating for the user who clicks on something while the page is loading and then sees nothing happen for long seconds while other scripts finish churning away on the server.

But even if the programmer is aware of this problem and defends against it by releasing the session immediately after it is no longer needed, the session still must be maintained for the entire duration where it is possible that session data will need to be modified.
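For illustration, a minimal sketch of that defensive pattern (the session keys here are purely illustrative):

```php
<?php
// Read what we need, then release the session lock immediately so
// concurrent requests (Ajax calls, parallel page loads) are not
// serialized behind this script.
session_start();                       // acquires the session file lock
$userId = isset($_SESSION['user_id']) ? $_SESSION['user_id'] : null;
session_write_close();                 // writes the data and drops the lock

// ... long-running work proceeds without holding the lock ...

// If the session must be updated later, re-acquire it briefly.
session_start();
$_SESSION['last_seen'] = time();
session_write_close();
```

The window where other requests are blocked shrinks from the whole script to the few lines between each session_start() and session_write_close().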

Rewriting unchanged data

The PHP session system also rewrites the session data when nothing has changed. This is generally unseen by users of the site or even by the programmer. The people who notice this are server operators who witness higher write volumes on their disks. This, in turn, leads to slower performance on the server and generally annoys server operators and users alike. However, since this is a problem in aggregate rather than from one single source, there is little a server operator can do to mitigate it.

One interesting thing, though, is that rewriting data needlessly can lead to data corruption if the process crashes. Of course, a crash can happen at any time, but if you are not writing data when you crash, there is no chance you corrupt that data. Thus, rewriting unchanged data is generally a dumb idea anyway.

Storing too much

Another common issue with sessions is that too much data is stored in the session manager’s data store. That means the $_SESSION superglobal in PHP, but it could as easily be some custom-rolled scheme.

Because the session has to be read by everything that needs information about the current, well, session, the more data that is thrown around, the longer every script takes to execute. If every time something needs the logged in user ID, it also needs to read half a megabyte of other stored state, then you create a needlessly high memory, I/O, processing, and storage overhead for every script. A few milliseconds may not be noticeable on a single script, but consider if your site suddenly gets busy. Those few milliseconds here or there suddenly start adding up to your web server falling over dead.

Instead, use a separate data store for mutable data like, say, shopping cart contents or cached credentials. Further, don’t bother caching anything you can recalculate easily and quickly at less cost than caching it. If you have a rather baroque permissions system, for instance, it might make sense to cache that information somewhere. However, for anything you cache, make sure you can recreate it if the cache is unavailable. For other things, like shopping cart data, you might consider using a more permanent storage system and garbage collecting it periodically. If your site already requires access to a database server, for instance, you might consider using that to store the authoritative copy of the cart. Caching might make sense for that sort of thing, but the previously mentioned caveats still apply.

The data stored in the actual session should be small, and largely immutable. There is rarely any need for more than a single identifier, possibly identifying the particular user or particular session. Multiple identifiers might make some sense depending on your circumstances. However, storing the entire result of sorting a list of 500 items in the session store is ridiculous. (That is recalculable and should be cached separately if caching is indicated.)

For those of you well familiar with web technologies, you may have realized that with this sort of minimalist session scheme, the entire “managed” session data can be stored in a single cookie. Indeed, this scheme eliminates both of the PHP specific problems identified above. Also, using a single cookie to store the managed session data largely eliminates any storage bottlenecks on the server as it avoids any unneeded disk writes there.

Session data is unprotected

The final problem I see all the time is that session data is not authenticated or protected at all. The session manager is trusted to get this right and, realistically, it should. That means the PHP session system needs to authenticate its sessions.

Exactly what this means depends a great deal on the session system. PHP can be relatively sure that the session data itself is not visible to a remote user because it is stored in a file on the server. However, other users on the server can potentially read that data. PHP makes no attempt to obfuscate or otherwise encrypt the data it writes on the server. This is likely due to performance concerns and code complexity. Similarly, it makes no attempt to verify that what it is reading from the session file is what it previously wrote out to it. That means any random third party can modify the file and possibly corrupt the session.

Storing everything in a cookie (or even just the session identifier) has a similar problem but now anything in the communication path can potentially see the contents, including proxy servers, possible network sniffers, and software at either end. Thus, some steps need to be taken to be certain that the cookie contents you get back are contents you created in the first place. If you set anything remotely sensitive in the cookie (which you shouldn’t), you also need to make certain the contents cannot be easily read by third parties. Fortunately, relatively common cryptography techniques can be used to provide adequate protection for most situations. (The same techniques can be applied to local cache files, too.) Look up HMAC for more information on such schemes.
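A minimal sketch of such a scheme, using PHP’s hash_hmac() and hash_equals() (the secret, delimiter, and value here are all illustrative, not any standard):

```php
<?php
// HMAC-protected cookie contents: sign on the way out, verify on the
// way back in. hash_equals() (PHP 5.6+) avoids timing side channels.
$secret = 'replace-with-a-long-random-server-side-secret';

function sign_value($value, $secret) {
    return hash_hmac('sha256', $value, $secret) . '|' . $value;
}

function verify_value($signed, $secret) {
    $parts = explode('|', $signed, 2);
    if (count($parts) !== 2) {
        return false;                  // malformed cookie
    }
    list($mac, $value) = $parts;
    $expected = hash_hmac('sha256', $value, $secret);
    return hash_equals($expected, $mac) ? $value : false;
}

// Setting: setcookie('session', sign_value('42', $secret));
// Reading: $userId = verify_value($_COOKIE['session'], $secret);
```

Note this only authenticates the value; if the value itself must be unreadable, it needs to be encrypted as well.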

Conclusions

The above leads to the following specific advice.

  • Store as little as possible in whatever session manager you use
  • Release the session as soon as possible. If you are only reading data, release it as soon as you’ve read what you need. If you are writing data, release it immediately after writing the data. If what you write doesn’t depend on what you read, you might release the session and re-acquire it later to update it if your script is going to run for a while.
  • Store only data that is likely to remain unchanged once you put it in the session and use other data stores for data that is likely to change.
  • Do not use the session store as a connection-specific cache. If your cached data depends on the session information, use another data store for the cache and store an index in the session.
  • This is more general, but only acquire the resources you need for the script rather than everything all the time. That means do not acquire the shopping cart information when your script is processing a login. A bottleneck on one resource should not affect scripts that do not operate on that resource.
  • If something needs to be locked to maintain consistency, only lock it when you are going to operate on it and unlock it immediately after you are finished. Do not rely on a “big session lock” to do this for you.

Writing Reasonable PHP

PHP gets ragged on a lot for various reasons. One of the biggest complaints I see is that PHP is “insecure” as if writing bad code in PHP is somehow PHP’s fault. The other major complaint is not so much a complaint against the core language as against the standard library and runtime environment and refers to the chaotic nature of the standard functions in particular. Complaints about the latter have merit but PHP is far from the only popular language to have that problem. The former might have some merit but it is just as ridiculous as blaming C because programmers write buffer overflows. It is not strictly PHP’s fault when programmers do stupid things. Granted, PHP makes a lot of stupid things very easy and some of the early design decisions for the PHP runtime environment are questionable in hindsight, but writing sensible PHP code is not impossible or even especially difficult.

Types of PHP Code

Before I delve too far into the intricacies of PHP, let me touch on the types of coding that PHP can be used for.

PHP was designed (or evolved, really) as a means to enhance largely static web pages. It fit into the same niche as Microsoft’s Active Server Pages. It was designed to make adding a small amount of dynamic content to an otherwise largely static page easy. While this is still common today, it is no longer the primary use case. This is also the reason for a lot of the somewhat questionable design decisions for the runtime environment (such as the ever popular and justifiably maligned “register_globals” feature).

As it gained popularity, it began to edge out the use of CGI scripts written in Perl or other languages. This was partly due to the complexity of dealing with CGI on most servers and partly due to the fact that PHP itself handled all of the boilerplate stuff needed to deal with the CGI interface – decoding script input primarily. Thus, PHP scripts moved more toward being PHP code with HTML content embedded in it instead of HTML code with PHP embedded in it. Some of the more unfortunate design decisions were addressed at this point (during the 4.x series), including the “register_globals” problem, with the introduction of the “superglobal” arrays and a few other things. PHP also gained a sort of object orientation and a massive collection of “extensions”, many of which are bundled and/or enabled by default. This type of coding is the most common today – programs that are still intended to run in a web server environment and resemble the classic CGI script more than the classic “active page” model.

Finally, PHP gained a command line variant. With a few tweaks to the runtime environment, it became possible to write programs that do not depend on the presence of a web server or the CGI interface specification. Most of the historical runtime design issues do not apply to a command line PHP program. However, the source format remains the same including the PHP open/close tags.

A Sensible PHP Environment

A great deal of sanity can be obtained before a single PHP statement is written by setting up the environment in a sensible manner. Most of the features of PHP that are maligned (often justifiably) by critics can be turned off in the PHP configuration file. Notably, one should turn off register_globals, all magic quotes variants, register_long_arrays, allow_url_include, and allow_url_fopen. There are other configuration options that make sense to disable too, depending on which extensions you are using.
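As a starting point, the relevant php.ini lines look something like the following (the exact set varies by PHP version; several of these options were removed entirely in later releases):

```ini
; php.ini: disable the risky conveniences
register_globals     = Off
magic_quotes_gpc     = Off
magic_quotes_runtime = Off
magic_quotes_sybase  = Off
register_long_arrays = Off
allow_url_include    = Off
allow_url_fopen      = Off
```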

It should be noted that disabling some of these settings makes coding less convenient. However, often the convenience comes at the cost of clarity or even security.

Writing PHP Code

Most of the recommendations here apply to all programming languages. Let me stress that. Writing good code requires discipline in any language.

Check Inputs

One of the biggest sources of problems with any program is failure to check input data. Anything input by a user must be viewed as suspect. After all, the user might be malicious or simply make an error. Relying on user input to be correct is never the right thing to do. Steps must be taken to ensure that bogus input data does not cause your program to misbehave. Inputs that cannot be handled should produce error conditions in a controlled manner.

Many programmers do grasp this concept intuitively. Input checking code is often present when handling direct user input. However, most overlook the simple fact that data coming from anywhere outside the program code itself must be treated as suspect. You cannot be certain that what you wrote to a data file is still in that file. It could have been corrupted by a hardware failure, user error, or the file could have been replaced with another type of file, all without your program being aware of it. The same applies to data stored in a database system like MySQL or in a session cache or a shared memory cache somewhere.

The advice here: Verify everything. Failure to do so correctly is not a weakness in PHP but in the programmer. It is also the single largest source of security problems. Careful adherence to this principle will quickly yield much better code.
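A trivial sketch of the idea, assuming a numeric identifier arrives via $_GET (the field name and validation rules are illustrative):

```php
<?php
// Treat all external input as suspect: accept exactly what you expect
// and reject everything else in a controlled way.
function read_item_id($raw) {
    // Expect a string of digits representing a positive integer.
    if (!is_string($raw) || !ctype_digit($raw)) {
        return false;
    }
    $id = (int)$raw;
    return ($id > 0) ? $id : false;
}

// Usage: $id = read_item_id(isset($_GET['id']) ? $_GET['id'] : null);
// then handle a false return as an error condition, not as data.
```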

Check Returns

Closely related to the previous item, and high up on the list of programmer errors, is failing to check return values from function calls. Most library functions will have some sort of return value. For functions that can fail for whatever reason (bad parameters fed in, external state, etc.), it is absolutely critical to check for those failure conditions and handle them in a manner that is appropriate for your program. These conditions can be as simple as a data file being missing or as complicated as a remote socket connection timing out or the database server going away.

Study all function calls you use and make certain you understand what failure conditions exist. If a failure condition will cause your program to fail or otherwise misbehave, handle it. If a failure condition is “impossible”, it is doubly critical to handle it. That said, if a failure condition will not cause your program to misbehave or otherwise fail, it can be ignored, but make absolutely certain that is the case and document why.

The advice here: Always check return values.
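For example, a sketch of reading a data file with the failure paths handled (the file format and function name are illustrative):

```php
<?php
// Every call that can fail gets its return value checked: fopen()
// returns false on failure, and fgets() returns false at end of file
// or on error.
function sum_first_column($file) {
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        return false;                  // report the failure upward
    }
    $sum = 0;
    while (($line = fgets($handle)) !== false) {
        $fields = explode(',', $line);
        $sum += (int)$fields[0];
    }
    fclose($handle);
    return $sum;
}
```

The caller then decides what “appropriate” handling is: log and abort, substitute a default, or try another source.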

Protect Output

This one is a lot less obvious and is best explained by example. Suppose you are outputting some text into an HTML document and you do not know in advance what characters that text contains. In HTML, some characters have special meanings (such as quotes) but are also valid in actual text. These special characters have to be protected in a medium-appropriate way. In the HTML case, they would be replaced with appropriate entities. This is a common case in PHP programming but it is not the only one. The same applies when passing data to a database system like MySQL using SQL or when passing command arguments to an external program. Failure to protect output properly is the leading cause of a class of security vulnerabilities known as SQL injection attacks. There are analogs for other output streams too. Sometimes the corruption of the output stream is mostly harmless, like when an unprotected comma is inserted into a CSV field in an informational spreadsheet. Other times, it can cause cascading failures or even allow clever attackers to obtain private data.

The advice: Always protect output, no matter where it is destined.
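A small sketch of protecting the same sort of data for different media (the values are illustrative; the database lines are shown as comments since they need a live connection):

```php
<?php
$name = '<Bobby> "Tables"';

// HTML: replace the special characters with entities.
$html = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');
// $html is now: &lt;Bobby&gt; &quot;Tables&quot;

// Shell: quote the whole argument so metacharacters are inert.
$dir = 'reports; rm -rf /';
$cmd = 'ls -l ' . escapeshellarg($dir);

// SQL: let the driver do the quoting, e.g. a PDO prepared statement:
//   $stmt = $pdo->prepare('SELECT id FROM users WHERE name = ?');
//   $stmt->execute(array($name));
```

Note that each medium needs its own protection; HTML escaping does nothing for SQL and vice versa.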

Use Correct Operators

This is more specific to PHP but there are similar situations in other languages. In PHP specifically, there are two equality and two inequality operators. One set does loose type handling and attempts to find some means to compare its operands, to the point of doing type conversions behind the scenes. The other set treats operands of different underlying types as unequal even if the apparent values are the same. The “==” and “!=” operators are the first set and “===” and “!==” are the second set. Using the former, the string “0” and the number 0 will compare as equal while with the latter they will not. This is important because many functions will return “false” on an error but some other type (like a number) on success. With the loose comparisons, “false” and “0” are equal; with the strict comparisons, they are not.

PHP also has a number of functions which can be used to identify NULL values, arrays, and so on, which can also be employed when the type of a value is important.

In most cases, the strict comparison operator is probably the better choice but the loose comparison can be useful. In short, write what you mean using the correct operators. Make sure you know exactly what the operator you choose is doing.
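The classic trap is strpos(), which returns the match offset on success and false on failure:

```php
<?php
// Offset 0 and false are indistinguishable under the loose comparison.
$pos = strpos('needle in haystack', 'needle');   // int(0): found at start

var_dump($pos == false);    // bool(true)  -- the loose test lies here
var_dump($pos === false);   // bool(false) -- the strict test is honest

if ($pos !== false) {
    // Correct: only reached when the needle really was found.
}
```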

Using Language Constructs

Like any programming language, PHP has a number of language constructs that are very useful but there are other ways that similar effects can be achieved. For a trivial example, consider the use of a long “if/elseif/elseif/else” structure comparing a single variable against a series of values. This can also be expressed using a “switch” statement. In this trivial example, either one is valid and the two are roughly equivalent, though the “switch” statement has a few features that might make it more useful in some circumstances. Likewise, a “for” loop can always be faked using “while”.

On the other hand, there are cases where an alternative is not equivalent. Consider the case of “include/require” vs. a function call. While the fact that you can include the same file in dozens of different places looks a lot like a function call, and can often be used for a similar effect, it is not the same thing. The included code runs in the same scope as the location of the include directive, for instance, which means that any variables in the including file might be scribbled over by the included file. Parameters also must be passed in variables and return values returned the same way. It is also not possible to use such a “function” recursively. On the other hand, an actual function call gains its own local variable scope, preventing the function from clobbering variables in the caller, and also has a formalized parameter list and return value. Furthermore, functions can be called recursively, which is also incredibly useful. Thus, it is important to use the right construct for the job. “include” is not the right construct to execute a chunk of code from random locations. (I have singled this particular one out because it shows up far too often in PHP code.)

The advice: use the right language construct for the job. This applies not only to things like “include” but also to things like objects. Creating an object to “encapsulate” a behaviour adequately described by a single function is just as silly as using “while” to simulate “for”.
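A tiny sketch of the scoping difference (the function and values are illustrative):

```php
<?php
// A function body gets its own variable scope, so it cannot scribble
// over the caller's variables the way included code can.
$total = 100;                 // caller's variable

function add_tax($amount) {
    $total = $amount * 1.05;  // a different, local $total
    return $total;
}

$taxed = add_tax(40);         // 42.0
// $total here is still 100. Code pulled in with include/require runs
// in this scope, so an assignment to $total there would overwrite it.
```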

Wrap Up

The preceding is, by no means, exhaustive. However, by following the above recommendations, it is possible to write reasonable PHP code. All it requires is a bit of discipline and an understanding of the language you are using.

I should note that this is not an apology for PHP but merely a set of suggestions to avoid writing bad code. Remember: just because PHP allows you to do something in a particularly unfortunate way, it does not mean that you have to do it that way. If it looks like a bad way to do things, look for a better way. Odds are pretty good you will find one.


Frameworks – Solution or Problem?

Frameworks are all the rage these days. Frameworks for building web sites. Frameworks for building applications. Frameworks for building databases. Frameworks for building frameworks. Okay, I made the last one up but I’m sure that sufficient noodling around the net will reveal at least fifty dozen attempts to do just that. But are frameworks really all they’re cracked up to be?

Continue reading “Frameworks – Solution or Problem?”

Cleaning Up an IT Mess

So you have a shiny new job managing the IT infrastructure for Acme Widgets Inc, a company that has been in business for decades. You have barely sat down at your desk when the telephone rings. You pick it up but before you can utter a canned greeting, you find yourself being berated for some random failure of something you haven’t had time to even learn about. You shrug and dive in and eventually manufacture a solution to the problem. No sooner have you done that than the telephone rings again with another irate user. And so passes your first day. And your second. And your third. And your thirtieth. And your enthusiasm. Continue reading “Cleaning Up an IT Mess”

Multiple Jobs to Multiple File Volumes in Bacula

I don’t normally go about making howto documents. However, in this case, I have decided to make an exception.

In the latest version of Bacula, it is possible to have multiple jobs operating on the same volume pool at the same time, even if that volume pool is a collection of disk files. This is particularly useful when utilizing the “Virtual Full Backup” functionality, which needs to read one volume and write another simultaneously. However, the documentation is sorely lacking on just how to do this. Furthermore, even google-fu failed to turn up a useful answer, instead turning up only people who ran into the deadlock problem; where solutions were suggested, it was always “use more than one pool”, which doesn’t actually solve the problem.

The deadlock problem arises because Bacula treats file volumes the same as tapes and, thus, requires a “drive” to read or write them. In the classic configuration of such, there is a single drive which can be either reading or writing a single volume at any given time. Thus, one deadlocks with a virtual backup and one can only do one real job at a time (leaving aside interleaving blocks on the volume which is less than ideal in most circumstances).

The solution to this problem is both elegant and simple, but it is not obvious. The solution is to use an autochanger. This is the only way to have multiple devices associated with the same volume pool. However, one cannot simply specify “use an autochanger for these file volumes”. It requires some additional futzing around, including creating a changer script/program to actually handle the changing. It also imposes a slightly different scheme on accessing file volumes. That is, instead of Bacula selecting the correct file from a directory somewhere, and pointing the storage definition at that directory, the changer now has to handle all that and the storage definition has to be pointed at a filename.

All this is best described by example. I will not provide the actual code for the changer program but rather describe what it must do. Suppose you have a set of volumes in a directory /backups/. Suppose you have ten volumes labelled VOL001 through VOL010. These volumes are all in the default volume pool.

First, you would set up the virtual changer. One could, for instance, set up the changer to use a directory structure, which is what will be described here. Suppose that structure is all under the directory /changer/. You will need a means of tracking the contents of the virtual slots and also tracking what is currently mounted in each virtual drive. So let’s create /changer/drives/ and /changer/slots/. Under the drives directory, we’ll create a folder for each virtual drive, say “0” and “1” for two drives. In each of those, we’ll create an empty “info” file which will store the name of the mounted volume (empty means nothing mounted), and a “tape” symbolic link will point at the mounted volume file when one is loaded.

In the slot directory, we’ll create a text file for each slot in the changer that is filled. The file will contain the volume file name, not including the path name. Any empty slot will simply not have a file for it. The name of each slot file is simply the slot number. Slots are numbered starting at one. So in this case, we might create a file “1” which contains “VOL001” and “2” which contains “VOL002” and so on.

Now a changer script needs to be created. This can be a simple shell script or it can be written in any programming language you prefer. In this instance, it will accept three parameters. The first is the command from Bacula. The second will be the drive number. The third will be the slot number. How these parameters get there is defined in the Bacula configuration files which are discussed later.

There are several commands Bacula will issue. The first is “list”, which provides a list of all slots with media in them in the form “<number>:<volume>”. The documentation for Bacula defines the second field as “barcode” but for our purposes, the volume name will do. This is one reason we store the volume name in the slot files. The output should be silent for slots that do not contain anything.

Another important command is “slots” which simply prints out the number of slots in the changer. This can be an arbitrary number larger than the number of volumes you are using. Even numbers like 500 or 1000 are fine here. It is harmless to set this higher than the number of volumes you have.

The meat of the changer, however, is the “load” and “unload” commands. The unload command can simply remove the “tape” symbolic link for the specified drive and empty the “info” file for it. The “load” command will need to work out the full path name of the volume file for the specified slot and point the “tape” symbolic link at it. It will also need to update the “info” file accordingly. Finally, the “loaded” command simply returns the slot number which is currently loaded into the specified drive, or “0” if none is.

As you can see, the changer program need not be terribly complex though it will need to know something about the volume storage location and so on. It should be easy enough to work out how to code one based on the above description. A word of advice, however: make the program fail cleanly. Bacula will notice failure exits and handle them accordingly, so make sure that you explicitly exit with success if you have a failure that is non-fatal.
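For concreteness, here is one possible sketch of such a changer in plain shell, following the layout described above. The logic is wrapped in a function purely for readability; a real script would simply act on its positional parameters. Note one deliberate simplification: this variant stores the loaded slot number in the “info” file rather than the volume name, since the slot number is what the “loaded” command must report.

```shell
#!/bin/sh
# Hypothetical Bacula changer sketch. Layout assumed:
#   $BASE/slots/<n>   - file containing the volume name for slot n
#   $BASE/drives/<d>/ - "tape" symlink to the mounted volume file,
#                       "info" file holding the loaded slot number
BASE=${CHANGER_BASE:-/changer}
VOLUMES=${VOLUME_DIR:-/backups}

changer() {
    cmd=$1; drive=$2; slot=$3
    drivedir="$BASE/drives/$drive"
    case "$cmd" in
        slots)
            echo 500 ;;                # anything >= the volume count works
        list)
            for f in "$BASE/slots"/*; do
                [ -f "$f" ] && echo "$(basename "$f"):$(cat "$f")"
            done ;;
        loaded)                        # slot in the drive, or 0 if empty
            if [ -s "$drivedir/info" ]; then
                cat "$drivedir/info"
            else
                echo 0
            fi ;;
        load)
            vol=$(cat "$BASE/slots/$slot") || return 1
            ln -sf "$VOLUMES/$vol" "$drivedir/tape"
            echo "$slot" > "$drivedir/info" ;;
        unload)
            rm -f "$drivedir/tape"
            : > "$drivedir/info" ;;    # empty info = nothing mounted
        *)
            return 1 ;;
    esac
    return 0
}

# In the real script, the last line would simply be:  changer "$@"
# so Bacula can invoke it as, e.g.:  /changer/script.sh load 0 3
```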

Now that we have the infrastructure worked out, it’s time to teach Bacula about it. This is actually the easy part, and believe it or not, can be done with an existing configuration with existing volumes!

First, in the storage daemon configuration, comment out your existing Device resource. Now add a new one, using the existing resource as a template (assuming it was already a File type resource). You will need to add “AutoChanger = yes” and “DriveIndex = N” to it. N is the drive number in the changer; if you have two drives in the changer, 0 is the first drive and 1 is the second drive. This is important. You will need to add an equivalent Device entry for each drive in the changer.

Also in the storage daemon configuration, you will need to add an “Autochanger” resource. Give it a name distinct from the Device resources. In this case, we’ll call it “DiskGroup”. Add a “Device =” entry for each drive in the changer. You can specify any old junk for the required option “ChangerDevice” since we will not be using it. The really important part is the “ChangerCommand” specification. Assuming you put your program in “/changer/script.sh”, you would add ‘ChangerCommand = "/changer/script.sh %o %d %S"’. The “%o” is the command Bacula wants to perform, “%d” is the drive index from the Device configuration, and “%S” (note the capital “S”!) is the slot number to operate on, starting at 1. This is where the parameters for the changer program come from.

Now, in the director configuration file, update your storage definition to refer to the device name for the changer (“DiskGroup” in this case). Also, add “AutoChanger = yes” to the definition. You may also wish to set “MaximumConcurrentJobs” to the number of virtual drives (2 in this case) in the changer.
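Putting the pieces together, the resources described above look roughly like this (names, address, and password are illustrative):

```
# bacula-sd.conf: one Device per virtual drive
Device {
  Name = FileChgr-Dev0
  DeviceType = File
  MediaType = File
  ArchiveDevice = /changer/drives/0/tape
  AutoChanger = yes
  DriveIndex = 0
  AutomaticMount = yes
  RemovableMedia = no
  RandomAccess = yes
}
Device {
  Name = FileChgr-Dev1
  DeviceType = File
  MediaType = File
  ArchiveDevice = /changer/drives/1/tape
  AutoChanger = yes
  DriveIndex = 1
  AutomaticMount = yes
  RemovableMedia = no
  RandomAccess = yes
}
Autochanger {
  Name = DiskGroup
  Device = FileChgr-Dev0, FileChgr-Dev1
  ChangerDevice = /dev/null            # required but unused here
  ChangerCommand = "/changer/script.sh %o %d %S"
}

# bacula-dir.conf: point the Storage resource at the changer
Storage {
  Name = Default
  Address = backup.example.com
  Password = "secret"
  Device = DiskGroup
  MediaType = File
  AutoChanger = yes
  MaximumConcurrentJobs = 2
}
```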

Now you can restart Bacula to make sure the configuration changes are noticed. Once done, bring up the console and run “update slots scan”. This will cause Bacula to automatically query the changer program to find out which volumes are in which slots and update the catalog appropriately. This is a critical step as it avoids having to manually update the information for each volume, which would be tedious and error prone with hundreds of volumes.

Now that’s everything. You can now run multiple jobs onto separate volumes in the same pool. Nothing else needs modification, which is a testament to Bacula’s design.

However, if you wish to do virtual backups with both the source and destination in the same pool, you will, of course, need to configure the pool correctly. If the pool is called “Default”, set “NextPool = Default”; if your storage entry is called “Default”, set “Storage = Default”. Both of those settings go in the Pool resource for “Default”. That’s it. There’s nothing more you need to do to make this work.

I’ll leave you with one final note: make sure to test your changer script manually before letting bacula loose on it. Then, once you’re sure it works, run a few small jobs in bacula to make sure it really is working. Run them manually in case intervention is required. You may need to fix your changer program or manually clean up a mess in the changer directory. Once that is working, keep an eye on the automatic stuff for a while to make sure it really is working!


The Ivory Tower Trap

As recent developments show, the current addressing scheme on the Internet (IPv4) is nearly depleted. There is, fortunately, a replacement scheme called IPv6 which should last some time. In fact, IPv6 has been around in one form or another since the early 1990s. The age of the original discussions and specifications may have some relevance to some of the current designs related to the IPv6 protocol. It seems apparent to a mildly interested observer that some of these design decisions have or will have unfortunate consequences. Continue reading “The Ivory Tower Trap”

IANA IPv4 Endgame Arrives

To borrow terminology, the IPv4 middlegame has ended from the perspective of the IANA. Officially, APNIC received two /8s on January 31 (or February 1, depending where you are. I’m in North America so January 31 it is). That leaves the IANA free pool at five /8s which are already spoken for as the result of a coordinated global policy which allocates one /8 to each RIR when the free pool reaches the number of RIRs, which happens to be five. We will very likely see the announcement of the allocation of the final five /8s very soon. That puts IANA into the endgame scenario for IPv4. No matter how much need or justification or other metric there is for a new allocation, there is nothing left to allocate. Continue reading “IANA IPv4 Endgame Arrives”