Writing Reasonable PHP

PHP gets ragged on a lot for various reasons. One of the biggest complaints I see is that PHP is “insecure”, as if writing bad code in PHP is somehow PHP’s fault. The other major complaint is aimed not so much at the core language as at the standard library and runtime environment, particularly the chaotic nature of the standard functions. Complaints about the latter have merit, but PHP is far from the only popular language with that problem. The former might have some merit, but it is just as ridiculous as blaming C because programmers write buffer overflows. It is not strictly PHP’s fault when programmers do stupid things. Granted, PHP makes a lot of stupid things very easy, and some of the early design decisions for the PHP runtime environment are questionable in hindsight, but writing sensible PHP code is not impossible or even especially difficult.

Types of PHP Code

Before I delve too far into the intricacies of PHP, let me touch on the types of coding that PHP can be used for.

PHP was designed (or evolved, really) as a means to enhance largely static web pages, filling the same niche as Microsoft’s Active Server Pages: it made adding a small amount of dynamic content to an otherwise largely static page easy. While this is still common today, it is no longer the primary use case. This origin is also the reason for a lot of the somewhat questionable design decisions in the runtime environment (such as the ever popular and justifiably maligned “register_globals” feature).

As it gained popularity, it began to edge out CGI scripts written in Perl or other languages. This was partly due to the complexity of dealing with CGI on most servers and partly due to the fact that PHP itself handled all of the boilerplate needed to deal with the CGI interface – primarily decoding script input. Thus, PHP scripts moved more toward being PHP code with HTML content embedded in it instead of HTML code with PHP embedded in it. Some of the more unfortunate design decisions were addressed at this point (during the 4.x series), including the “register_globals” problem, with the introduction of the “superglobal” arrays and a few other things. PHP also gained a sort of object orientation and a massive collection of “extensions”, many of which are bundled and/or enabled by default. This type of coding is the most common today – programs that are still intended to run in a web server environment but resemble the classic CGI script more than the classic “active page” model.

Finally, PHP gained a command line variant. With a few tweaks to the runtime environment, it became possible to write programs that do not depend on the presence of a web server or the CGI interface specification. Most of the historical runtime design issues do not apply to a command line PHP program. However, the source format remains the same including the PHP open/close tags.

A Sensible PHP Environment

A great deal of sanity can be obtained before a single PHP statement is written by setting up the environment in a sensible manner. Most of the features of PHP that are maligned (often justifiably) by critics can be turned off in the PHP configuration file. Notably, one should turn off register_globals, all magic quotes variants, register_long_arrays, allow_url_include, and allow_url_fopen. There are other settings that make sense to disable too, depending on which extensions you are using.
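For reference, the relevant php.ini lines might look something like this (directive names as they existed in the PHP 4/5 era; later versions removed some of them outright):

```ini
; php.ini: disable the historically troublesome conveniences
register_globals     = Off
register_long_arrays = Off
magic_quotes_gpc     = Off
magic_quotes_runtime = Off
magic_quotes_sybase  = Off
allow_url_include    = Off
allow_url_fopen      = Off
```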

It should be noted that disabling some of these settings makes coding less convenient. However, often the convenience comes at the cost of clarity or even security.

Writing PHP Code

Most of the recommendations here apply to all programming languages. Let me stress that. Writing good code requires discipline in any language.

Check Inputs

One of the biggest sources of problems with any program is failure to check input data. Anything input by a user must be viewed as suspect. After all, the user might be malicious or simply make an error. Relying on user input to be correct is never the right thing to do. Steps must be taken to ensure that bogus input data does not cause your program to misbehave. Inputs that cannot be handled should produce error conditions in a controlled manner.

Many programmers do grasp this concept intuitively. Input checking code is often present when handling direct user input. However, most overlook the simple fact that data coming from anywhere outside the program code itself must be treated as suspect. You cannot be certain that what you wrote to a data file is still in that file. It could have been corrupted by a hardware failure, user error, or the file could have been replaced with another type of file, all without your program being aware of it. The same applies to data stored in a database system like MySQL or in a session cache or a shared memory cache somewhere.
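As a rough sketch, checking a single query parameter might look like this. (validate_id is a hypothetical helper, not a standard function; the “id” parameter is invented for illustration.)

```php
<?php
// Sketch: treat request parameters as hostile until proven otherwise.
// validate_id() returns the id as an integer, or false when the input
// is not a plain string of digits.
function validate_id($raw)
{
    if (!is_string($raw) || !ctype_digit($raw)) {
        return false;
    }
    return (int)$raw;
}

$raw = isset($_GET['id']) ? $_GET['id'] : '';
if (($id = validate_id($raw)) === false) {
    // produce the error condition in a controlled manner, e.g. a 400 page
}
```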

The advice here: Verify everything. Failure to do so correctly is not a weakness in PHP but in the programmer. It is also the single largest source of security problems. Careful adherence to this principle will quickly yield much better code.

Check Returns

Closely related to the previous item, and high up on the list of programmer errors, is failing to check return values from function calls. Most library functions will have some sort of return value. For functions that can fail for whatever reason (bad parameters fed in, external state, etc.), it is absolutely critical to check for those failure conditions and handle them in a manner that is appropriate for your program. These conditions can be as simple as a data file being missing or as complicated as a remote socket connection timing out or the database server going away.

Study all function calls you use and make certain you understand what failure conditions exist. If a failure condition will cause your program to fail or otherwise misbehave, handle it. If a failure condition is impossible, it is doubly critical to handle it. That said, if a failure condition will not cause your program to misbehave or otherwise fail, it can be ignored, but make absolutely certain that is the case and document why.
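A minimal sketch of the idea, using a hypothetical read_config helper that refuses to plough on when a call fails:

```php
<?php
// Sketch: check the return value of every call that can fail.
function read_config($path)
{
    // The @ suppresses the warning; we handle the failure ourselves.
    $fh = @fopen($path, 'r');
    if ($fh === false) {
        return false;  // controlled failure: the caller decides what to do
    }
    $data = fread($fh, 8192);
    fclose($fh);
    if ($data === false) {
        return false;
    }
    return $data;
}
```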

The advice here: Always check return values.

Protect Output

This one is a lot less obvious and is best explained by example. Suppose you are outputting some text into an HTML document and you do not know in advance what characters that text contains. In HTML, some characters have special meanings (such as quotes) but are also valid in actual text. These special characters have to be protected in a medium appropriate way. In the HTML case, they would be replaced with appropriate entities. This is a common case in PHP programming but it is not the only one. The same applies when passing data to a database system like MySQL using SQL or when passing command arguments to an external program. Failure to protect output properly is the leading cause of a class of security vulnerabilities known as SQL injection attacks; the HTML equivalent is known as cross-site scripting. There are analogs for other output streams too. Sometimes the corruption of the output stream is mostly harmless, like when an unprotected comma is inserted into a CSV field in an informational spreadsheet. Other times, it can cause cascading failures or even allow clever attackers to obtain private data.
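A sketch of the principle: the same untrusted string needs different protection for each output medium it is destined for. (htmlspecialchars and escapeshellarg are real standard functions; the example string and the mysqli snippet in the comment are illustrative.)

```php
<?php
// Sketch: protect one untrusted value for several different media.
$name = "O'Brien <script>";

// HTML context: replace the special characters with entities.
$html = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

// Shell context: quote the entire value before building a command line.
$arg = escapeshellarg($name);

// SQL context: better than hand-escaping is a parameterized query, e.g.
// $stmt = $db->prepare('SELECT * FROM users WHERE name = ?');
```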

The advice: Always protect output, no matter where it is destined.

Use Correct Operators

This is more specific to PHP but there are similar situations in other languages. In PHP specifically, there are two equality and two inequality operators. One set does loose type handling and attempts to find some means to compare its operands, to the point of doing type conversions behind the scenes. The other set will fail if the underlying types of the two operands are different, even if the apparent values are the same. The “==” and “!=” operators are the first set and “===” and “!==” are the second set. Using the former, the string “0” and the number 0 will compare as equal while with the latter they will not. This is important because many functions will return “false” on an error but some other type (like a number) on success. With the loose comparisons, “false” and “0” are equal, but with the strict comparisons they are not.
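The classic trap involves functions like strpos, which can return 0 (a successful match at the very start of the string) or false (no match at all). A sketch:

```php
<?php
// Sketch: loose vs. strict comparison, and the classic strpos() trap.
var_dump("0" == 0);   // bool(true)  - types are converted, then compared
var_dump("0" === 0);  // bool(false) - differing types never compare equal

// strpos() returns the match offset (possibly 0) or false when absent.
$pos = strpos("needle in haystack", "needle");  // 0: match at the start
if ($pos == false) {
    // WRONG: taken here too, because 0 == false - the match is thrown away
}
if ($pos === false) {
    // RIGHT: only taken when the needle is genuinely absent
}
```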

PHP also has a number of functions which can be used to identify NULL values, arrays, and so on, which can also be employed when the type of a value is important.

In most cases, the strict comparison operator is probably the better choice but the loose comparison can be useful. In short, write what you mean using the correct operators. Make sure you know exactly what the operator you choose is doing.

Using Language Constructs

Like any programming language, PHP has a number of language constructs that are very useful, but there are other ways that similar effects can be achieved. For a trivial example, consider the use of a long “if/elseif/elseif/else” structure comparing a single variable against a series of values. This can also be expressed using a “switch” statement. In this trivial example, either one is valid and roughly equivalent, though the “switch” statement has a few features that might make it more useful in some circumstances. Likewise, a “for” loop can always be faked using “while”.

On the other hand, there are cases where an alternative is not equivalent. Consider the case of “include/require” vs. a function call. While the fact that you can include the same file in dozens of different places looks a lot like a function call, and can often be used for a similar effect, it is not the same thing. The included code runs in the same scope as the location of the include directive, for instance, which means that any variables in the including file might be scribbled over by the included file. Parameters also must be passed in variables and return values returned the same way. It is also not possible to use such a “function” recursively. On the other hand, an actual function call gains its own local variable scope, preventing the function from clobbering variables in the caller, and also has a formalized parameter list and return value. Furthermore, functions can be called recursively, which is also incredibly useful. Thus, it is important to use the right construct for the job. “include” is not the right construct to execute a chunk of code from random locations. (I have singled this particular one out because it shows up far too often in PHP code.)
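A sketch of the scoping difference (function and variable names invented for illustration):

```php
<?php
// Sketch: a function gets its own scope and a formal interface; code
// pulled in with include runs in the caller's scope and can clobber it.
function greet($name)
{
    $greeting = "Hello, $name";  // local to greet(); invisible outside
    return $greeting;
}

$greeting = "untouched";
$message  = greet("world");
// $message is "Hello, world"; $greeting out here is still "untouched".
```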

The advice: use the right language construct for the job. This applies not only to things like “include” but also to things like objects. Creating an object to “encapsulate” a behaviour adequately described by a single function is just as silly as using “while” to simulate “for”.

Wrap Up

The preceding is, by no means, exhaustive. However, by following the above recommendations, it is possible to write reasonable PHP code. All it requires is a bit of discipline and an understanding of the language you are using.

I should note that this is not an apology for PHP but merely a set of suggestions to avoid writing bad code. Remember: just because PHP allows you to do something in a particularly unfortunate way, that does not mean you have to do it that way. If it looks like a bad way to do things, look for a better way. Odds are pretty good you will find one.


Frameworks – Solution or Problem?

Frameworks are all the rage these days. Frameworks for building web sites. Frameworks for building applications. Frameworks for building databases. Frameworks for building frameworks. Okay, I made the last one up but I’m sure that sufficient noodling around the net will reveal at least fifty dozen attempts to do just that. But are frameworks really all they’re cracked up to be?


Cleaning Up an IT Mess

So you have a shiny new job managing the IT infrastructure for Acme Widgets Inc, a company that has been in business for decades. You have barely sat down at your desk when the telephone rings. You pick it up but before you can utter a canned greeting, you find yourself being berated for some random failure of something you haven’t had time to even learn about. You shrug and dive in and eventually manufacture a solution to the problem. No sooner have you done that than the telephone rings again with another irate user. And so passes your first day. And your second. And your third. And your thirtieth. And your enthusiasm.

Multiple Jobs to Multiple File Volumes in Bacula

I don’t normally go about making howto documents. However, in this case, I have decided to make an exception.

In the latest version of Bacula, it is possible to have multiple jobs operating on the same volume pool at the same time, even if that volume pool is a collection of disk files. This is particularly useful when utilizing the “Virtual Full Backup” functionality, which needs to read one volume and write another simultaneously. However, the documentation is sorely lacking in just how to do this. Furthermore, even google-fu failed to turn up a useful answer, instead turning up only people who had run into the deadlock problem; when solutions were suggested at all, they amounted to “use more than one pool”, which doesn’t actually solve the problem.

The deadlock problem arises because Bacula treats file volumes the same as tapes and, thus, requires a “drive” to read or write them. In the classic configuration, there is a single drive which can be either reading or writing a single volume at any given time. Thus, a virtual full backup deadlocks, and only one real job can run at a time (leaving aside interleaving blocks on the volume, which is less than ideal in most circumstances).

The solution to this problem is both elegant and simple, but it is not obvious. The solution is to use an autochanger. This is the only way to have multiple devices associated with the same volume pool. However, one cannot simply specify “use an autochanger for these file volumes”. It requires some additional futzing around, including creating a changer script/program to actually handle the changing. It also imposes a slightly different scheme on accessing file volumes. That is, instead of Bacula selecting the correct file from a directory somewhere, and pointing the storage definition at that directory, the changer now has to handle all that and the storage definition has to be pointed at a filename.

All this is best described by example. I will not provide the actual code for the changer program but rather describe what it must do. Suppose you have a set of volumes in a directory /backups/. Suppose you have ten volumes labelled VOL001 through VOL010. These volumes are all in the default volume pool.

First, you would set up the virtual changer. One could, for instance, set up the changer to use a directory structure, which is what will be described here. Suppose that structure is all under the directory /changer/. You will need a means of tracking the contents of the virtual slots and also tracking what is currently mounted in each virtual drive. So let’s create /changer/drives/ and /changer/slots/. Under the drives directory, we’ll create a folder for each virtual drive, say “0” and “1” for two drives. In each of those, we’ll create an empty “info” file which will store the name of the mounted volume (empty means nothing mounted), and this is also where the “tape” symbolic link pointing at the mounted volume file will live.

In the slot directory, we’ll create a text file for each slot in the changer that is filled. The file will contain the volume file name, not including the path name. Any empty slot will simply not have a file for it. The name of each slot file is simply the slot number. Slots are numbered starting at one. So in this case, we might create a file “1” which contains “VOL001” and “2” which contains “VOL002” and so on.

Now a changer script needs to be created. This can be a simple shell script or it can be written in any programming language you prefer. In this instance, it will accept three parameters. The first is the command from bacula. The second will be the drive number. The third will be the slot number. How these parameters get there is defined in the bacula configuration files which are discussed later.

There are several commands Bacula will issue. The first is “list” which provides a list of all slots with media in them in the form of “<number>:<volume>”. The documentation for Bacula defines the second field as “barcode” but for our purpose, the volume name will do. This is one reason we store the volume name in the slot files. The output should be silent for slots that do not contain anything.

Another important command is “slots” which simply prints out the number of slots in the changer. This can be an arbitrary number larger than the number of volumes you are using. Even numbers like 500 or 1000 are fine here. It is harmless to set this higher than the number of volumes you have.

The meat of the changer, however, is the “load” and “unload” commands. The unload command can simply remove the “tape” symbolic link for the specified drive and empty the “info” file for it. The “load” command will need to work out the full path name of the volume file for the specified slot and point the “tape” symbolic link at it. It will also need to update the “info” file accordingly. Finally, the “loaded” command simply returns the slot number which is currently loaded into the specified drive, or “0” if none is.
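Putting the pieces together, a minimal changer might look like the sketch below. This is written as a POSIX sh function for clarity (and the base directories are overridable for testing, which a real script would not need); the actual script would simply end with `changer "$@"`.

```shell
#!/bin/sh
# Sketch of the changer described above. Layout (paths are assumptions):
#   $CHANGER_BASE/slots/<N>    - contains the volume file name for slot N
#   $CHANGER_BASE/drives/<D>/  - "info" (mounted volume name) and "tape" link
#   $VOLUME_DIR/               - the volume files themselves

changer() {
    BASE=${CHANGER_BASE:-/changer}
    VOLS=${VOLUME_DIR:-/backups}
    cmd=$1
    drive=$2
    slot=$3

    case "$cmd" in
    list)
        # One "<slot>:<volume>" line per occupied slot; empty slots are silent.
        for f in "$BASE"/slots/*; do
            [ -f "$f" ] && echo "$(basename "$f"):$(cat "$f")"
        done
        ;;
    slots)
        echo 500    # any number >= the number of volumes will do
        ;;
    load)
        vol=$(cat "$BASE/slots/$slot") || return 1
        ln -sf "$VOLS/$vol" "$BASE/drives/$drive/tape"
        echo "$vol" > "$BASE/drives/$drive/info"
        ;;
    unload)
        rm -f "$BASE/drives/$drive/tape"
        : > "$BASE/drives/$drive/info"
        ;;
    loaded)
        # Print the slot whose volume is mounted in this drive, or 0 if none.
        vol=$(cat "$BASE/drives/$drive/info" 2>/dev/null)
        if [ -n "$vol" ]; then
            for f in "$BASE"/slots/*; do
                [ -f "$f" ] && [ "$(cat "$f")" = "$vol" ] && {
                    basename "$f"; return 0; }
            done
        fi
        echo 0
        ;;
    *)
        return 1
        ;;
    esac
}

# The real script would end with:  changer "$@"
```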

As you can see, the changer program need not be terribly complex, though it will need to know something about the volume storage location and so on. It should be easy enough to work out how to code one based on the above description. A word of advice, however: make the program fail cleanly. Bacula will notice failure exits and handle them accordingly, so make sure that you explicitly exit with success if you have a failure that is non-fatal.

Now that we have the infrastructure worked out, it’s time to teach Bacula about it. This is actually the easy part, and believe it or not, can be done with an existing configuration with existing volumes!

First, in the storage daemon configuration, comment out your existing Device resource. Now add a new one, using the existing resource as a template (assuming it was already a File type resource). You will need to add “AutoChanger = yes” and “DriveIndex = N” to it. N is the drive number in the changer. So if you have two drives in the changer, 0 is the first drive and 1 is the second drive. This is important. You will need to add an equivalent Device entry for each drive in the changer.

Also in the storage daemon configuration, you will need to add an “Autochanger” resource. Give it a name distinct from the Device resources. In this case, we’ll call it “DiskGroup”. Add a “Device =” entry for each drive in the changer. You can specify any old junk for the required option “ChangerDevice” since we will not be using it. The really important part is the specification for the changer command. Assuming you put your program in “/changer/script.sh”, you would add ‘ChangerCommand = “/changer/script.sh %o %d %S”’. The “%o” is the command Bacula wants to run. “%d” is the drive index from the Device configuration. “%S” (note the capital “S”!) is the slot number to operate on, starting at 1. This is where the parameters for the changer program come from.
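Putting the two previous paragraphs together, the storage daemon side might look roughly like this (resource names are illustrative, and only the changer-related directives are shown):

```
# bacula-sd.conf sketch
Device {
  Name = FileDrive-0
  MediaType = File
  ArchiveDevice = /changer/drives/0/tape
  AutoChanger = yes
  DriveIndex = 0
}
# ...plus an equivalent FileDrive-1 with DriveIndex = 1...

Autochanger {
  Name = DiskGroup
  Device = FileDrive-0, FileDrive-1
  ChangerDevice = /dev/null
  ChangerCommand = "/changer/script.sh %o %d %S"
}
```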

Now, in the director configuration file, update your storage definition to refer to the device name for the changer (“DiskGroup” in this case). Also, add “AutoChanger = yes” to the definition. You may also wish to set “MaximumConcurrentJobs” to the number of virtual drives (2 in this case) in the changer.

Now you can restart bacula to make sure the configuration changes are noticed. Once done, bring up the console and run “update slots scan”. This will cause bacula to automatically query the changer program to find out what volumes are in what slots and update the catalog appropriately. This is a critical step as it avoids having to manually update the information for each volume which would be tedious and error prone in the case of hundreds of volumes.

Now that’s everything. You can now do multiple jobs onto separate volumes in the same pool. Nothing else needs modification, which is a testament to bacula’s design.

However, if you wish to do virtual backups with both the source and destination in the same pool, you will, of course, need to configure the pool correctly. If the pool is called “Default”, set “NextPool = Default”. If your storage entry is called “Default”, then you would set “Storage = Default”. Both of those go in the Pool resource for “Default”. That’s it. There’s nothing more you need to do to make this work.
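In other words, the relevant bits of the Pool resource would look something like this sketch (other directives omitted):

```
# Director config sketch: a pool that feeds virtual full backups to itself
Pool {
  Name = Default
  PoolType = Backup
  Storage = Default
  NextPool = Default
}
```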

I’ll leave you with one final note: make sure to test your changer script manually before letting bacula loose on it. Then, once you’re sure it works, run a few small jobs in bacula to make sure it really is working. Run them manually in case intervention is required. You may need to fix your changer program or manually clean up a mess in the changer directory. Once that is working, keep an eye on the automatic stuff for a while to make sure it really is working!


The Ivory Tower Trap

As recent developments show, the current addressing scheme on the Internet (IPv4) is nearly depleted. There is, fortunately, a replacement scheme called IPv6 which should last some time. In fact, IPv6 has been around in one form or another since the early 1990s. The age of the original discussions and specifications may have some relevance to some of the current designs related to the IPv6 protocol. It seems apparent to a mildly interested observer that some of these design decisions have or will have unfortunate consequences.

IANA IPv4 Endgame Arrives

To borrow terminology, the IPv4 middlegame has ended from the perspective of the IANA. Officially, APNIC received two /8s on January 31 (or February 1, depending where you are. I’m in North America so January 31 it is). That leaves the IANA free pool at five /8s which are already spoken for as the result of a coordinated global policy which allocates one /8 to each RIR when the free pool reaches the number of RIRs, which happens to be five. We will very likely see the announcement of the allocation of the final five /8s very soon. That puts IANA into the endgame scenario for IPv4. No matter how much need or justification or other metric there is for a new allocation, there is nothing left to allocate.

Ditching Automake and Friends

Some time back, I discussed switching a personal project of mine to using the GNU autotools. At that time, I was overall quite pleased with the GNU build system. However, in the intervening time, I have had more experience with those tools and I have since revised my opinion.

There are two things that annoyed me about the automake/autoconf system. The first is that unless your entire project is in a single directory, it strongly pushes you toward using recursive make. As long as the various subdirectories have no interdependencies, this is, of course, no problem. However, dependency issues start rearing their heads very rapidly if there is even the tiniest interdependency. That, however, is not entirely automake’s fault. I could have insisted on a system that did not use recursive make.

The second, and by far the most annoying, problem is that managing all the autoconf and automake debris very quickly started taking more time than actually developing the project. In fact, nearly all the time involved in generating a release was spent making sure the correct files were included to keep autoconf, automake, and friends happy.

Now, I decided I needed to get the first issue above sorted out. That meant completely redoing all the automake stuff to eliminate recursion with make. However, upon reflection, I realized my project was gaining nearly nothing from the use of automake and friends. I was using gnulib to get some portability to Windows platforms but that brought a horrible amount of bloat to the distribution and substantially slowed down build times all so I could have the convenience of using a couple of gnu extensions to libc. And gnulib insisted on being invoked using a recursive make.

Okay, so the first step was to rip automake and autoconf out of the project and roll my own Makefile system. Well, I succeeded, with decent results. There are some things I could have done a lot better, but it works, so I’ll tweak it later when I feel like it, if I feel like it. This new system does not invoke make recursively. By doing that, I avoid the problem where editing a file in one directory should trigger rebuilds in other directories but goes unnoticed by make. And the little bit of icing on the cake is that it builds considerably faster.
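As a rough sketch (file and variable names invented), a non-recursive layout can be as simple as having each subdirectory contribute a makefile fragment that the top level includes, so make sees the entire dependency graph at once:

```make
# Top-level Makefile sketch for a non-recursive build
CC  ?= cc
EXE ?=          # set to .exe when cross-building for Windows

SRCS :=
include src/module.mk    # fragment does:  SRCS += src/main.c src/util.c
include lib/module.mk    # fragment does:  SRCS += lib/compat.c

OBJS := $(SRCS:.c=.o)

prog$(EXE): $(OBJS)
	$(CC) -o $@ $(OBJS)
```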

The next step was to eliminate reliance on GNU extensions. This took a bit more time because one of the extensions (argp) was non-trivial to replace. Having done so, however, yields source code that will build under a modern Linux variant as well as under the MinGW32 system.

The final bit was dealing with a few portability issues. Most notably, providing an easy way to select cross-development tools to build binaries and to handle the fact that Windows wants “.exe” on the end of executables. These turned out to be simple enough to orchestrate.

There are still a few bits that need to be added. Most notable is a scheme for packaging the entire thing up without any extraneous files. Now that the build system is sane, or at least comprehensible to a mere mortal, that should not prove to be much of a challenge. Indeed, it may even be possible to almost totally automate the process when the time comes!

There is, however, one downside of ditching autoconf: the familiar “configure”, “make”, “make install” process of building a software package no longer applies. Building the package on a system not anticipated by the Makefile may cause some pain to the builder. Then again, there is nothing that says the person encountering problems cannot simply solve them and either send a patch back to me or not as he chooses.

Overall, I think the disadvantages of not using something familiar like autoconf are far outweighed by the advantages of a simpler build system.