Writing cron jobs in PHP

Introduction

Where I used to work we had a web API that, while simple to use, could not be used by every client that wanted to integrate with our system. For these clients we sometimes offered an FTP solution that allowed them to upload files, which we then processed periodically using scripts. These scripts were written in PHP. This article assumes you are using some *nix variant that you can run PHP on, such as Linux or macOS.

Why PHP?

PHP is a natural choice for this job, much to the dismay of our system administrator and resident bash and awk guru. These scripts talked to the web API and manipulated CSV or XML files. PHP has the functionality to handle this readily available, and the programmers who maintained these scripts were generally 'webbies'.

The script

Writing a PHP script to be run over and over, without a web server or even a browser with someone looking at it, has its differences. You want the script to just do its work quietly, and not say anything unless something is wrong. If the period is short enough, you don't even want it logging the fact that it has run. We have one script that runs every 30 seconds!

Here I will go through making a short script that runs very often and works on larger files. This will demonstrate a few problems these scripts face, namely:

  • The client is only partially finished uploading the file via FTP.
  • The previous invocation of the script is still running.

Loading XML and processing it

So let's start with a simple script that reads any XML files it finds in a directory and prints out a particular XML element. Loading a CSV file is not too different, if a little messier, since you have to match column indices to fields.

#! /usr/bin/env php
<?php

libxml_use_internal_errors( true );

$in_dir = 'incoming/';

$files = glob( $in_dir . '*.xml' );

foreach( $files as $file ) {

    $xml = simplexml_load_file( $file );

    if( $xml === false ) {
        $errors = libxml_get_errors();

        foreach ($errors as $error) {
            print_r( $error );
        }

        libxml_clear_errors();

    } else {
        if( isset( $xml->name ) ) {
            echo "Name: " . $xml->name . "\n";
        }
    }
}
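
As an aside, the CSV equivalent of the processing step in the script above might look like the sketch below. The function name and the assumption that the name lives in column 1 are mine for illustration, not a fixed convention:

```php
<?php

// Sketch: print the "name" column of a CSV file.
// Column index 1 is an assumption - match indices to your own layout.
function printCsvNames( $filename ) {
    $fp = fopen( $filename, 'r' );
    if( !$fp ) {
        return;
    }
    while( ( $row = fgetcsv( $fp, 0, ',', '"', '\\' ) ) !== false ) {
        if( isset( $row[1] ) ) {
            echo "Name: " . $row[1] . "\n";
        }
    }
    fclose( $fp );
}
```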
The she-bang #!

The first line is known as a she-bang, or hashbang, amongst other names. It enables us to make the file executable and run it directly, rather than passing it as a command line argument to php. That is, you can run it like any other executable. You can do the same with any script on *nix, whether written in bash, Python, Lua or any number of interpreted languages.

To set a file as executable you use chmod.

chmod +x load_and_print.php

The she-bang line tells the shell what program to use to interpret the file. We have /usr/bin/env first for portability. The 'env' program is nearly always in /usr/bin, however the program you are running may be installed in different places on other machines. 'env' will use $PATH to find it for you.

Next we tell libxml not to print XML errors immediately, so we can control how they are reported. For now the reporting is very simple. Other than that, there is nothing here that most PHP programmers have not seen before.

We can test this out easily by running it from the command line, but we can also run it repeatedly using the 'watch' command. This command runs another program over and over so you can watch its output.

watch ./load_and_print.php

By default this will run the script every 2 seconds; use the -n option to change the interval (e.g. watch -n 20). macOS does not ship with watch, but it is available from package managers such as Homebrew.

This way you can leave it running for a while, and use another terminal to try dropping a file into the 'incoming' directory. If you do, you will notice that the output only changes when you add a file. We are "processing" each file, but redoing all of them on every run. We need to keep track of which files have been done.

The easiest way I found is to simply move the file to a different directory.

We can add the following to the script:

$out_dir = 'processed/';    // Just after the $in_dir line

$moved = $out_dir . basename( $file ); // At the beginning of the foreach loop
rename( $file, $moved );

$xml = simplexml_load_file( $moved ); // change $file to $moved

Renaming the file into another directory may give you the tiniest hint of a potential gotcha. What if the file you are working on is not completely there yet? This is very possible when the file is arriving over the Internet.

The rename will work fine even while the file is still being written to. The file name is only used to open a file; once it is open and you have a file handle, you don't need the name anymore. Having a file open does not stop another process (our script) from renaming it, and renaming is also how files are 'moved' when they stay on the same disk. So we cannot rely on the rename failing to tell us whether the file is ready to be processed.
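
A quick way to convince yourself of this is a script that plays both roles at once: it starts writing a file, renames it mid-write, and keeps writing through the original handle. The file names here are made up for the demonstration:

```php
<?php

$old = sys_get_temp_dir() . '/demo_upload.xml';
$new = sys_get_temp_dir() . '/demo_moved.xml';

// The "uploader" starts writing the file...
$fp = fopen( $old, 'w' );
fwrite( $fp, "<doc>" );

// ...our script renames ("moves") it mid-upload...
rename( $old, $new );

// ...and the uploader carries on, none the wiser.
fwrite( $fp, "<name>x</name></doc>" );
fclose( $fp );

// The moved file received both writes.
echo file_get_contents( $new ) . "\n";
```

The rename never fails, and the moved file silently ends up half-written if you process it too soon.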

Checking a file is ready to process

So how do we check if a file is open by another process? There is a *nix command called lsof (LiSt Open Files) that lists open files, and given a file name will list only that file if it is open.

We can add the following function:

function isFileOpen( $filename ) {
    $ret = exec( '/usr/bin/env lsof ' . escapeshellarg( $filename ) );
    return $ret != '';
}

The directory where lsof lives may differ on your particular system. macOS and Red Hat based systems (e.g. Fedora, CentOS) have it in /usr/sbin/, while Ubuntu, at least, has it in /usr/bin/. For this reason, using the env program, as we did for the first line of the script, is essential for portability. We now use this function to check whether each file is open, and if it is, we skip the file by continuing the loop. The file may be ready by the time the next run happens.
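
The guard slots into the processing loop like this; the fragment is self-contained for illustration, with the directory name matching the script above:

```php
<?php

// Skip files that some process still has open, using lsof.
// lsof prints nothing for a file no process has open.
function isFileOpen( $filename ) {
    $ret = exec( '/usr/bin/env lsof ' . escapeshellarg( $filename ) );
    return $ret != '';
}

foreach( glob( 'incoming/*.xml' ) as $file ) {
    if( isFileOpen( $file ) ) {
        continue;   // Still being uploaded - we'll get it next run
    }
    echo "Ready to process: " . $file . "\n";
}
```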

Making sure only one instance is running at a time

So what happens when the script starts but there is too much to do before it is scheduled to run again? We don't want two instances working on the same set of files. You could have the script move all the files into a working directory at the beginning and process them from there; that way the next run either finds no files to process, or a completely new set. Another method is to use a lock file. This is my preferred way, as it is only a few lines of code and means you will only ever have one instance of the script processing at a time, which prevents loading up your server.

Here's the code, to add to the beginning:

$lock_file = 'load_and_print.lock';

$fp_lock = fopen( $lock_file, 'w' );
if( !flock( $fp_lock, LOCK_EX | LOCK_NB ) ) {
    echo "Could not lock the lock file. An instance is already running.\n";
    die();
}

The lock will be released when the script exits. This way, if a particular instance is still busy processing, subsequent invocations simply exit, rather than banking up and possibly overloading the server.
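
You can see the non-blocking lock in action within a single script by opening the lock file twice. On *nix, each fopen gets its own open file description, so the second flock is refused while the first still holds the lock:

```php
<?php

$lock_file = sys_get_temp_dir() . '/demo.lock';

// First "instance" takes the lock...
$first = fopen( $lock_file, 'w' );
var_dump( flock( $first, LOCK_EX | LOCK_NB ) );   // bool(true)

// ...so a second "instance" is refused immediately,
// instead of blocking and waiting.
$second = fopen( $lock_file, 'w' );
var_dump( flock( $second, LOCK_EX | LOCK_NB ) );  // bool(false)

flock( $first, LOCK_UN );
fclose( $first );
fclose( $second );
```

LOCK_NB is what makes the second attempt fail fast rather than queue up behind the first, which is exactly the behaviour we want from overlapping cron runs.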

Here's the full script again:

#! /usr/bin/env php
<?php

$in_dir = 'incoming/';
$out_dir = 'processed/';
$lock_file = 'load_and_print.lock';

$fp_lock = fopen( $lock_file, 'w' );
if( !flock( $fp_lock, LOCK_EX | LOCK_NB ) ) {
    echo "Could not lock the lock file. An instance is already running.\n";
    die();
}

libxml_use_internal_errors( true );

function isFileOpen( $filename ) {
    $ret = exec( '/usr/bin/env lsof ' . escapeshellarg( $filename ) );
    return $ret != '';
}

$files = glob( $in_dir . '*.xml' );

foreach( $files as $file ) {

    if( isFileOpen( $file ) ) {
        continue;   // We'll possibly get this next time
    }

    $moved = $out_dir . basename( $file );
    rename( $file, $moved );

    $xml = simplexml_load_file( $moved );

    if( $xml === false ) {
        $errors = libxml_get_errors();

        foreach ($errors as $error) {
            print_r( $error );
        }

        libxml_clear_errors();

    } else {
        // Process the file...
        if( isset( $xml->name ) ) {
            echo "Name: " . $xml->name . "\n";
        }
    }
}

Installing the script as a cron job

To run the script periodically, perhaps the easiest method is to run it as a cron job. To do this you need to edit the list using crontab -e which will open the list of cron jobs in your default editor. Here you can add a line similar to the following:

*/1 * * * * cd /where/the/scripts/are; ./my_script.php

This will change directory and then run the script every minute. Changing the directory first allows you to use shorter relative file names within the script itself. See the documentation on cron for more information on how the schedule is specified; you can create all sorts of schedules. Should you want the script to run more often than once a minute, one technique is to do the following:

*/1 * * * * cd /where/the/scripts/are; ./my_script.php
*/1 * * * * sleep 30; cd /where/the/scripts/are; ./my_script.php

This will run the script every 30 seconds.
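
The five time fields (minute, hour, day of month, month, day of week) allow all sorts of schedules. A few examples, with a placeholder path:

```
# Every 5 minutes
*/5 * * * *     cd /where/the/scripts/are; ./my_script.php

# At 02:30 every day
30 2 * * *      cd /where/the/scripts/are; ./my_script.php

# Hourly, 9am to 5pm, weekdays only
0 9-17 * * 1-5  cd /where/the/scripts/are; ./my_script.php
```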