Notes on TheSchwartz

I've been playing around with SixApart's TheSchwartz for the last few days. TheSchwartz is a lightweight, reliable job queue, typically used for relatively high-latency jobs that you don't want to handle from a web process, e.g. sending out emails or placing orders into some external system. Basically, anything that involves interacting with something which might be down or slow, or whose result you don't really need right away.

Actually, TheSchwartz is a job queue library rather than a job queue system, so some assembly is required. Like most Danga/SixApart software, it's lightweight, performant, and well-designed, but also pretty light on documentation. If you're not comfortable reading the (Perl) source, it might be a challenging environment to set up.

Notes from the last few days:

  • Don't use the version on CPAN; get the latest code from Subversion instead. At the moment the CPAN version is 1.04, but current svn is at 1.07 and has some significant additional functionality.

  • Conceptually TheSchwartz is very simple - jobs with opaque function names and arguments are inserted into a database for workers with a particular 'ability'; workers periodically check the database for jobs matching the abilities they have, and grab and execute them. Jobs that succeed are marked completed and removed from the queue; jobs that fail are logged and left on the queue to be retried after some time period up to a configurable number of retries.

  • TheSchwartz has two kinds of clients - those that submit jobs, and workers that perform jobs. Both sides are considered clients, which is confusing if you're thinking in terms of client-server interaction.

  • There are three main classes to deal with: TheSchwartz, which is the main client functionality class; TheSchwartz::Job, which models the jobs that are submitted to the job queue; and TheSchwartz::Worker, which is a role-type class modelling a particular ability that a worker is able to perform.

  • New worker abilities are defined by subclassing TheSchwartz::Worker and defining your new functionality in a work() method. work() is called as a class method with the job object from the queue as its argument; it does its stuff and then marks the job as completed or failed. A useful real example worker is TheSchwartz::Worker::SendEmail (also by Brad Fitzpatrick, and available on CPAN), which sends emails from TheSchwartz.

  • Depending on your application, it may make sense for workers to just have a single ability, or for them to have multiple abilities and service more than one type of job. In the latter case, TheSchwartz tries to use unused abilities whenever it can to avoid certain kinds of jobs getting starved.

  • You can also subclass TheSchwartz itself to modify the standard functionality, and I've found that useful where I've wanted more visibility of what workers are doing than you get out of the box. You don't appear at this point to be able to subclass TheSchwartz::Job, however - TheSchwartz always uses this as the class when autovivifying jobs for workers.

  • There are a bunch of other features I haven't played with yet, including job priorities, the ability to coalesce jobs into groups to be processed together, and the ability to delay jobs until a certain time.
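
To make the queue semantics described in the list above concrete, here is a toy in-memory model in plain Perl. It is only a sketch of the concept, not TheSchwartz's API or implementation: the names new_job, work_once, and $MAX_RETRIES are made up for illustration, the real thing keeps its jobs in a database, and it waits for some time between retries rather than retrying on the very next pass.

```perl
#!/usr/bin/perl
# Toy in-memory model of the job queue semantics described above.
# NOT TheSchwartz's API or implementation; all names here are hypothetical.
use strict;
use warnings;

my $MAX_RETRIES = 2;    # assumed retry limit, for illustration only

# A job is an opaque function name plus arguments and a failure count.
sub new_job {
    my ($funcname, $arg) = @_;
    return { funcname => $funcname, arg => $arg, failures => 0 };
}

# One worker pass over the queue. $abilities maps function names to code
# refs; a handler signals failure by dying. Completed jobs are removed,
# failed jobs stay queued until they have used up their retries.
sub work_once {
    my ($queue, $abilities) = @_;
    my (@remaining, @log);
    for my $job (@$queue) {
        my $handler = $abilities->{ $job->{funcname} };
        if (!$handler) {                  # no matching ability: leave queued
            push @remaining, $job;
            next;
        }
        if (eval { $handler->($job->{arg}); 1 }) {
            push @log, "completed $job->{funcname}";
            next;                         # success: drop from the queue
        }
        $job->{failures}++;
        if ($job->{failures} > $MAX_RETRIES) {
            push @log, "gave up on $job->{funcname}";
        } else {
            push @log, "failed $job->{funcname}, will retry";
            push @remaining, $job;
        }
    }
    @$queue = @remaining;
    return @log;
}

# Example: one job type that works and one that always fails.
my @queue = (
    new_job('send_email', { to => 'someone@example.com' }),
    new_job('flaky',      {}),
);
my %abilities = (
    send_email => sub { 1 },
    flaky      => sub { die "service down\n" },
);
print "$_\n" for work_once(\@queue, \%abilities);
```

Calling work_once() repeatedly shows the retry behaviour: the failing job stays on the queue for the initial attempt plus $MAX_RETRIES retries, and is then dropped.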

I've actually been using it to set up a job queue system for a cluster, which is a slightly different application than it was intended for, but so far it's been working really well.

I still feel like I'm getting to grips with the breadth of things it could be used for, though - more experimentation required. I'd be interested in hearing examples of what other people are using it for as well.

Recommended.

Finding Core Perl Modules

I wasted 15 minutes the other day trying to remember how to do this, so here it is for the future: to find out if and when a perl module got added to the core, you want Richard Clamp's excellent Module::CoreList.

Recent versions have a 'corelist' frontend command, so I typically use that e.g.

$ corelist File::Basename
File::Basename  was first released with perl 5

$ corelist warnings
warnings  was first released with perl 5.006

$ corelist /^File::Spec/
File::Spec  was first released with perl 5.00405
File::Spec::Cygwin  was first released with perl 5.006002
File::Spec::Epoc  was first released with perl 5.006001
File::Spec::Functions  was first released with perl 5.00504
File::Spec::Mac  was first released with perl 5.00405
File::Spec::OS2  was first released with perl 5.00405
File::Spec::Unix  was first released with perl 5.00405
File::Spec::VMS  was first released with perl 5.00405
File::Spec::Win32  was first released with perl 5.00405

$ corelist URI::Escape
URI::Escape  was not in CORE (or so I think)
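
If you want the same information from inside a script, the module can be called directly. The following is a minimal sketch, assuming Module::CoreList is installed; core_status is a helper name I've made up here, and the messages simply mimic the corelist output shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Module::CoreList;

# Hypothetical helper: describe when a module first shipped with perl.
# first_release() gives back nothing if the module was never in core.
sub core_status {
    my ($module) = @_;
    my $release = Module::CoreList->first_release($module);
    return defined $release
        ? "$module was first released with perl $release"
        : "$module was not in CORE (or so I think)";
}

print core_status($_), "\n" for qw(File::Basename warnings URI::Escape);
```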

Comparing Directories

Saw this post fly past in the twitter stream today: http://linuxshellaccount.blogspot.com/2008/03/perl-directory-permissions-difference.html. It's a script by Mike Golvach to do something like a diff -r, but also showing differences in permissions and ownership, rather than just content.

I've written a CPAN module to do stuff like this - File::DirCompare - so thought I'd check how straightforward this would be using File::DirCompare:

#!/usr/bin/perl

use strict;
use File::Basename;
use File::DirCompare;
use File::Compare qw(compare);
use File::stat;

die "Usage: " . basename($0) . " dir1 dir2\n" unless @ARGV == 2;

my ($dir1, $dir2) = @ARGV;

File::DirCompare->compare($dir1, $dir2, sub {
  # $a and $b are the paired paths; one is unset if the entry is in only one tree
  my ($a, $b) = @_;
  if (! $b) {
    printf "Only in %s: %s\n", dirname($a), basename($a);
  } elsif (! $a) {
    printf "Only in %s: %s\n", dirname($b), basename($b);
  } else {
    my $stata = stat $a;
    my $statb = stat $b;

    # Return unless content, permissions, or ownership differ
    return unless compare($a, $b) != 0 ||
      $stata->mode != $statb->mode ||
      $stata->uid  != $statb->uid  ||
      $stata->gid  != $statb->gid;

    # Report permissions, name, owner, and group for each side
    printf "%04o %s %s %s\t\t%04o %s %s %s\n",
      $stata->mode & 07777, basename($a),
        (getpwuid($stata->uid))[0], (getgrgid($stata->gid))[0],
      $statb->mode & 07777, basename($b),
        (getpwuid($statb->uid))[0], (getgrgid($statb->gid))[0];
  }
# ignore_cmp: call the sub for every paired entry, not just those whose contents differ
}, { ignore_cmp => 1 });

This reports all entries that differ in content, permissions, or ownership. For example, given a tree like this (slightly modified from Mike's example):

$ ls -lR scripts1 scripts2
scripts1:
total 28
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script1
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script1.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script3
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script3.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:49 script4
scripts2:
total 28
-rw-r--r-- 1 gavin users 0 Mar 17 16:41 script1
-rw-r--r-- 1 gavin users 0 Mar 17 16:41 script1.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2.bak
-rwxr-xr-x 1 gavin gavin 0 Mar 17 16:41 script3*
-rwxr-xr-x 1 gavin gavin 0 Mar 17 16:41 script3.bak*
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:49 script5

it will give output like the following:

$ ./pdiff2 scripts1 scripts2
0644 script1 gavin gavin                0644 script1 gavin users
0644 script1.bak gavin gavin            0644 script1.bak gavin users
0644 script3 gavin gavin                0755 script3 gavin gavin
0644 script3.bak gavin gavin            0755 script3.bak gavin gavin
Only in scripts1: script4
Only in scripts2: script5

This obviously has dependencies that Mike's version doesn't have, but it comes out much shorter and clearer, I think. It also doesn't fork and parse an external ls, so it should be more portable and less fragile. I should probably be caching the getpwuid lookups too, but that would have made it 5 lines longer. ;-)
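
For what it's worth, the caching mentioned above might look something like the following. This is a minimal sketch rather than part of the script: username and groupname are hypothetical helper names, and falling back to the numeric id when a lookup fails is my own assumption about what would be useful.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (%user_cache, %group_cache);

# Map a uid to a username, looking each uid up only once.
# Falls back to the numeric uid if there is no passwd entry.
sub username {
    my ($uid) = @_;
    unless (exists $user_cache{$uid}) {
        my $name = getpwuid($uid);    # scalar context returns the name
        $user_cache{$uid} = defined $name ? $name : $uid;
    }
    return $user_cache{$uid};
}

# Same idea for gid to group name.
sub groupname {
    my ($gid) = @_;
    unless (exists $group_cache{$gid}) {
        my $name = getgrgid($gid);    # scalar context returns the name
        $group_cache{$gid} = defined $name ? $name : $gid;
    }
    return $group_cache{$gid};
}

# Example: the real uid and gid of the current process.
printf "%s %s\n", username($<), groupname($( + 0);
```

In the report above, username($stata->uid) and groupname($stata->gid) would then stand in for the (getpwuid(...))[0] and (getgrgid(...))[0] expressions.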