Parallelising With Perl

Did a talk at the Sydney Perl Mongers group on Tuesday night, called "Parallelising with Perl", covering AnyEvent, MCE, and GNU Parallel.

Slides

CSV wrangling with App::CCSV

Well past time to get back on the blogging horse.

I'm now working on a big data web mining startup, and spending an inordinate amount of time buried in large data files, often some variant of CSV.

My favourite new tool over the last few months is Karlheinz Zoechling's App::CCSV perl module, which lets you do some really powerful CSV processing using perl one-liners, instead of having to write a trivial throwaway script.

If you're familiar with perl's standard autosplit functionality (perl -a) then App::CCSV will look pretty similar - it autosplits its input into an array on your CSV delimiters for further processing. It handles embedded delimiters and CSV quoting conventions correctly, though, which perl's standard autosplitting doesn't.
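
To see why quote-aware splitting matters, here's a minimal sketch comparing a naive split with a quote-aware one. It uses the core Text::ParseWords module purely as a stand-in for illustration (it is not a claim about how App::CCSV works internally), and the sample line is made up:

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical sample line with a comma embedded in a quoted field
my $line = 'widget,"Sydney, NSW",42';

# Naive split on the delimiter: the quoted comma splits one field in two
my @naive = split /,/, $line;

# Quote-aware split: the embedded comma stays inside its field
my @aware = parse_line(',', 0, $line);

printf "naive: %d fields, quote-aware: %d fields\n",
    scalar @naive, scalar @aware;
```

The naive version sees four fields here, the quote-aware version three, which is the difference that bites you with plain perl -a style splitting on real-world CSV.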

App::CCSV uses @f to hold the autosplit fields, and provides utility functions csay and cprint for doing say and print on the CSV-joins of your array. So for example:

# Print just the first 3 fields of your file
perl -MApp::CCSV -ne 'csay @f[0..2]' < file.csv

# Print only lines where the second field is 'Y' or 'T'
perl -MApp::CCSV -ne 'csay @f if $f[1] =~ /^[YT]$/' < file.csv

# Print the CSV header and all lines where field 3 is negative
perl -MApp::CCSV -ne 'csay @f if $. == 1 || ($f[2]||0) < 0' < file.csv

# Insert a new country code field after the first field
perl -MApp::CCSV -ne '$cc = get_country_code($f[0]); csay $f[0],$cc,@f[1..$#f]' < file.csv

App::CCSV can use a config file to handle different kinds of CSV input. Here's what I'm using, which lives in my home directory in ~/.CCSVConf:

<CCSV>
sep_char ,
quote_char """
<names>
  <comma>
    sep_char ","
    quote_char """
  </comma>
  <tabs>
    sep_char "  "
    quote_char """
  </tabs>
  <pipe>
    sep_char "|"
    quote_char """
  </pipe>
  <commanq>
    sep_char ","
    quote_char ""
  </commanq>
  <tabsnq>
    sep_char "  "
    quote_char ""
  </tabsnq>
  <pipenq>
    sep_char "|"
    quote_char ""
  </pipenq>
</names>
</CCSV>

That just defines two sets of names for different kinds of input: comma, tabs, and pipe for [,\t|] delimiters with standard CSV quote conventions; and three nq ("no-quote") variants - commanq, tabsnq, and pipenq - to handle inputs that aren't using standard CSV quoting. It also makes the comma behaviour the default.

You use one of the names by specifying it when loading the module, after an =:

perl -MApp::CCSV=comma ...
perl -MApp::CCSV=tabs ...
perl -MApp::CCSV=pipe ...

You can also convert between formats by specifying two names, in <input>,<output> format e.g.

perl -MApp::CCSV=comma,pipe ...
perl -MApp::CCSV=tabs,comma ...
perl -MApp::CCSV=pipe,tabs ...

And just to round things off, I have a few aliases defined in my bashrc file to make these even easier to use:

alias perlcsv='perl -CSAD -MApp::CCSV'
alias perlpsv='perl -CSAD -MApp::CCSV=pipe'
alias perltsv='perl -CSAD -MApp::CCSV=tabs'
alias perlcsvnq='perl -CSAD -MApp::CCSV=commanq'
alias perlpsvnq='perl -CSAD -MApp::CCSV=pipenq'
alias perltsvnq='perl -CSAD -MApp::CCSV=tabsnq'

That simplifies my standard invocation to something like:

perlcsv -ne 'csay @f[0..2]' < file.csv

Happy data wrangling!

Parallel Processing Perl Modules

Needed to parallelise some processing in perl the last few days, and did a quick survey of some of the parallel processing modules on CPAN, of which there is the normal bewildering diversity.

As usual, it depends exactly what you're trying to do. In my case I just needed to be able to fork a bunch of processes off, have them process some data, and hand the results back to the parent.

So here are my notes on a random selection of the available modules. The example each time is basically a parallel version of the following map:

my %out = map { $_ => $_ ** 2 } 1 .. 50;
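
For comparison with the modules below, here's a rough baseline using only core perl (fork, pipe, and Storable). It's a sketch rather than a recommendation: it forks one child per item with no cap on concurrency, which the modules handle for you, and the variable names are just mine.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my @num = 1 .. 50;
my %out;

# For brevity this forks one child per item, with no limit on concurrency
my @readers;
for my $num (@num) {
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: serialise the result, write it to the pipe, and exit
        close $reader;
        binmode $writer;
        print {$writer} freeze([ $num, $num ** 2 ]);
        close $writer;
        exit 0;
    }

    # Parent: keep the read end for later
    close $writer;
    push @readers, $reader;
}

for my $reader (@readers) {
    binmode $reader;
    my $frozen = do { local $/; <$reader> };   # slurp: frozen data is binary
    my $entry  = thaw($frozen);
    $out{ $entry->[0] } = $entry->[1];
    close $reader;
}

1 while wait() != -1;   # reap the children
```

Most of what the modules add on top of this is bookkeeping: limiting the number of concurrent children, reaping them, and hiding the serialisation.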

Parallel::ForkManager

Object oriented wrapper around 'fork'. Supports parent callbacks. Passing data back to parent uses files, and feels a little bit clunky. Dependencies: none.

use Parallel::ForkManager 0.7.6;

my @num = 1 .. 50;

my $pm = Parallel::ForkManager->new(5);

my %out;
$pm->run_on_finish(sub {    # must be declared before first 'start'
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $out{ $data->[0] } = $data->[1];
});

for my $num (@num) {
    $pm->start and next;   # Parent nexts

    # Child
    my $sq = $num ** 2;

    $pm->finish(0, [ $num, $sq ]);   # Child exits
}
$pm->wait_all_children;

[Version 0.7.9]

Parallel::Iterator

Basically a parallel version of 'map'. Dependencies: none.

use Parallel::Iterator qw(iterate);

my @num = 1 .. 50;

my $it = iterate( sub {
    # sub is a closure, return outputs
    my ($id, $num) = @_;
    return $num ** 2;
}, \@num );

my %out = ();
while (my ($idx, $square) = $it->()) {
  # for an array input the iterator returns the array index, not the value
  $out{ $num[$idx] } = $square;
}

[Version 1.00]

Parallel::Loops

Provides parallel versions of 'foreach' and 'while'. It uses 'tie' to allow shared data structures between the parent and children. Dependencies: Parallel::ForkManager.

use Parallel::Loops;

my @num = 1 .. 50;

my $pl = Parallel::Loops->new(5);

my %out;
$pl->share(\%out);

$pl->foreach( \@num, sub {
    my $num = $_;           # note this uses $_, not @_
    $out{$num} = $num ** 2;
});

You can also return values from the subroutine like Iterator, avoiding the explicit 'share':

my %out = $pl->foreach( \@num, sub {
    my $num = $_;           # note this uses $_, not @_
    return ( $num, $num ** 2 );
});

[Version 0.03]

Proc::Fork

Provides an interesting perlish forking interface using blocks. No built-in support for returning data from children, but provides examples using pipes. Dependencies: Exporter::Tidy.

use Proc::Fork;
use IO::Pipe;
use Storable qw(freeze thaw);

my @num = 1 .. 50;
my @children;

for my $num (@num) {
    my $pipe = IO::Pipe->new;

    run_fork{ child {
        # Child
        $pipe->writer;
        print $pipe freeze([ $num, $num ** 2 ]);
        exit;
    } };

    # Parent
    $pipe->reader;
    push @children, $pipe;
}

my %out;
for my $pipe (@children) {
    # slurp the whole pipe: frozen data is binary and may contain newlines
    my $entry = thaw( do { local $/; <$pipe> } );
    $out{ $entry->[0] } = $entry->[1];
}

[Version 0.71]

Parallel::Prefork

Like Parallel::ForkManager, but adds better signal handling. Doesn't seem to provide built-in support for returning data from children. Dependencies: Proc::Wait3.

[Version 0.08]

Parallel::Forker

More complex module, loosely based on ForkManager (?). Includes better signal handling, and supports scheduling and dependencies between different groups of subprocesses. Doesn't appear to provide built-in support for passing data back from children.

[Version 1.232]

Currency On-Screen Display

Here's a quick hack demonstrating a nice juxtaposition between the power of a CPAN module - in this case Christopher Laco's Finance::Currency::Convert::WebserviceX - and the elegance and utility of the little known osd_cat, putting together a desktop currency rates widget in a handful of lines:

#!/usr/bin/perl

use strict;
use IO::File;
use Finance::Currency::Convert::WebserviceX;

# Configuration
my @currencies = map { uc } (@ARGV ? @ARGV : qw(USD GBP));
my $base_currency = 'AUD';
my $refresh = 300;   # seconds
my $font = '9x15bold';
# X colours: http://sedition.com/perl/rgb.html
my $colour = 'goldenrod3';
my $align = 'right';
my $pos = 'top';
my $offset = 25;

my $lines = scalar @currencies;
my $osd_refresh = $refresh + 1;
my $osd = IO::File->new(
  "|osd_cat -l $lines -d $osd_refresh -c '$colour' -f $font -p $pos -A $align -o $offset"
) or die "can't open to osd_cat $!";
$osd->autoflush(1);
local $SIG{PIPE} = sub { die "pipe failed: $!" };

my $cc = Finance::Currency::Convert::WebserviceX->new;

while (1) {
  my $output = '';
  $output .= "$_ " . $cc->convert(1, $base_currency, $_) . "\n" for @currencies;
  $osd->print($output);
  sleep $refresh;
}

Most of this is just housekeeping around splitting out various osd_cat options for tweaking, and allowing the set of currencies to display to be passed in as arguments. I haven't bothered setting up any option handling in order to keep the example short, but that would be straightforward.

To use, you just run from the command line in the background:

./currency_osd &

and it shows up in the top right corner of your screen.

Tweak to taste, of course.

The Joy of Scripting

Was going home on the train with Hannah (8) this afternoon, and she says, "Dad, what's the longest word you can make without using any letters with tails or stalks?". "Do you really want to know?", I asked, and whipping out the trusty laptop, we had an answer within a couple of train stops:

egrep -v '[A-Zbdfghjklpqty]' /usr/share/dict/words | \
perl -nle 'chomp; push @words, $_;
  END { @words = sort { length($b) <=> length($a) } @words;
        print join "\n", @words[0 .. 9] }'

noncarnivorousness
nonceremoniousness
overcensoriousness
carnivorousnesses
noncensoriousness
nonsuccessiveness
overconsciousness
semiconsciousness
unacrimoniousness
uncarnivorousness
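
If you don't have a words file handy, the same idea works as a small pure-perl function. This is just a sketch with a made-up sample list; the function name is mine, and it breaks length ties alphabetically, which the one-liner above doesn't promise to do.

```perl
use strict;
use warnings;

# Return the $count longest words with no capitals, tails, or stalks
sub longest_flat_words {
    my ($count, @words) = @_;
    my @flat   = grep { !/[A-Zbdfghjklpqty]/ } @words;
    my @sorted = sort { length($b) <=> length($a) or $a cmp $b } @flat;
    $count = @sorted if $count > @sorted;
    return @sorted[0 .. $count - 1];
}

# Hypothetical sample list standing in for /usr/share/dict/words
my @sample = qw(nonsuccessiveness carnivorous Sydney perl consciousness ace);
print "$_\n" for longest_flat_words(3, @sample);
```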

Now I just need to teach her how to do that.

CSS and Javascript Minification

I've been playing with the very nice YSlow firefox plugin recently, while doing some front-end optimisation on a Catalyst web project.

Most of YSlow's tuning tips were reasonably straightforward, but I wasn't sure how to approach the concatenation and minification of CSS and javascript files that they recommend.

Turns out - as is often the case - there's a very nice packaged solution on CPAN.

The File::Assets module provides concatenation and minification for CSS and Javascript 'assets' for a web page, using the CSS::Minifier (::XS) and JavaScript::Minifier (::XS) modules for minification. To use, you add a series of .css and .js files in building your page, and then 'export' them at the end, which generates a concatenated and minified version of each type in an export directory, and an appropriate link to the exported version. You can do separate exports for CSS and Javascript if you want to follow the Yahoo/YSlow recommendation of putting your stylesheets at the top and your scripts at the bottom.

There's also a Catalyst::Plugin::Assets module to facilitate using File::Assets from Catalyst.

I use Mason for my Catalyst views (I prefer using perl in my views rather than having another mini-language to learn) and so use this as follows.

First, you have to configure Catalyst::Plugin::Assets in your project config file (e.g. $PROJECT_HOME/project.yml):

Plugin::Assets:
    path: /static
    output_path: build/
    minify: 1

Next, I set the per-page javascript and css files I want to include as mason page attributes in my views (using an arrayref if there's more than one item of the given type) e.g.

%# in my person view
<%attr>
js => [ 'jquery.color.js', 'person.js' ]
css => 'person.css'
</%attr>

Then in my top-level autohandler, I include both global and per-page assets like this:

<%init>
# Asset collation, javascript (globals, then per-page)
$c->assets->include('js/jquery.min.js');
$c->assets->include('js/global.js');
if (my $js = $m->request_comp->attr_if_exists('js')) {
  if (ref $js && ref $js eq 'ARRAY') {
    $c->assets->include("js/$_") foreach @$js;
  } else {
    $c->assets->include("js/$js");
  }
}
# The CSS version is left as an exercise for the reader ...
# ...
</%init>

Then, elsewhere in the autohandler, you add an exported link at the appropriate point in the page:

<% $c->assets->export('text/javascript') %>

This generates a link something like the following (wrapped here):

<script src="http://www.example.com/static/build/assets-ec556d1e.js"
  type="text/javascript"></script>

Beautiful, easy, maintainable.

Catalyst + Screen

I'm an old-school developer, doing all my hacking using terms, the command line, and vim, not a heavyweight IDE. Hacking perl Catalyst projects (and I imagine other MVC-type frameworks) can be slightly more challenging in this kind of environment because of the widely-branching directory structure. A single conceptual change can easily touch controller classes, model classes, view templates, and static javascript or css files, for instance.

I've found GNU screen to work really well in this environment. I use per-project screen sessions set up specifically for Catalyst - for my 'usercss' project, for instance, I have a ~/.screenrc-usercss config that looks like this:

source $HOME/.screenrc
setenv PROJDIR ~/work/usercss
setenv PROJ UserCSS
screen -t home
stuff "cd ~^Mclear^M"
screen -t top
stuff "cd $PROJDIR^Mclear^M"
screen -t lib
stuff "cd $PROJDIR/lib/$PROJ^Mclear^M"
screen -t controller
stuff "cd $PROJDIR/lib/$PROJ/Controller^Mclear^M"
screen -t schema
stuff "cd $PROJDIR/lib/$PROJ/Schema/Result^Mclear^M"
screen -t htdocs
stuff "cd $PROJDIR/root/htdocs^Mclear^M"
screen -t static
stuff "cd $PROJDIR/root/static^Mclear^M"
screen -t sql
stuff "cd $PROJDIR^Mclear^M"
select 0

(the ^M sequences there are actual Ctrl-M carriage return characters).

So a:

screen -c ~/.screenrc-usercss

will give me a set of eight labelled screen windows: home, top, lib, controller, schema, htdocs, static, and sql. I usually run a couple of these in separate terms, like this:

[dual-screen screenshot]

To make this completely brainless, I also have the following bash function defined in my ~/.bashrc file:

sc ()
{
  SC_SESSION=$(screen -ls | egrep -e "\.$1.*Detached" | \
    awk '{ print $1 }' | head -1);
  if [ -n "$SC_SESSION" ]; then
    xtitle $1;
    screen -R $SC_SESSION;
  elif [ -f ~/.screenrc-$1 ]; then
    xtitle $1;
    screen -S $1 -c ~/.screenrc-$1
  else
    echo "Unknown session type '$1'!"
  fi
}

which lets me just do sc usercss, which reattaches to the first detached 'usercss' screen session, if one is available, or starts up a new one.

Fast, flexible, lightweight. Choose any 3.

Notes on TheSchwartz

I've been playing around with SixApart's TheSchwartz for the last few days. TheSchwartz is a lightweight reliable job queue, typically used for handling relatively high latency jobs that you don't want to try and handle from a web process e.g. for sending out emails, placing orders into some external system, etc. Basically interacting with anything which might be down or slow or which you don't really need right away.

Actually, TheSchwartz is a job queue library rather than a job queue system, so some assembly is required. Like most Danga/SixApart software, it's lightweight, performant, and well-designed, but also pretty light on documentation. If you're not comfortable reading the (perl) source, it might be a challenging environment to set up.

Notes from the last few days:

  • Don't use the version on CPAN, get the latest code from subversion instead. At the moment the CPAN version is 1.04, but current svn is at 1.07, and has some significant additional functionality.

  • Conceptually TheSchwartz is very simple - jobs with opaque function names and arguments are inserted into a database for workers with a particular 'ability'; workers periodically check the database for jobs matching the abilities they have, and grab and execute them. Jobs that succeed are marked completed and removed from the queue; jobs that fail are logged and left on the queue to be retried after some time period up to a configurable number of retries.

  • TheSchwartz has two kinds of clients - those that submit jobs, and workers that perform jobs. Both are considered clients, which is confusing if you're thinking in terms of client-server interaction.

  • There are three main classes to deal with: TheSchwartz, which is the main client functionality class; TheSchwartz::Job, which models the jobs that are submitted to the job queue; and TheSchwartz::Worker, which is a role-type class modelling a particular ability that a worker is able to perform.

  • New worker abilities are defined by subclassing TheSchwartz::Worker and defining your new functionality in a work() method. work() receives the job object from the queue as its only argument and does its stuff, marking the job as completed or failed after processing. A useful real example worker is TheSchwartz::Worker::SendEmail (also by Brad Fitzpatrick, and available on CPAN) for sending emails from TheSchwartz.

  • Depending on your application, it may make sense for workers to just have a single ability, or for them to have multiple abilities and service more than one type of job. In the latter case, TheSchwartz tries to use unused abilities whenever it can to avoid certain kinds of jobs getting starved.

  • You can also subclass TheSchwartz itself to modify the standard functionality, and I've found that useful where I've wanted more visibility of what workers are doing than you get out of the box. You don't appear at this point to be able to subclass TheSchwartz::Job however - TheSchwartz always uses this as the class when autovivifying jobs for workers.

  • There are a bunch of other features I haven't played with yet, including job priorities, the ability to coalesce jobs into groups to be processed together, and the ability to delay jobs until a certain time.
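
To make the job flow described in the notes above a bit more concrete, here's a toy in-memory model of it. To be clear, this is not TheSchwartz's API: the function names, the retry limit, and the in-memory "queue" are all made up for illustration, and the real thing uses a database with workers polling it.

```perl
use strict;
use warnings;

# Toy in-memory model of the flow; NOT TheSchwartz's API.
my $max_retries = 2;      # hypothetical retry limit
my @queue;                # stands in for the job table
my (@done, @failed);

# Submit a job: an opaque function name plus an argument
sub insert_job {
    my ($funcname, $arg) = @_;
    push @queue, { funcname => $funcname, arg => $arg, failures => 0 };
}

# A worker's abilities: function name => code to run
my $attempts  = 0;
my %abilities = (
    square => sub { return $_[0] ** 2 },
    # fails on the first attempt, succeeds on the second
    flaky  => sub { die "temporary failure\n" if ++$attempts < 2; return 'ok' },
);

# Grab and run jobs matching this worker's abilities until none remain
sub work_until_done {
    while (my $job = shift @queue) {
        my $code = $abilities{ $job->{funcname} };
        if (!$code) {
            push @failed, $job;    # toy simplification: set aside unknown jobs
            next;
        }
        my $result;
        my $ok = eval { $result = $code->($job->{arg}); 1 };
        if ($ok) {
            push @done, { %$job, result => $result };   # completed
        } elsif (++$job->{failures} <= $max_retries) {
            push @queue, $job;                          # failed: retry later
        } else {
            push @failed, $job;                         # retries exhausted
        }
    }
}

insert_job(square => 7);
insert_job(flaky  => undef);
work_until_done();
printf "%s => %s\n", $_->{funcname}, $_->{result} for @done;
```

The main thing the sketch shows is that submitters and workers only share a function name and some arguments, which is why both sides end up being "clients" of the queue.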

I've actually been using it to set up a job queue system for a cluster, which is a slightly different application than it was intended for, but so far it's been working really well.

I feel like I'm still getting to grips with the breadth of things it could be used for though - more experimentation required. I'd be interested in hearing of examples of what people are using it for as well.

Recommended.

Finding Core Perl Modules

I wasted 15 minutes the other day trying to remember how to do this, so here it is for the future: to find out if and when a perl module got added to the core, you want Richard Clamp's excellent Module::CoreList.

Recent versions have a 'corelist' frontend command, so I typically use that e.g.

$ corelist File::Basename
File::Basename  was first released with perl 5

$ corelist warnings
warnings  was first released with perl 5.006

$ corelist /^File::Spec/
File::Spec  was first released with perl 5.00405
File::Spec::Cygwin  was first released with perl 5.006002
File::Spec::Epoc  was first released with perl 5.006001
File::Spec::Functions  was first released with perl 5.00504
File::Spec::Mac  was first released with perl 5.00405
File::Spec::OS2  was first released with perl 5.00405
File::Spec::Unix  was first released with perl 5.00405
File::Spec::VMS  was first released with perl 5.00405
File::Spec::Win32  was first released with perl 5.00405

$ corelist URI::Escape
URI::Escape  was not in CORE (or so I think)
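
You can also ask the same question from inside a script. A minimal sketch using the module's first_release method (the list of modules to check is just an example):

```perl
use strict;
use warnings;
use Module::CoreList;

for my $module (qw(File::Basename warnings URI::Escape)) {
    # first_release returns undef for modules that were never in core
    my $release = Module::CoreList->first_release($module);
    printf "%-15s %s\n", $module,
        defined $release ? "first released with perl $release" : "not in core";
}
```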

Comparing Directories

Saw this post fly past in the twitter stream today: http://linuxshellaccount.blogspot.com/2008/03/perl-directory-permissions-difference.html. It's a script by Mike Golvach to do something like a `diff -r`, but also showing differences in permissions and ownership, rather than just content.

I've written a CPAN module to do stuff like this - File::DirCompare - so thought I'd check how straightforward this would be using File::DirCompare:

#!/usr/bin/perl

use strict;
use File::Basename;
use File::DirCompare;
use File::Compare qw(compare);
use File::stat;

die "Usage: " . basename($0) . " dir1 dir2\n" unless @ARGV == 2;

my ($dir1, $dir2) = @ARGV;

File::DirCompare->compare($dir1, $dir2, sub {
  my ($a, $b) = @_;
  if (! $b) {
    printf "Only in %s: %s\n", dirname($a), basename($a);
  } elsif (! $a) {
    printf "Only in %s: %s\n", dirname($b), basename($b);
  } else {
    my $stata = stat $a;
    my $statb = stat $b;

    # Return unless different
    return unless compare($a, $b) != 0 ||
      $stata->mode != $statb->mode ||
      $stata->uid  != $statb->uid  ||
      $stata->gid  != $statb->gid;

    # Report
    printf "%04o %s %s %s\t\t%04o %s %s %s\n",
      $stata->mode & 07777, basename($a),
        (getpwuid($stata->uid))[0], (getgrgid($stata->gid))[0],
      $statb->mode & 07777, basename($b),
        (getpwuid($statb->uid))[0], (getgrgid($statb->gid))[0];
  }
}, { ignore_cmp => 1 });

So this reports all entries that are different in content or permissions or ownership e.g. given a tree like this (slightly modified from Mike's example):

$ ls -lR scripts1 scripts2
scripts1:
total 28
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script1
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script1.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script3
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script3.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:49 script4
scripts2:
total 28
-rw-r--r-- 1 gavin users 0 Mar 17 16:41 script1
-rw-r--r-- 1 gavin users 0 Mar 17 16:41 script1.bak
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:41 script2.bak
-rwxr-xr-x 1 gavin gavin 0 Mar 17 16:41 script3*
-rwxr-xr-x 1 gavin gavin 0 Mar 17 16:41 script3.bak*
-rw-r--r-- 1 gavin gavin 0 Mar 17 16:49 script5

it will give output like the following:

$ ./pdiff2 scripts1 scripts2
0644 script1 gavin gavin                0644 script1 gavin users
0644 script1.bak gavin gavin            0644 script1.bak gavin users
0644 script3 gavin gavin                0755 script3 gavin gavin
0644 script3.bak gavin gavin            0755 script3.bak gavin gavin
Only in scripts1: script4
Only in scripts2: script5

This obviously has dependencies that Mike's version doesn't have, but it comes out much shorter and clearer, I think. It also doesn't fork and parse an external ls, so it should be more portable and less fragile. I should probably be caching the getpwuid lookups too, but that would have made it 5 lines longer. ;-)
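
For what it's worth, the caching mentioned above could look something like the following. This is a sketch only - the helper names and the fall-back to the numeric id are my own choices, and it isn't wired into the script above:

```perl
use strict;
use warnings;
use File::stat;

my (%user_cache, %group_cache);

# Look up a user name once per uid, falling back to the numeric id
sub user_name {
    my ($uid) = @_;
    return $user_cache{$uid} //= (getpwuid($uid))[0] // $uid;
}

# Same again for group names
sub group_name {
    my ($gid) = @_;
    return $group_cache{$gid} //= (getgrgid($gid))[0] // $gid;
}

# Example: owner and group of the current directory
my $st = stat('.') or die "can't stat '.': $!";
printf "%s %s\n", user_name($st->uid), group_name($st->gid);
```

In the report above you'd then swap the (getpwuid(...))[0] and (getgrgid(...))[0] expressions for user_name(...) and group_name(...) calls.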